16 Messages
•
234 Points
Missing entries (tconst) in title.basics.tsv with regard to other files
When checking all the tconst values (tt....) found in the other files (akas, episodes and principals) against the main file (title.basics.tsv), there seem to be inconsistencies and missing movies or TV shows.
The parent series of some TV show episodes found in title.episodes.tsv (the parentTconst value) are absent from title.basics.tsv.
For instance, tt0086748 has 8 episodes in title.episodes.tsv:
+-----------+--------------+--------------+---------------+
| tconst | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt0630512 | tt0086748 | 1 | 6 |
| tt0630513 | tt0086748 | 1 | 8 |
| tt0630514 | tt0086748 | 1 | 7 |
| tt0630515 | tt0086748 | 1 | 4 |
| tt0630516 | tt0086748 | 1 | 2 |
| tt0630517 | tt0086748 | 1 | 5 |
| tt0630518 | tt0086748 | 1 | 1 |
| tt0630519 | tt0086748 | 1 | 3 |
+-----------+--------------+--------------+---------------+
but tt0086748 is not present in title.basics.tsv,
while the TV show exists on IMDb:
https://www.imdb.com/title/tt0086748/
What is even more surprising is that, in some cases, the parent TV show exists neither in the file nor on the website:
+-----------+--------------+--------------+---------------+
| tconst | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt7153894 | tt4824012 | 6 | 21 |
+-----------+--------------+--------------+---------------+
tt4824012 is also absent from title.basics.tsv and it returns a dead page,
but tt6781204 shows the episode with a different ID for the series (tt2912216 instead of tt4824012).
And for a more extreme case:
+-----------+--------------+--------------+---------------+
| tconst | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt4839832 | tt4820982 | 1 | 17 |
+-----------+--------------+--------------+---------------+
Here, both IDs (episode and parent series) are absent from title.basics.tsv and from the website:
tt4820982 -> 404 Not Found
tt4839832 -> 404 Not Found
In conclusion, from those 3 files:
- title.akas.tsv
- title.episodes.tsv
- title.principals.tsv
when we extract the titleId, parentTconst and tconst columns respectively and check them against the tconst column in title.basics.tsv,
there are 6220 missing entries:
1086 seem to be valid pages on the IMDb website (although 48 are redirects -> HTTP/1.1 308 Redirect),
but 5134 are dead pages (see tt4820982 above).
The files title.ratings and title.crew don't seem to have the same problem.
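The check described above can be sketched in Python as follows. This is a minimal illustration, not the method actually used for the counts in this post: the helper names `read_ids` and `missing_from_basics` are hypothetical, and the files are assumed to be already decompressed, tab-separated, with a header row and `\N` marking empty values.

```python
import csv


def read_ids(path, column):
    """Return the set of non-empty IDs found in one column of a TSV file."""
    ids = set()
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters inside titles as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            value = row[column]
            if value and value != r"\N":  # assumed null marker
                ids.add(value)
    return ids


def missing_from_basics(basics_path, sources):
    """IDs referenced in `sources` but absent from the tconst column of basics.

    `sources` is a list of (path, column) pairs, for example
    [("title.akas.tsv", "titleId"), ("title.episodes.tsv", "parentTconst")].
    """
    known = read_ids(basics_path, "tconst")
    referenced = set()
    for path, column in sources:
        referenced |= read_ids(path, column)
    return referenced - known
```

Calling `len(missing_from_basics("title.basics.tsv", [...]))` with the three (file, column) pairs listed above would give the number of missing entries for a given download.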
Bethanny
Employee
•
5.6K Messages
•
58.7K Points
2 years ago
Hi @Adren -
We have reported this to the team in charge and will give you an update as soon as we have one.
Cheers!
Adren
2 years ago
[Update 2022-09-29]
The column "knownForTitles" in the names file (name.basics.tsv) provides a large number of tconst values that should also be present in title.basics.tsv.
Unfortunately, a large number of these IDs are unavailable, such as this 1922 movie (Beyond the Rainbow)
https://www.imdb.com/title/tt0012937/
or the more recent TV show (Joe Scott - TMI)
https://www.imdb.com/title/tt21318080/
Both tconst are present only in name.basics.tsv and in no other file from the datasets downloaded last week (23 September 2022).
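Extracting the IDs from "knownForTitles" needs one extra step compared to the other files, because a single cell can hold several tconst. A minimal Python sketch, assuming the cell is a comma-separated list and that `\N` marks an empty value (the function name `known_for_ids` is hypothetical):

```python
import csv


def known_for_ids(name_basics_path):
    """Collect every tconst listed in the knownForTitles column."""
    ids = set()
    with open(name_basics_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            field = row["knownForTitles"]
            if not field or field == r"\N":  # assumed null marker
                continue
            # Assumed format: "tt0012937,tt0013000,..." in one cell.
            ids.update(part for part in field.split(",") if part)
    return ids
```

The resulting set can then be compared with the tconst column of title.basics.tsv with a plain set difference.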
In conclusion, I found 7895 unique tconst (tt...) IDs that are absent from the title.basics.tsv file while being present in one or more of these 4 other files (title.akas.tsv, title.episodes.tsv, title.principals.tsv, name.basics.tsv).
checked individually, the missing IDs from title.basics are distributed as follows:
When those 7895 tconst are checked on the IMDb website, here is the result:
- 1866 are existing films/series/... (returning an "HTTP/1.1 200 OK"), including 535 that are redirects to another page (308 redirect)
- but an astonishing number don't exist: 6029 return a "404 Not Found"
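A status check like the one summarized above could look like the Python sketch below. It is only an illustration under assumptions: the function names, the User-Agent string and the use of HEAD requests are mine, not the method used for the figures in this post, and the website may throttle or reject automated requests. Note that this sketch counts redirects in their own bucket, whereas the figures above include them among the existing pages.

```python
import http.client
from collections import Counter


def fetch_status(tconst, host="www.imdb.com", timeout=10):
    """Return the HTTP status for a title page without following redirects."""
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        # Hypothetical User-Agent value, chosen for illustration only.
        conn.request("HEAD", f"/title/{tconst}/",
                     headers={"User-Agent": "tconst-check-sketch"})
        return conn.getresponse().status
    finally:
        conn.close()


def tally_statuses(tconsts, fetch=fetch_status):
    """Count IDs per outcome: ok (200), redirect (3xx), not_found (404), other."""
    counts = Counter()
    for tconst in tconsts:
        status = fetch(tconst)
        if status == 200:
            counts["ok"] += 1
        elif 300 <= status < 400:
            counts["redirect"] += 1
        elif status == 404:
            counts["not_found"] += 1
        else:
            counts["other"] += 1
    return counts
```

The `fetch` parameter makes it possible to plug in a slower, rate-limited fetcher, or a fake one for testing, without touching the counting logic.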
(edited)
Bethanny
Employee
2 years ago
Hi @Adren -
This has now been fixed.
Cheers!
Adren
2 years ago
If you want to check how many tconst are missing from the title.basics file compared to title.akas, you can run the following commands on a Unix system (Linux or other) to get both lists of sorted IDs
and to compute the IDs that are only in akas but not in basics.
The result is that there are still 5303 tconst in the akas file that are not found in basics (files retrieved on 2023-05-05).
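The same idea (two sorted ID lists, then the IDs present only in the first) can be sketched in Python. This is not the set of shell commands originally posted, only an equivalent under assumptions: the function names are hypothetical and the files are assumed to be decompressed TSV files with a header row.

```python
import csv


def sorted_unique_ids(path, column):
    """Sorted, de-duplicated IDs from one column of a TSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return sorted({row[column] for row in reader})


def only_in_first(first_sorted, second_sorted):
    """Walk two sorted lists in step and keep items found only in the first."""
    result = []
    j = 0
    for item in first_sorted:
        # Advance in the second list until we reach or pass the current item.
        while j < len(second_sorted) and second_sorted[j] < item:
            j += 1
        if j == len(second_sorted) or second_sorted[j] != item:
            result.append(item)
    return result


def akas_not_in_basics(akas_path, basics_path):
    """IDs in the titleId column of akas that have no tconst row in basics."""
    akas_ids = sorted_unique_ids(akas_path, "titleId")
    basics_ids = sorted_unique_ids(basics_path, "tconst")
    return only_in_first(akas_ids, basics_ids)
```

`len(akas_not_in_basics("title.akas.tsv", "title.basics.tsv"))` then gives the count of IDs found only in akas for a given download.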
(edited)
Adren
8 months ago
Hello @Bethanny
I just wanted to let you know that, as of today, the inconsistency is no longer present in the files.
This is probably due to the change mentioned in the message/banner explaining that the "datasets are backed by a new data source as of March 18th, 2024".
You can close this ticket.
Adren
5 months ago
Hello @Bethanny
Unfortunately, I have to reopen this ticket, as there is again a large number of missing tconst in all the related (sub)files compared to the reference (title.basics.tsv).
Here is a detailed list of missing tconst in the following files
I checked the first 10 IDs in title.crew and they are mostly redirects.
Michelle
Employee
•
17.5K Messages
•
313.1K Points
22 days ago
Hi @Adren -
I wanted to follow up on this older thread to confirm whether you are still observing issues with missing data in the title.basics dataset.