6 Messages
•
130 Points
title.basics.tsv.gz is broken - https://datasets.imdbws.com/
The title.basics.tsv.gz dataset at https://datasets.imdbws.com/ is broken.
It now includes only 3,477,496 titles. It should have almost three times that number.
The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". That title appears in the titleType column, and the tconst value for that record is "ial for George Floyd".
Could someone at IMDb please correct this?
Thank you!
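For anyone who wants to locate this kind of corruption in their own download, here is a minimal sketch. It assumes the file has 9 tab-separated columns with tconst first and that every tconst looks like "tt" followed by digits; the function names are made up for illustration. Quote characters are treated as ordinary text so that a double quote inside a title cannot merge several fields or lines while reading.

```python
import csv
import gzip
import re

# Assumption: title.basics has 9 tab-separated columns, with tconst first.
EXPECTED_COLUMNS = 9
TCONST_RE = re.compile(r"^tt\d+$")


def find_bad_rows(lines, expected_columns=EXPECTED_COLUMNS, limit=10):
    """Return (line_number, fields) for rows with a wrong field count
    or a tconst that does not look like 'tt' followed by digits."""
    # QUOTE_NONE makes the reader treat double quotes as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    bad = []
    for fields in reader:
        if len(fields) != expected_columns or not TCONST_RE.match(fields[0]):
            bad.append((reader.line_num, fields))
            if len(bad) >= limit:
                break
    return bad


def find_bad_rows_in_file(path, **kwargs):
    """Hypothetical convenience wrapper for a gzipped TSV file on disk."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return find_bad_rows(handle, **kwargs)
```

Calling find_bad_rows_in_file("title.basics.tsv.gz") would list the first few suspicious rows with their line numbers. If it finds nothing, the problem may lie in how the file is being parsed rather than in the file itself.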
hduston
3 Messages
•
80 Points
10 months ago
There are tconst entries that appear to be missing from title.basics.tsv. The entry tt0055928, "Dr. No", was missing for a few days, but is present once again. Here are some missing entries of which I am aware, but there could be others as well. These tconst values also have no entries in the title.crew.tsv, title.ratings.tsv, or title.episode.tsv files.
tt0562856
tt0562972
tt0811753
tt0811802
tt0811803
tt0811804
tt10756720
tt10927782
tt10927786
tt10927788
tt10927792
tt10927794
tt11252880
tt11301906
tt11669800
tt12919806
tt12919828
tt13056134
tt13056158
tt13286384
tt13286388
tt13422242
tt13422246
tt13431146
tt13675380
tt13675384
tt13825212
tt14739882
tt14883030
tt14921134
tt15130830
tt15588572
tt15806014
tt1747582
tt1752041
tt1771906
tt18231284
tt18258626
tt4462678
tt6579350
tt8084176
tt8893678
christian_sauve
2 Messages
•
70 Points
10 months ago
Sometime over the last two to three weeks (between the files downloaded on 2022-07-10 and 2022-07-24), it seems that the IMDb datasets available from https://datasets.imdbws.com/ stopped including some movies.
Download https://datasets.imdbws.com/title.basics.tsv.gz, for instance, and try to find the following IMDb IDs, which were there on July 10 but not on July 24:
tt0044502 Clash by Night (1952)
tt0047573 Them! (1954)
tt0048977 The Bad Seed (1956)
tt0050539 The Incredible Shrinking Man (1957)
tt0053290 Solomon and Sheba (1959)
tt0056700 The Wonderful World of the Brothers Grimm (1962)
tt0057449 The Raven (1963)
tt0060980 The Silencers (1966)
tt0065421 The AristoCats (1970)
The same IMDb IDs seem to have disappeared from https://datasets.imdbws.com/title.ratings.tsv.gz as well.
I re-downloaded the files on July 25 and got the same result: the entries are still missing.
What could explain this?
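For reference, the comparison between two downloads can be sketched as below. This is only an illustration: the function names and the two file paths are hypothetical, and it assumes the title ID is the first tab-separated column of each file.

```python
import csv
import gzip


def read_tconsts(path):
    """Collect the first column (assumed to be the title ID) of a gzipped TSV."""
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat double quotes in titles as ordinary characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header row
        for fields in reader:
            if fields:
                ids.add(fields[0])
    return ids


def missing_between(old_path, new_path):
    """Return the IDs present in the older snapshot but absent from the newer one."""
    return sorted(read_tconsts(old_path) - read_tconsts(new_path))
```

For example, missing_between("2022-07-10/title.basics.tsv.gz", "2022-07-24/title.basics.tsv.gz") would return the IDs that dropped out between the two dates, assuming both snapshots were kept under those hypothetical paths.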
johnny_m
6 Messages
•
130 Points
10 months ago
The source of the issue might be another record. See tconst = tt14491350. The genres field of that record contains the tconst value of another record.
Michelle
Employee
•
15.3K Messages
•
291K Points
10 months ago
Hi @johnny_m & @christian_sauve -
Thanks for reporting the issue with the datasets. I have filed a ticket for the appropriate team to investigate further. As soon as I have an update on the status I will relay that information here.
Thanks again!
Michelle
Employee
•
15.3K Messages
•
291K Points
10 months ago
Hi All -
I'm just following up here to confirm that the issue with the 'title.basics.tsv.gz' dataset should now be resolved and the titles should now be included.
Cheers!
Adren
6 Messages
•
134 Points
9 months ago
Here is some additional information that could help find the source of the problem, which still persists.
The missing IDs in the TSV files can be as old as the 1923 movie "The Hunchback of Notre Dame" (tt0014142) or François Truffaut's 1970 film "The Wild Child" (tt0064285), or as recent as "Feast" (tt13097910) and "Sinkhole" (tt21953638), both released in 2021.
There are also some TV shows in that list (e.g. "Norman", tt4191702).
What is peculiar is that some IDs can be found in title.akas and title.principals, or even only in name.basics, without appearing in the main title.basics.
After doing a crosscheck between what appears to be the main file, title.basics, and the 4 "sub-files" (name.basics, title.akas, title.crew and title.principals), there seem to be 8712 inconsistencies (mismatches) organized in 3 categories.
Here are some examples of the IDs found (top/bottom five for each group):
- IDs absent from title.basics but corresponding to an existing movie/series
total count = 2130
- IDs absent from title.basics but corresponding to an existing movie/series after being redirected (302)
total count = 540
- IDs found in one of the 4 "sub-files" but not in title.basics, and which return a 404 error on the imdb.com website
total count = 6042
I can provide the full list if need be.
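The crosscheck itself can be reproduced roughly as follows. This is a sketch and not the exact script used for the counts above. It assumes the title IDs live in the columns titleId (title.akas), tconst (title.crew, title.principals) and the comma-separated knownForTitles (name.basics), and the helper names are made up for illustration. It does not query imdb.com, so it only finds the orphan IDs and does not sort them into the existing/redirected/404 groups.

```python
import csv
import gzip

# Assumed layout: (file name, column holding title IDs, comma-separated list?)
SUB_FILES = [
    ("title.akas.tsv.gz", "titleId", False),
    ("title.crew.tsv.gz", "tconst", False),
    ("title.principals.tsv.gz", "tconst", False),
    ("name.basics.tsv.gz", "knownForTitles", True),
]


def read_rows(path):
    """Yield each row of a gzipped TSV as a dict keyed by the header names."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat double quotes as ordinary characters.
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def column_ids(rows, column, is_list=False):
    """Collect the title IDs found in one column, skipping empty and \\N values."""
    ids = set()
    for row in rows:
        value = row.get(column)
        if not value or value == r"\N":
            continue
        ids.update(value.split(",") if is_list else [value])
    return ids


def orphan_ids(basics_rows, sub_sources):
    """Map each ID missing from title.basics to the sub-files that mention it."""
    known = column_ids(basics_rows, "tconst")
    orphans = {}
    for name, rows, column, is_list in sub_sources:
        for tconst in column_ids(rows, column, is_list) - known:
            orphans.setdefault(tconst, []).append(name)
    return orphans
```

To run it on the real files, build sub_sources as [(name, read_rows(name), column, is_list) for name, column, is_list in SUB_FILES] and pass read_rows("title.basics.tsv.gz") as basics_rows. Everything is held in memory as sets of IDs, which is simple but needs a fair amount of RAM for the full datasets.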
Bethanny
Employee
•
2.4K Messages
•
25.5K Points
2 months ago
Hello everyone-
This has now been solved.
Cheers!