johnny_m's profile

6 Messages

 • 

130 Points

Tue, Jul 26, 2022 8:09 AM

In Progress

title.basics.tsv.gz is broken - https://datasets.imdbws.com/

The title.basics.tsv.gz dataset is broken in https://datasets.imdbws.com/ .

It now includes only 3,477,496 titles. It should have almost three times that number.

The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". That title text appears in the titleType column, and the value in tconst for that record is "ial for George Floyd".
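
For anyone who wants to locate the break themselves, here is a minimal sketch of a row sanity check. It assumes title.basics has 9 tab-separated columns and that every tconst is "tt" followed by digits; the function names are made up for illustration and are not part of any IMDb tooling.

```python
import gzip
import re
from typing import Iterable, Iterator, List, Tuple

# Assumption: a valid tconst is "tt" followed only by digits.
TCONST_RE = re.compile(r"^tt\d+$")
# Assumption: title.basics has 9 tab-separated columns.
EXPECTED_FIELDS = 9


def find_malformed_rows(
    lines: Iterable[str], expected_fields: int = EXPECTED_FIELDS
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, reason) for data rows that look corrupted."""
    for line_no, line in enumerate(lines, start=1):
        if line_no == 1:
            continue  # skip the header row
        fields = line.rstrip("\r\n").split("\t")
        if not TCONST_RE.match(fields[0]):
            yield line_no, f"bad tconst {fields[0]!r}"
        elif len(fields) != expected_fields:
            yield line_no, f"{len(fields)} fields"


def scan_file(path: str) -> List[Tuple[int, str]]:
    """Run the check over a gzipped TSV file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return list(find_malformed_rows(fh))
```

Calling scan_file("title.basics.tsv.gz") on a local download should list the first line numbers where the tconst or the column count goes wrong, which narrows down where the corruption starts.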

Could someone at IMDb please correct this?

Thank you!

3 Messages

 • 

80 Points

2 mo ago

There are tconst entries that appear to be missing from title.basics.tsv. The entry tt0055928, "Dr. No", was missing for a few days but is present once again. Here are some missing entries of which I am aware; there could be others as well. These tconst values also have no entries in the title.crew.tsv, title.ratings.tsv, or title.episode.tsv files.

 tt0562856
 tt0562972
 tt0811753
 tt0811802
 tt0811803
 tt0811804
 tt10756720
 tt10927782
 tt10927786
 tt10927788
 tt10927792
 tt10927794
 tt11252880
 tt11301906
 tt11669800
 tt12919806
 tt12919828
 tt13056134
 tt13056158
 tt13286384
 tt13286388
 tt13422242
 tt13422246
 tt13431146
 tt13675380
 tt13675384
 tt13825212
 tt14739882
 tt14883030
 tt14921134
 tt15130830
 tt15588572
 tt15806014
 tt1747582
 tt1752041
 tt1771906
 tt18231284
 tt18258626
 tt4462678
 tt6579350
 tt8084176
 tt8893678
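
A check like the one described above can be sketched as follows. This is only an illustration: it assumes the ID is the first column of the file being checked, and read_ids, missing_ids and load_ids are hypothetical helper names.

```python
import gzip
from typing import Iterable, List, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first-column value of every data row (header skipped)."""
    it = iter(lines)
    next(it, None)  # drop the header row
    return {line.split("\t", 1)[0].strip() for line in it if line.strip()}


def missing_ids(wanted: Iterable[str], present: Set[str]) -> List[str]:
    """Return the wanted IDs that do not occur in `present`, sorted."""
    return sorted(w for w in set(wanted) if w not in present)


def load_ids(path: str) -> Set[str]:
    """Read the IDs from a gzipped TSV file on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return read_ids(fh)
```

For example, missing_ids(["tt8084176", "tt0562856"], load_ids("title.basics.tsv.gz")) would return whichever of those IDs are absent from a local copy of the file.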

Note: This comment was created from a merged conversation originally titled Some tconst titles missing from titles.basic.tsv

2 Messages

 • 

70 Points

2 mo ago

Sometime over the last two to three weeks (between files downloaded on 2022-07-10 and 2022-07-24), the IMDb datasets available from https://datasets.imdbws.com/ seem to have stopped including some movies.

Download https://datasets.imdbws.com/title.basics.tsv.gz, for instance, and try to find the following IMDb IDs, which were there on July 10 but not on July 24:

tt0044502  Clash by Night (1952)
tt0047573  Them! (1954)
tt0048977  The Bad Seed (1956)
tt0050539  The Incredible Shrinking Man (1957)
tt0053290  Solomon and Sheba (1959)
tt0056700  The Wonderful World of the Brothers Grimm (1962)
tt0057449  The Raven (1963)
tt0060980  The Silencers (1966)
tt0065421  The AristoCats (1970)

The same IMDb IDs seem to have disappeared from https://datasets.imdbws.com/title.ratings.tsv.gz as well.

I re-downloaded the files on July 25 and got the same result: the entries are still missing.
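
A comparison between two downloads like the one described above could be sketched like this. It assumes both snapshots have tconst, primaryTitle and startYear columns, and it turns off quote handling because titles can contain literal double quotes; index_titles and dropped_titles are hypothetical names.

```python
import csv
from typing import Dict, Iterable


def index_titles(lines: Iterable[str]) -> Dict[str, str]:
    """Map tconst -> "primaryTitle (startYear)" for every row of a snapshot."""
    # QUOTE_NONE: treat double quotes as ordinary characters, not delimiters.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["tconst"]: f'{row["primaryTitle"]} ({row["startYear"]})'
        for row in reader
    }


def dropped_titles(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> Dict[str, str]:
    """Return the titles present in the old snapshot but not in the new one."""
    old = index_titles(old_lines)
    new = index_titles(new_lines)
    return {tconst: label for tconst, label in old.items() if tconst not in new}
```

Both arguments can be file objects opened with gzip.open(path, "rt", encoding="utf-8"). Holding both indexes in memory is simple but memory-hungry for files with millions of rows, so this is a starting point rather than a tuned tool.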

What could explain this?

Note: This comment was created from a merged conversation originally titled IMDB Datasets no longer including some movies?

6 Messages

 • 

130 Points

The dataset is broken. It now includes only 3,477,496 titles. It should have almost three times that number.

The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". The value in tconst for that record is "ial for George Floyd".

Could someone at IMDb please correct this?

Thank you!

6 Messages

 • 

130 Points

2 mo ago

The source of the issue might be another record. See tconst = tt14491350. The value in genre contains the value for tconst of another record.
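
One rough way to hunt for records like that is to look for tconst-shaped text in any column other than the first. The sketch below is a heuristic only: it assumes a tconst has at least 7 digits, and a legitimate title that happens to contain such a string would be flagged too. The function name is hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a tconst is "tt" followed by at least 7 digits.
STRAY_TCONST_RE = re.compile(r"tt\d{7,}")


def rows_with_stray_tconst(
    lines: Iterable[str],
) -> Iterator[Tuple[int, str, int]]:
    """Yield (line_number, tconst, column_index) for suspicious rows.

    column_index is the 0-based index of the first non-ID column whose
    value contains something that looks like a tconst.
    """
    for line_no, line in enumerate(lines, start=1):
        if line_no == 1:
            continue  # skip the header row
        fields = line.rstrip("\r\n").split("\t")
        for idx, value in enumerate(fields[1:], start=1):
            if STRAY_TCONST_RE.search(value):
                yield line_no, fields[0], idx
                break  # one report per row is enough
```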

Employee

 • 

14.4K Messages

 • 

281K Points

2 mo ago

Hi @johnny_m & @christian_sauve -

Thanks for reporting the issue with the datasets.  I have filed a ticket for the appropriate team to investigate further.  As soon as I have an update on the status I will relay that information here.

Thanks again!

Employee

 • 

14.4K Messages

 • 

281K Points

2 mo ago

Hi All -

I'm just following up here to confirm that the issue with the 'title.basics.tsv.gz' dataset should now be resolved and the titles should now be included.

Cheers!

3 Messages

 • 

80 Points

@Michelle

I'm replying about a smaller issue that still exists with the IMDB datasets available from https://datasets.imdbws.com/.

Several titles remain missing from various files there.

Specifically:

tt8084176 -- "Mr. Robot"; Season 4, Episode 7; "407 Proxy Authentication Required" is available via the web UI, but is not present in any of title.basics.tsv, title.ratings.tsv, title.crew.tsv, or title.episode.tsv.

tt0562856, tt0811802, tt0811803, tt0811804, tt0562972, and tt0811753 -- "Doctor Who (1963)"; Season 19, Episodes 9-14 are available via the web UI, but are not present in any of title.basics.tsv, title.ratings.tsv, title.crew.tsv, or title.episode.tsv.

On Sat, Jul 23, 2022, I mentioned a longer list of titles that are/were missing in a post titled "Some tconst titles missing from titles.basic.tsv".

Thanks!

Employee

 • 

14.4K Messages

 • 

281K Points

Hi @hduston -

Thanks for these additional reports of missing titles. I have filed a new ticket for the appropriate team to investigate further.

3 Messages

 • 

100 Points

20 d ago

Here is some additional information that could help find the source of the problem, which still persists.

The missing IDs in the TSV files range from titles as old as the 1923 movie "The Hunchback of Notre Dame" (tt0014142) and François Truffaut's 1970 film "The Wild Child" (tt0064285) to more recent ones such as "Feast" (tt13097910) and "Sinkhole" (tt21953638), both released in 2021.

There are also some TV shows in that list (e.g. "Norman", tt4191702).

What is peculiar is that some IDs can be found in title.akas and title.principals, or even only in name.basics, without appearing in the main title.basics file.

After cross-checking the main title.basics file against the 4 "sub-files" (name.basics, title.akas, title.crew and title.principals), there appear to be 8712 inconsistencies (mismatches), falling into 2 categories:

  • 2670 IMDb IDs (tconst) correspond to an existing movie/TV show/video/... on the website: 2130 of them land on a regular page (HTTP 200), and 540 are 302 redirects to another ID.
  • 6042 do not exist on the imdb.com website (Not Found / HTTP 404).
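
A cross-check of this kind could be sketched roughly as below. The column names are assumptions about the file layouts (titleId in title.akas, tconst in title.crew and title.principals, a comma-separated knownForTitles in name.basics), and the helper names are hypothetical.

```python
import csv
from typing import Iterable, Set

# Assumed column holding title IDs in each "sub-file".
REFERENCE_COLUMNS = {
    "title.akas.tsv.gz": "titleId",
    "title.crew.tsv.gz": "tconst",
    "title.principals.tsv.gz": "tconst",
    "name.basics.tsv.gz": "knownForTitles",  # comma-separated list
}


def referenced_ids(lines: Iterable[str], column: str) -> Set[str]:
    """Collect every title ID mentioned in `column` of a TSV file."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    ids: Set[str] = set()
    for row in reader:
        value = row.get(column) or ""
        for part in value.split(","):
            # r"\N" is the placeholder the datasets use for missing values.
            if part and part != r"\N":
                ids.add(part)
    return ids


def orphans(basics_ids: Set[str], referenced: Set[str]) -> Set[str]:
    """IDs referenced by a sub-file but absent from title.basics."""
    return referenced - basics_ids
```

Taking the union of orphans(...) over the four files gives the set of mismatched IDs, which can then be checked against the website.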


Here are some examples of the IDs found (top/bottom five for each group):

- IDs absent from title.basics but corresponding to an existing movie/series

tt0012182
tt0012852
tt0012937
tt0013743
tt0014142
...
tt21953306
tt21953412
tt21953604
tt21953610
tt21953638

total count = 2130


- IDs absent from title.basics but corresponding to an existing movie/series after being redirected (302)

tt0014327 -> tt0014325
tt0047941 -> tt0047940
tt0059860 -> tt0059845
tt0088641 -> tt0085111
tt0103358 -> tt0103357
...
tt21312196 -> tt14604694
tt21336160 -> tt7227442
tt21931516 -> tt21905038
tt21943026 -> tt21926422
tt21944362 -> tt14794336

total count = 540


- IDs found in one of the 4 "sub-files" but not in title.basics, and which return a 404 error on the imdb.com website

tt0021006
tt0021453
tt0023019
tt0024677
tt0036165
...
tt20412466
tt20877586
tt21327936
tt21809398
tt21952054

total count = 6042
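
The three groups above correspond to HTTP status codes, and the bucketing could be sketched as follows. The /title/<tconst>/ URL path, the User-Agent string and the function names are assumptions for illustration. The site may reject or throttle automated requests, so check its terms of use before running fetch_status at scale.

```python
import http.client
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def classify(status: int) -> str:
    """Bucket an HTTP status code into the groups used in the lists above."""
    if status == 200:
        return "exists"
    if status in (301, 302, 303, 307, 308):
        return "redirect"
    if status == 404:
        return "not_found"
    return "other"


def fetch_status(
    tconst: str, host: str = "www.imdb.com"
) -> Tuple[int, Optional[str]]:
    """Return (status, Location header) for a title page.

    http.client does not follow redirects, so a 302 is reported as such.
    Assumption: title pages live at /title/<tconst>/.
    """
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        conn.request(
            "HEAD",
            f"/title/{tconst}/",
            headers={"User-Agent": "dataset-crosscheck-sketch"},
        )
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def tally(statuses: Iterable[int]) -> Dict[str, int]:
    """Count how many status codes fall into each bucket."""
    return dict(Counter(classify(s) for s in statuses))
```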


I can provide the full list if need be.


Employee

 • 

14.4K Messages

 • 

281K Points

Hi @Adren -

Thanks for this additional information. I have passed it along to the technical team!