johnny_m's profile

7 Messages

 • 

150 Points

Tuesday, July 26th, 2022 8:09 AM

title.basics.tsv.gz is broken - https://datasets.imdbws.com/

The title.basics.tsv.gz dataset is broken in https://datasets.imdbws.com/ .

It now includes only 3,477,496 titles. It should have almost three times that number.

The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". This value is found in titleType. The value in tconst for that record is "ial for George Floyd". 

Could someone at IMDb please correct this?

Thank you!
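
One way to locate the first damaged record is to scan the decompressed file for rows that break the expected shape. The sketch below assumes the nine tab-separated columns that title.basics.tsv is documented to have, and that every tconst is `tt` followed by digits. `find_malformed_rows` is a hypothetical helper name, not part of any IMDb tooling.

```python
import re

# Assumption: tconst, titleType, primaryTitle, originalTitle, isAdult,
# startYear, endYear, runtimeMinutes, genres
EXPECTED_COLUMNS = 9
TCONST_RE = re.compile(r"tt\d+")


def find_malformed_rows(lines, expected_columns=EXPECTED_COLUMNS):
    """Yield (line_number, reason) for each data row that looks corrupted.

    `lines` is any iterable of text lines; the first line is treated as the header.
    """
    for line_number, line in enumerate(lines, start=1):
        if line_number == 1:
            continue  # skip the header row
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != expected_columns:
            yield line_number, f"expected {expected_columns} fields, got {len(fields)}"
        elif not TCONST_RE.fullmatch(fields[0]):
            yield line_number, f"tconst looks wrong: {fields[0]!r}"
```

To run it against a real download, a handle such as `gzip.open("title.basics.tsv.gz", "rt", encoding="utf-8")` can be passed as `lines`. Splitting on tabs by hand avoids any quote handling, so a stray `"` inside a title cannot swallow the rows that follow it.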

10 Messages

 • 

230 Points

2 years ago

There are tconst entries that appear to be missing from title.basics.tsv. The entry tt0055928, "Dr. No", was missing for a few days but is present once again. Here are some missing entries of which I am aware, but there could be others as well. These tconst values also have no entries in the title.crew.tsv, title.ratings.tsv, or title.episode.tsv files.

 tt0562856
 tt0562972
 tt0811753
 tt0811802
 tt0811803
 tt0811804
 tt10756720
 tt10927782
 tt10927786
 tt10927788
 tt10927792
 tt10927794
 tt11252880
 tt11301906
 tt11669800
 tt12919806
 tt12919828
 tt13056134
 tt13056158
 tt13286384
 tt13286388
 tt13422242
 tt13422246
 tt13431146
 tt13675380
 tt13675384
 tt13825212
 tt14739882
 tt14883030
 tt14921134
 tt15130830
 tt15588572
 tt15806014
 tt1747582
 tt1752041
 tt1771906
 tt18231284
 tt18258626
 tt4462678
 tt6579350
 tt8084176
 tt8893678

Note: This comment was created from a merged conversation originally titled Some tconst titles missing from titles.basic.tsv

2 Messages

 • 

70 Points

2 years ago

Sometime over the last two to three weeks (between files downloaded on 2022-07-10 and 2022-07-24), the IMDb datasets available from https://datasets.imdbws.com/ seem to have stopped including some movies.

Download https://datasets.imdbws.com/title.basics.tsv.gz, for instance, and try to find the following IMDb IDs, which were there on July 10 but not on July 24:

tt0044502  Clash by Night (1952)
tt0047573  Them! (1954)
tt0048977  The Bad Seed (1956)
tt0050539  The Incredible Shrinking Man (1957)
tt0053290  Solomon and Sheba (1959)
tt0056700  The Wonderful World of the Brothers Grimm (1962)
tt0057449  The Raven (1963)
tt0060980  The Silencers (1966)
tt0065421  The AristoCats (1970)

The same IMDb IDs seem to have disappeared from https://datasets.imdbws.com/title.ratings.tsv.gz as well.

I re-downloaded the files on July 25 and the same titles were still missing.

What could explain this?
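
The comparison between the July 10 and July 24 downloads can be reproduced with a set difference over the first column. This is only a sketch: it assumes the ID sits in the first tab-separated column of each file and that the first line is a header. `read_ids` and `snapshot_diff` are hypothetical names.

```python
def read_ids(lines):
    """Collect first-column IDs from an iterable of TSV lines, skipping the header."""
    iterator = iter(lines)
    next(iterator, None)  # drop the header line
    return {line.split("\t", 1)[0].strip() for line in iterator if line.strip()}


def snapshot_diff(old_lines, new_lines):
    """Return (removed, added): IDs only in the old file, and IDs only in the new one."""
    old_ids = read_ids(old_lines)
    new_ids = read_ids(new_lines)
    return sorted(old_ids - new_ids), sorted(new_ids - old_ids)
```

Any ID in the first list was present in the older download and is gone from the newer one, which is the situation described for the nine titles above.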

Note: This comment was created from a merged conversation originally titled IMDB Datasets no longer including some movies?

7 Messages

 • 

150 Points

The dataset is broken. It now includes only 3,477,496 titles. It should have almost three times that number.

The data is corrupted after the title "Kneeling for Justice: A San Francisco Memorial for George Floyd". The value in tconst for that record is "ial for George Floyd".

Could someone at IMDb please correct this?

Thank you!

7 Messages

 • 

150 Points

2 years ago

The source of the issue might be another record. See tconst = tt14491350. The value in genre contains the value for tconst of another record.
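
A record like this could be found by looking for a tconst-like token inside the genres field. The sketch below assumes genres is the ninth column (index 8) and that no genuine genre name contains `tt` followed by seven or more digits. `rows_with_tconst_in_genres` is a hypothetical name.

```python
import re

# Assumption: a tconst is "tt" followed by at least seven digits.
TCONST_TOKEN = re.compile(r"tt\d{7,}")


def rows_with_tconst_in_genres(lines, genres_index=8):
    """Yield (tconst, genres) for rows whose genres field contains a tconst-like token."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > genres_index and TCONST_TOKEN.search(fields[genres_index]):
            yield fields[0], fields[genres_index]
```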

Employee

 • 

17K Messages

 • 

308.8K Points

2 years ago

Hi @johnny_m & @christian_sauve -

Thanks for reporting the issue with the datasets.  I have filed a ticket for the appropriate team to investigate further.  As soon as I have an update on the status I will relay that information here.

Thanks again!

Employee

 • 

17K Messages

 • 

308.8K Points

2 years ago

Hi All -

I'm just following up here to confirm that the issue with the 'title.basics.tsv.gz' dataset should now be resolved and the titles should now be included.

Cheers!

10 Messages

 • 

230 Points

@Michelle

I'm replying about a smaller issue that still exists with the IMDB datasets available from https://datasets.imdbws.com/.

Several titles remain missing from various files there.

Specifically:

tt8084176 -- "Mr. Robot"; Season 4, Episode 7; "407 Proxy Authentication Required" is available via the web UI, but is not present in any of title.basics.tsv, title.ratings.tsv, title.crew.tsv, or title.episode.tsv.

tt0562856, tt0811802, tt0811803, tt0811804, tt0562972, and tt0811753 -- "Doctor Who (1963)"; Season 19, Episodes 9-14 are available via the web UI, but are not present in any of title.basics.tsv, title.ratings.tsv, title.crew.tsv, or title.episode.tsv.

On Sat, Jul 23, 2022, I mentioned a longer list of titles that are/were missing in a post titled "Some tconst titles missing from titles.basic.tsv".

Thanks!

Employee

 • 

17K Messages

 • 

308.8K Points

Hi @hduston​ -

Thanks for these additional missing title reports. I have filed a new ticket for the appropriate team to investigate further.

15 Messages

 • 

224 Points

2 years ago

Here is some additional information that could help find the source of the problem, which still persists.

The missing IDs in the TSV files can be as old as the 1923 movie "The Hunchback of Notre Dame" (tt0014142) or François Truffaut's 1970 film "The Wild Child" (tt0064285), or as recent as "Feast" (tt13097910) or "Sinkhole" (tt21953638), both released in 2021.

There are also some TV shows in that list (e.g. "Norman", tt4191702).

What is peculiar is that some IDs can be found in title.akas and title.principals, or even only in name.basics, without appearing in the main title.basics.

After cross-checking what appears to be the main file, title.basics, against the 4 "sub-files" (name.basics, title.akas, title.crew and title.principals), there seem to be 8712 inconsistencies (mismatches), in 2 categories:

  • 2670 IMDb IDs (tconst) correspond to an existing movie/TV show/video/... on the website: 2130 of them land on a regular page (HTTP 200) and 540 are 302 redirects to another ID.
  • 6042 do not exist on the imdb.com website (Not Found / HTTP 404).


Here are some examples of the IDs found (top/bottom five for each group):

- IDs absent from title.basics that correspond to an existing movie/series

tt0012182
tt0012852
tt0012937
tt0013743
tt0014142
...
tt21953306
tt21953412
tt21953604
tt21953610
tt21953638

total count = 2130


- IDs absent from title.basics that correspond to an existing movie/series after being redirected (302)

tt0014327 -> tt0014325
tt0047941 -> tt0047940
tt0059860 -> tt0059845
tt0088641 -> tt0085111
tt0103358 -> tt0103357
...
tt21312196 -> tt14604694
tt21336160 -> tt7227442
tt21931516 -> tt21905038
tt21943026 -> tt21926422
tt21944362 -> tt14794336

total count = 540


- IDs found in one of the 4 "sub-files" but not in title.basics, and which return a 404 error on the imdb.com website

tt0021006
tt0021453
tt0023019
tt0024677
tt0036165
...
tt20412466
tt20877586
tt21327936
tt21809398
tt21952054

total count = 6042


I can provide the full list if need be.
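
The cross-check described in this post can be sketched roughly as follows. The post does not say how the IDs were extracted or how the HTTP status codes were collected, so both are treated here as inputs gathered separately. `orphan_ids`, `classify` and `tally` are hypothetical names.

```python
from collections import Counter


def orphan_ids(basics_ids, sub_file_ids):
    """Map each ID missing from title.basics to the sub-files that mention it.

    `basics_ids` is the set of tconst values in title.basics;
    `sub_file_ids` maps a file name to the tconst values found in that file.
    """
    known = set(basics_ids)
    orphans = {}
    for file_name, ids in sub_file_ids.items():
        for tconst in set(ids) - known:
            orphans.setdefault(tconst, []).append(file_name)
    return orphans


def classify(status_code):
    """Bucket an HTTP status code into the groups used in the post."""
    if status_code == 200:
        return "exists"
    if status_code == 302:
        return "redirect"
    if status_code == 404:
        return "not found"
    return "other"


def tally(status_by_id):
    """Count how many orphan IDs fall into each bucket."""
    return Counter(classify(code) for code in status_by_id.values())
```

The figures quoted above are consistent with this split: 2130 + 540 = 2670, and 2670 + 6042 = 8712.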

Employee

 • 

17K Messages

 • 

308.8K Points

Hi @Adren​ -

Thanks for this additional information. I have passed this along to the technical team!

Employee

 • 

5K Messages

 • 

53.3K Points

1 year ago

Hello everyone-

This has now been solved.

Cheers!

1 Message

 • 

60 Points

17 days ago

I think the problem has re-appeared: title.basics.tsv.gz downloaded July 9, 2024, does not seem to have Oppenheimer (tt15398776) for example, as well as other major movies.

15 Messages

 • 

224 Points

As of today (10 July 2024, with 4 files still dated 2024-07-09), the problem seems to be fixed: title.basics.tsv now has more than 109M lines.

But there are still some inconsistencies between the files, with tconst values that are missing from title.basics (see this other ticket).
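
A row count like the one quoted above can be checked without decompressing the file to disk. The sketch below assumes a gzip-compressed, UTF-8 TSV with a single header line. `count_data_rows` is a hypothetical name.

```python
import gzip


def count_data_rows(path):
    """Count rows in a gzip-compressed TSV, excluding the header line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        total = sum(1 for _ in handle)
    return max(total - 1, 0)
```

Comparing this number between two downloads of the same file is a quick first signal that titles have dropped out, before doing any ID-level comparison.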