Adren's profile

16 Messages

234 Points

Saturday, September 24th, 2022 4:48 PM

In Progress

Missing entries (tconst) in title.basics.tsv with regard to other files

When checking all the tconst (tt...) found in the other files (akas, episode and principals) against the main file (title.basics.tsv), there seem to be inconsistencies and missing movies or TV shows.

The parent series of some TV show episodes found in title.episode.tsv (the parentTconst value) are absent from title.basics.tsv.
For instance, tt0086748 has 8 episodes in title.episode.tsv:

+-----------+--------------+--------------+---------------+
| tconst    | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt0630512 | tt0086748    | 1            | 6             |
| tt0630513 | tt0086748    | 1            | 8             |
| tt0630514 | tt0086748    | 1            | 7             |
| tt0630515 | tt0086748    | 1            | 4             |
| tt0630516 | tt0086748    | 1            | 2             |
| tt0630517 | tt0086748    | 1            | 5             |
| tt0630518 | tt0086748    | 1            | 1             |
| tt0630519 | tt0086748    | 1            | 3             |
+-----------+--------------+--------------+---------------+

but tt0086748 is not present in title.basics.tsv,

while the TV show exists on IMDb:
https://www.imdb.com/title/tt0086748/

What is even more surprising is that, in some cases, the parent TV show exists neither in the file nor on the website:

+-----------+--------------+--------------+---------------+
| tconst    | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt7153894 | tt4824012    | 6            | 21            |
+-----------+--------------+--------------+---------------+


tt4824012 is also absent from title.basics.tsv and its URL returns a dead page,
but tt6781204 shows the episode with a different ID for the series (tt2912216 instead of tt4824012)


And for a more extreme case:

+-----------+--------------+--------------+---------------+
| tconst    | parentTconst | seasonNumber | episodeNumber |
+-----------+--------------+--------------+---------------+
| tt4839832 | tt4820982    | 1            | 17            |
+-----------+--------------+--------------+---------------+


Here, both IDs (episode and parent series) are absent from title.basics.tsv and from the website:
tt4820982 -> 404 Not Found
tt4839832 -> 404 Not Found



In conclusion, from these 3 files:
- title.akas.tsv
- title.episode.tsv
- title.principals.tsv
when we extract the titleId, parentTconst and tconst respectively and check them against the tconst in title.basics.tsv,
there are 6220 missing entries:
- 1086 seem to be valid pages on the IMDb website (although 48 are redirections -> HTTP/1.1 308 Redirect)
- 5134 are dead pages (see tt4820982 above)

The files title.ratings and title.crew don't seem to have the same problem.
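
A minimal sketch of the comparison summarized above, for anyone who wants to reproduce it. It assumes the uncompressed TSV files are available locally and that the ID columns sit where the headers quoted in this thread put them (titleId in column 1 of title.akas.tsv, parentTconst in column 2 of title.episode.tsv, tconst in column 1 of title.principals.tsv and title.basics.tsv). The function names are made up for illustration.

```shell
# Sketch only: helper names and file locations are assumptions.

# Print the sorted, unique tt IDs found in column $2 of TSV file $1.
# cut splits on tabs by default; the grep drops the header line and
# any value that is not an ID (for example "\N").
ids_in_column() {
    cut -f "$2" "$1" | grep '^tt' | LC_ALL=C sort -u
}

# Print the IDs from column $2 of file $1 that do not appear in
# column 1 of the reference file $3 (title.basics.tsv).
missing_from_basics() {
    ref=$(mktemp) && other=$(mktemp) || return 1
    ids_in_column "$3" 1 > "$ref"
    ids_in_column "$1" "$2" > "$other"
    # comm -13 keeps only the lines unique to the second file.
    LC_ALL=C comm -13 "$ref" "$other"
    rm -f "$ref" "$other"
}
```

For example, `missing_from_basics title.episode.tsv 2 title.basics.tsv | wc -l` would count the parent series that have no entry in title.basics.tsv. Forcing `LC_ALL=C` on both `sort` and `comm` keeps the two tools in agreement about the ordering.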

Employee

5.6K Messages

58.7K Points

2 years ago

Hi @Adren -

We have reported this to the team in charge and will give you an update as soon as we have one.

Cheers!

2 years ago

[Update 2022-09-29]

The "knownForTitles" column of the names file (name.basics.tsv) provides a large number of tconst that should also be present in title.basics.tsv.

Unfortunately, a large number of these IDs are missing from it, such as this 1922 movie (Beyond the Rainbow)
https://www.imdb.com/title/tt0012937/
or this more recent TV show (Joe Scott - TMI)
https://www.imdb.com/title/tt21318080/

Both tconst are present only in name.basics.tsv:

+------------+----------------+-----------------------------------------+
| nconst     | primaryName    | knownForTitles                          |
+------------+----------------+-----------------------------------------+
| nm0150767  | F. Champury    | tt0005330,tt0012937,tt0011624           |
| nm0604384  | Harry T. Morey | tt0010289,tt0012444,tt0185913,tt0012937 |
| nm12640960 | Joe Scott      | tt14797924,tt21318080                   |
+------------+----------------+-----------------------------------------+

and in no other file from the datasets downloaded last week (23 September 2022).

In conclusion, I found 7895 unique IMDb tconst (tt...) that are absent from title.basics.tsv while present in one or more of these 4 other files (title.akas.tsv, title.episode.tsv, title.principals.tsv, name.basics.tsv).
Checked individually, the IDs missing from title.basics are distributed per file as follows (an ID can appear in several files, so the counts add up to more than 7895):

name.basics       2578
title.akas        5741
title.crew        0
title.episodes    27
title.principals  928
title.ratings     0
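
Since knownForTitles holds a comma-separated list rather than a single ID, it has to be split before it can be compared with title.basics. A rough sketch of that step, assuming the file still has a header row naming the column; the function name is hypothetical.

```shell
# Sketch only: known_for_ids is an illustrative name.
# Print the sorted, unique tconst listed in the knownForTitles column
# of the name.basics.tsv file given as $1. The column is located by
# its header name, and "\N" placeholders are skipped.
known_for_ids() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "knownForTitles") col = i
            next
        }
        col && $col != "\\N" {
            n = split($col, ids, ",")
            for (j = 1; j <= n; j++) print ids[j]
        }
    ' "$1" | LC_ALL=C sort -u
}
```

The output is one tconst per line, sorted in the C locale, so it can be compared against a sorted list of title.basics IDs with `comm`.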


And when those 7895 tconst are checked on the IMDb website, here is the result:
- 1866 are existing films/series/... (the request returns "HTTP/1.1 200 OK"), including 535 that are redirections to another page (308 redirect)
- but an astonishing number do not exist: 6029 return "404 Not Found"
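
A sketch of how such a status check could be scripted with curl. The category names and both function names are invented for illustration, and the website may respond differently to scripted clients than to a browser, so the codes seen this way are only indicative.

```shell
# Sketch only: function and category names are assumptions.

# Map an HTTP status code to the categories used in this thread.
classify_status() {
    case "$1" in
        200) echo "exists" ;;
        301|302|307|308) echo "redirect" ;;
        404) echo "dead" ;;
        *) echo "other" ;;
    esac
}

# Request the title page for the tconst given as $1 without following
# redirects, then print "<tconst> <code> <category>".
# Needs curl and network access.
title_status() {
    code=$(curl -s -o /dev/null -w '%{http_code}' "https://www.imdb.com/title/$1/")
    echo "$1 $code $(classify_status "$code")"
}
```

A loop such as `while read -r id; do title_status "$id"; sleep 1; done < missing_ids.txt` (missing_ids.txt being a hypothetical list with one tconst per line) would produce one line per ID; the pause is there to avoid hammering the site.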

(edited)

Employee

2 years ago

Hi @Adren -

This has now been fixed.

Cheers!

Hello @Bethanny​ 

Unfortunately, with the files retrieved yesterday (2023-04-26 @ 15:15), there are still thousands of tconst found in both title.akas.tsv and name.basics.tsv that are not present in title.basics.tsv.

for instance:

tt0021006
tt0021453
tt0023019
tt0024677
tt0036165
tt0038098
tt0046142
tt0052041
tt0052206
...
tt7766088
tt7779806
tt7829938
tt7869672
tt7892078
tt7978886
tt8206494
tt8466868
tt8982514
tt9496006

These tconst from akas are not in the main title.basics file.

If I look more closely at the first example (tt0021006),

here is the line in title.akas.tsv:

tt0021006 1       Ja, der Himmel über Wien        AT      \N      \N      \N      \N

but there is no corresponding web page

https://www.imdb.com/title/tt0021006/ (404 not found)

on the other hand there is a movie with this particular title for Austria

https://www.imdb.com/title/tt0023449/releaseinfo/#akas

so it should be tt0023449 instead of tt0021006
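
To see which IDs the akas file itself associates with a given title and region, a lookup along these lines could be used. It is only a sketch: the function name is hypothetical, and it assumes the column order titleId, ordering, title, region seen in the line quoted above.

```shell
# Sketch only: akas_ids_for_title is an illustrative name.
# Print the unique titleId values in the akas file $1 whose title
# (column 3) equals $2 and whose region (column 4) equals $3.
# Note: awk -v interprets backslash escapes in the values, so a title
# containing a backslash would need extra care.
akas_ids_for_title() {
    awk -F'\t' -v title="$2" -v region="$3" \
        '$3 == title && $4 == region { print $1 }' "$1" | LC_ALL=C sort -u
}
```

For instance, `akas_ids_for_title title.akas.tsv 'Ja, der Himmel über Wien' AT` would list every titleId carrying that title for Austria in the downloaded snapshot.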

To conclude, there are still 5201 tconst in akas and 2537 in names that are missing from title.basics.

I hope this will help

Employee

@Adren​ Hi!

Sorry to hear that. I have let the team in charge know about this.

Thanks!

2 years ago

If you want to check how many tconst from title.akas are missing in the title.basics file, you can run the following commands on a Unix system (Linux or similar) to get both lists of sorted IDs:

cut -d$'\t' -f1 datasets.imdbws.com/title.basics.tsv |sort -u > /tmp/basics_tconst.tsv
cut -d$'\t' -f1 datasets.imdbws.com/title.akas.tsv |sort -u > /tmp/akas_tconst.tsv

and to compute the IDs only in akas but not in basics

comm -13 /tmp/basics_tconst.tsv /tmp/akas_tconst.tsv > /tmp/only_in_akas.tsv

The result is that there are still 5303 tconst in the akas file that are not found in basics (files retrieved on 2023-05-05), as counted by:

grep -c ^tt /tmp/only_in_akas.tsv

(edited)

8 months ago

Hello @Bethanny 

I just wanted to let you know that, as of today, the inconsistency is no longer present in the files.

This is probably due to the change mentioned in the banner explaining that the "datasets are backed by a new data source as of March 18th, 2024".

You can close this ticket.

5 months ago

Hello @Bethanny 

Unfortunately, I have to reopen this ticket, as there is again a large number of tconst in the related (sub)files that are missing from the reference (title.basics.tsv).

Here is the number of tconst missing from title.basics.tsv for each file:

name.basics       236
title.akas        1854
title.crew        189045
title.episode     1512 (tconst) / 10 (parentTconst)
title.principals  1413
title.ratings     0

I checked the first 10 IDs from title.crew and they are mostly redirects.

As of today (11th of July 2024), the inconsistencies in title.akas and title.episode are fixed, but title.crew still contains nearly 190k tconst missing from title.basics.

Here is the update:

name.basics       314
title.akas        0
title.crew        189820
title.episode     0 (tconst) / 0 (parentTconst)
title.principals  1310
title.ratings     0

Here are the first tconst (ordered by number) in title.crew that are not present in title.basics:

┌───────────┬─────────────────────┬───────────┐ 
│  tconst   │      directors      │  writers  │
├───────────┼─────────────────────┼───────────┤
│ tt0000021 │ nm0525910           │           │
│ tt0000136 │ nm0525910           │           │
│ tt0000311 │                     │           │
│ tt0000600 │ nm0488932           │ nm0241414 │
│ tt0000635 │ nm0085865,nm0448682 │ nm0000636 │
└───────────┴─────────────────────┴───────────┘

(all of them are redirections to another page)
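
To list whole rows like the ones above rather than bare IDs, an awk anti-join is one option. This is a sketch under the assumption that both files have a header row and the tconst in column 1; the function name is made up, and holding every title.basics ID in memory takes a fair amount of RAM on the full dataset.

```shell
# Sketch only: rows_missing_from_basics is an illustrative name.
# Print the rows of TSV file $2 whose first column does not appear in
# the first column of the reference file $1 (title.basics.tsv).
# Header rows of both files are skipped; output keeps the order of $2.
rows_missing_from_basics() {
    awk -F'\t' '
        NR == FNR { if (FNR > 1) known[$1] = 1; next }
        FNR > 1 && !($1 in known)
    ' "$1" "$2"
}
```

Something like `rows_missing_from_basics title.basics.tsv title.crew.tsv | LC_ALL=C sort | head -5` would then show the first few missing rows with their directors and writers columns intact.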

(edited)

Employee

17.5K Messages

313.1K Points

22 days ago

Hi @Adren -

I wanted to follow up on this older thread: are you still observing issues with missing data in the title.basics dataset?

Hi @Michelle​ 

Only the issue with title.principals has been corrected, some weeks ago.

As for the other problems, there are still 192570 tconst in title.crew.tsv today that are not present in the title.basics.tsv file:

┌────────────────────────────────────────┬───────────────────────────────┐
│                 URL_tt                 │           directors           │
├────────────────────────────────────────┼───────────────────────────────┤
│ https://www.imdb.com/title/tt0000021/  │ nm0525910                     │
│ https://www.imdb.com/title/tt0000136/  │ nm0525910                     │
│ https://www.imdb.com/title/tt0000311/  │                               │
│ https://www.imdb.com/title/tt0000600/  │ nm0488932                     │
│ https://www.imdb.com/title/tt0000635/  │ nm0085865,nm0448682           │
│ https://www.imdb.com/title/tt0000702/  │ nm0159015                     │
│ https://www.imdb.com/title/tt0000710/  │ nm0085865,nm0710362           │
│ https://www.imdb.com/title/tt0000735/  │ nm0143333,nm0892614           │
│ https://www.imdb.com/title/tt0000937/  │ nm0000428                     │
│ https://www.imdb.com/title/tt0000973/  │ nm0000428                     │
│ https://www.imdb.com/title/tt0001433/  │ nm0085865                     │
│ https://www.imdb.com/title/tt0001651/  │ nm0159015                     │
│ https://www.imdb.com/title/tt0001745/  │ nm0048864                     │
│ https://www.imdb.com/title/tt0001938/  │ nm0000428                     │
│ https://www.imdb.com/title/tt0001953/  │ nm0135052                     │
│ https://www.imdb.com/title/tt0001958/  │ nm0309163                     │
│ https://www.imdb.com/title/tt0001991/  │ nm0300487                     │
│ https://www.imdb.com/title/tt0002032/  │ nm0085865,nm0448682,nm0949648 │
│ https://www.imdb.com/title/tt0002275/  │ nm0408436                     │
│ https://www.imdb.com/title/tt0002957/  │ nm0102643                     │
│                   ·                    │     ·                         │
│                   ·                    │     ·                         │
│                   ·                    │     ·                         │
│ https://www.imdb.com/title/tt34232952/ │ nm16643084,nm6354377          │
│ https://www.imdb.com/title/tt34235957/ │                               │
│ https://www.imdb.com/title/tt34241255/ │                               │
│ https://www.imdb.com/title/tt34241258/ │                               │
│ https://www.imdb.com/title/tt34241261/ │                               │
│ https://www.imdb.com/title/tt34241262/ │                               │
│ https://www.imdb.com/title/tt34241268/ │                               │
│ https://www.imdb.com/title/tt34259430/ │ nm2980216                     │
│ https://www.imdb.com/title/tt34267400/ │                               │
│ https://www.imdb.com/title/tt34280336/ │                               │
│ https://www.imdb.com/title/tt34281236/ │ nm9742632                     │
│ https://www.imdb.com/title/tt34281469/ │ nm0333132,nm6091692           │
│ https://www.imdb.com/title/tt34284619/ │                               │
│ https://www.imdb.com/title/tt34286337/ │ nm1414582                     │
│ https://www.imdb.com/title/tt34316127/ │ nm10245362                    │
│ https://www.imdb.com/title/tt34316310/ │ nm15705740,nm15384155         │
│ https://www.imdb.com/title/tt34322279/ │ nm0591101                     │
│ https://www.imdb.com/title/tt34338732/ │                               │
│ https://www.imdb.com/title/tt34340098/ │ nm15281673                    │
│ https://www.imdb.com/title/tt34376056/ │                               │
├────────────────────────────────────────┴───────────────────────────────┤
│ 192570 rows (40 shown)                                       2 columns │
└────────────────────────────────────────────────────────────────────────┘

After a quick check, all the IDs tested appear to be redirects.

Here are the first four tconst:


https://www.imdb.com/title/tt0000021/ -> https://www.imdb.com/title/tt0000013/
https://www.imdb.com/title/tt0000136/ -> https://www.imdb.com/title/tt0000014/
https://www.imdb.com/title/tt0000311/ -> https://www.imdb.com/title/tt0000265/
https://www.imdb.com/title/tt0000600/ -> https://www.imdb.com/title/tt0000583/

and the last four, in ID order:
https://www.imdb.com/title/tt34322279/ -> https://www.imdb.com/title/tt33550053/
https://www.imdb.com/title/tt34338732/ -> https://www.imdb.com/title/tt28255955/
https://www.imdb.com/title/tt34340098/ -> https://www.imdb.com/title/tt34326340/
https://www.imdb.com/title/tt34376056/ -> https://www.imdb.com/title/tt34376050/

I haven't had the time to check them all systematically, but I'm sure that over the past months the number of tconst from title.crew missing in title.basics has always been above 190k, with most if not all of them redirecting to another page (HTTP code 308 / Permanent Redirect).
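
A systematic check of the redirect targets could look roughly like this. It is a sketch that relies on curl's `%{redirect_url}` write-out variable, which reports where a redirect would lead when redirects are not followed; the function name is hypothetical, and the site may treat scripted requests differently from a browser.

```shell
# Sketch only: redirect_target is an illustrative name.
# Print "<tconst> -> <target URL>" for the tconst given as $1, or
# "<tconst> -> (no redirect)" when the response carries no redirect.
# Needs curl and network access.
redirect_target() {
    target=$(curl -s -o /dev/null -w '%{redirect_url}' "https://www.imdb.com/title/$1/")
    if [ -n "$target" ]; then
        echo "$1 -> $target"
    else
        echo "$1 -> (no redirect)"
    fi
}
```

Run over a list of missing IDs with a short pause between requests, this would give a mapping from each old tconst to the one it now points to.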


Another remaining inconsistency, found in the name.basics.tsv file, is that 14 of the tconst in the knownForTitles column do not match any tconst in title.basics:

https://www.imdb.com/title/tt4864946/ -> 404 (page not found)
https://www.imdb.com/title/tt8170096/ -> the page exists on www.imdb.com, but the tconst is missing from title.basics
https://www.imdb.com/title/tt9174576/ -> redirects to https://www.imdb.com/title/tt8694398/
https://www.imdb.com/title/tt9745406/ -> 404 (page not found)
https://www.imdb.com/title/tt11127492/ -> 404 (page not found)
https://www.imdb.com/title/tt11670206/ -> 404 (page not found)
https://www.imdb.com/title/tt14598938/ -> the page exists on www.imdb.com, but the tconst is missing from title.basics
https://www.imdb.com/title/tt22183320/ -> 404 (page not found)
https://www.imdb.com/title/tt27243323/ -> 404 (page not found)
https://www.imdb.com/title/tt29332867/ -> 404 (page not found)
https://www.imdb.com/title/tt29332868/ -> 404 (page not found)
https://www.imdb.com/title/tt31012569/ -> 404 (page not found)
https://www.imdb.com/title/tt32339746/ -> 404 (page not found)

NB
The files on which those numbers are calculated are all dated 2024-10-30, with title.basics.tsv having 11201578 lines (11201577 unique tconst).
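
The one-line difference between the two figures would be consistent with a single header row. A small sketch for recomputing both numbers on a given snapshot, assuming the tconst is in column 1; the function name is invented for illustration.

```shell
# Sketch only: count_lines_and_ids is an illustrative name.
# Print the total number of lines in TSV file $1 and the number of
# unique tt IDs in its first column. The arithmetic expansion strips
# the padding that some wc implementations add around the count.
count_lines_and_ids() {
    lines=$(( $(wc -l < "$1") ))
    ids=$(( $(cut -f 1 "$1" | grep '^tt' | LC_ALL=C sort -u | wc -l) ))
    echo "lines=$lines unique_tconst=$ids"
}
```

Calling `count_lines_and_ids title.basics.tsv` prints both counts on one line, which makes it easy to note them next to the date of the download.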

Thank you very much for following up on this issue.

(edited)

Employee

Hi @Adren​ -

Thanks for providing these additional details and confirming the existing outstanding issues.  I filed a new ticket (#P169319761) for the applicable tech team to review.  I will be sure to update you here on the progress.