Vincent_Fournols's profile

Sat, Apr 6, 2019 4:14 PM

Incomplete datasets?

Hi,

I downloaded the IMDb datasets of 30 March 2019, timestamped around noon GMT.
I understand title.basics to be the backbone of the sets, listing all titles.
But I notice that the last entry in this version is tt9916896, although I am positive that IMDb crossed the 10,000,000 threshold in February, or maybe March.

Is IMDb aware of this?

Responses

Champion • 825 Messages • 46.1K Points

2 years ago

Hey, Vincent:

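Note that although the tconsts contain a numerical part, they are sorted alphabetically, so any eight-digit tconst beginning with tt1 sorts before tt9916896, which therefore appears last in the file.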
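To make the ordering point concrete, here is a minimal Python sketch. The helper name `tconst_number` and the sample list are my own for illustration (tt10000000 in particular is just a hypothetical eight-digit ID, not something I checked in the dataset); it only shows how a string sort differs from a numeric sort.

```python
def tconst_number(tconst: str) -> int:
    """Return the numeric part of a tconst such as 'tt9916896'."""
    return int(tconst[2:])

# Hypothetical sample mixing seven- and eight-digit IDs (illustration only).
sample = ["tt0000001", "tt10000000", "tt9916896", "tt10136648"]

# Plain string sort: compares character by character, so 'tt1...' < 'tt9...'.
alphabetical = sorted(sample)

# Numeric sort: compares the integer part instead.
numerical = sorted(sample, key=tconst_number)

print(alphabetical[-1])  # tt9916896
print(numerical[-1])     # tt10136648
```

So to find the highest tconst in title.basics you would take the maximum of the numeric part rather than the last row of the file.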




1.8K Messages • 55.3K Points

Thanks a lot

2.4K Messages • 81.1K Points

2 years ago

Silly of me, and now I think you have already pointed that out!
Nevertheless, it seems there is a big gap between 9,916,896 and 10,000,000
Anyway, many thanks ;)

Champion • 825 Messages • 46.1K Points

...there is a big gap between 9,916,896 and 10,000,000
You're right. Currently that's the largest gap (84,103 blank pages) between two tconsts.

As of this afternoon the highest tconst was tt10136648, whereas there were 5,769,583 unique titles, so around 43% of the tconsts are unoccupied. Well, that's only partially true, because some of them can actually redirect to a higher tconst after a merge, so they are not really "empty". However, since all titles added from some point a few years ago have even tt numbers (from tt2404814 onwards), most of the unoccupied tconsts are in fact blank title pages. Moreover, the most frequent distance between two consecutive tconsts is 2 (gap=1; 56.7%), while only 2,212,421 tconsts immediately follow the previous one (gap=0; 38.3%).
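For anyone who wants to reproduce this kind of count, a rough Python sketch follows. It is not the script I used; the function names and the tiny sample are hypothetical, and it assumes numbering starts at tt0000001, so the highest number equals the count of possible tconsts. A gap is defined as above: the number of unused numbers between two neighbouring tconsts.

```python
from collections import Counter

def gap_histogram(tconsts):
    """Count how often each gap (unused numbers between neighbours) occurs."""
    numbers = sorted({int(t[2:]) for t in tconsts})
    gaps = [hi - lo - 1 for lo, hi in zip(numbers, numbers[1:])]
    return Counter(gaps)

def unoccupied_share(tconsts):
    """Fraction of numbers up to the highest tconst that have no title.

    Assumes numbering starts at 1 (an assumption, not checked here).
    """
    numbers = {int(t[2:]) for t in tconsts}
    return 1 - len(numbers) / max(numbers)

# Hypothetical toy sample, not real dataset rows.
sample = ["tt0000001", "tt0000002", "tt0000004", "tt0000006", "tt0000010"]
histogram = gap_histogram(sample)
share = unoccupied_share(sample)

print(histogram)  # gap 1 occurs twice; gaps 0 and 3 occur once each
print(share)      # 0.5

# The same arithmetic with the figures quoted in this thread:
thread_share = 1 - 5_769_583 / 10_136_648
print(round(thread_share, 2))  # 0.43
```

On the real file you would feed in the tconst column of title.basics instead of the toy list.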

That also explains why gaps larger than 1 between tconsts (the result of delete/merge processes) are mostly odd numbers (even[hi]-even[lo]-1=odd), as can be seen in the following graph (note the log-10 scale for the Y-axis):
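The parity argument is easy to check for yourself. The snippet below is only an illustration: the range of 50 even numbers starting at tt2404814 is an arbitrary choice of mine, and `gap` is a hypothetical helper using the same definition as above.

```python
def gap(lo: int, hi: int) -> int:
    """Unused numbers strictly between two tconst numbers."""
    return hi - lo - 1

# Arbitrary run of even numbers starting at the first even-only tconst.
even_numbers = range(2404814, 2404914, 2)

# Every gap between two even numbers is odd: even - even - 1.
all_odd = all(
    gap(lo, hi) % 2 == 1
    for lo in even_numbers
    for hi in even_numbers
    if hi > lo
)
print(all_odd)  # True
```

An even gap in that range can therefore only appear when at least one of the two neighbouring tconsts is odd.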


295 Messages • 12.5K Points

The gap leading up to the switch to eight-digit numbers may be because the sequence that allocates numbers was updated manually, at a time when the relevant staff would be around to make sure there were no issues when the tconsts became longer. I suspect that the software team and the data editors had a carefully planned event where the sequence was updated and the flow of incoming titles was then watched very carefully to ensure that everything was working correctly.