Please note that regarding the database files published here: https://datasets.imdbws.com/ The genre data is missing from the title.basics.tsv.gz file.
Philip Persson Joined on July 11, 2020 Posted July 11, 2020.
Thanks for the update and the new dataset looks good! From Phil's comments, though, it seems like a recurring problem that the IMDb team could perhaps pay more attention to.
Also, I hate to diverge from the topic, but on checking further into the data I noticed another interesting issue: many titles whose "tconst" ends in 9 appear to have that final "9" truncated.
Here are a few examples:
The correct tconst for Imeji should be tt10925129, not tt1092512, and
the correct tconst for Ikinari wa kawarenai should be tt10925119, not tt1092511.
I think this is a misunderstanding of how the data is presented.
If we look at Imeji on IMDb, we can see its tconst is actually tt1092512 and not tt10925129.
Similarly, if we look at Ikinari wa kawarenai on IMDb, the tconst is tt1092511 and not tt10925119.
I'm guessing this assumption is (understandably) because of the row's position in the file (i.e. between tt10925128 and tt10925130), but that is just because the tconsts are ordered as text, not as numbers.
Put another way, the tconst values are character strings, not numbers. Sorted as strings, they come out in alphabetic (i.e. lexicographic) order, not numeric order, so a 7-digit ID ends up next to 8-digit IDs that share its leading digits.
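To make the difference concrete, here is a minimal Python sketch using only the IDs quoted in this thread. The helper name numeric_key is made up for illustration, and plain code-point string comparison is assumed. The published file may use a different collation (it lists tt1092512 after tt10925128, while plain string comparison puts it before), so this shows the general effect rather than the exact order in the file.

```python
# IDs copied from the rows quoted in this thread.
tconsts = [
    "tt10925130", "tt1092512", "tt10925128",
    "tt1092511", "tt10925118", "tt10925120",
]


def numeric_key(tconst: str) -> int:
    """Hypothetical helper: drop the 'tt' prefix and compare the rest as an integer."""
    return int(tconst[2:])


# Sorting as text compares character by character, so the 7-digit IDs
# land among the 8-digit IDs that share their leading digits.
as_text = sorted(tconsts)

# Sorting by numeric value puts the 7-digit IDs first.
as_numbers = sorted(tconsts, key=numeric_key)

print("text order:   ", as_text)
print("numeric order:", as_numbers)
```

In the text ordering, tt1092512 sits right beside tt10925120 and tt10925128, which is why it can look like tt10925129 with a digit cut off.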
ACT_1
8.8K Messages
•
179.5K Points
5 years ago
JEAN LIU
Joined community on July 12, 2020 - new today
- - -
0
philip_persson_9klwyrk6z2308
3 Messages
•
192 Points
5 years ago
Please note that regarding the database files published here: https://datasets.imdbws.com/
The genre data is missing from the title.basics.tsv.gz file.
0
Joel
Employee
•
1.2K Messages
•
36.3K Points
5 years ago
Thanks for the post.
I've cut a ticket to the relevant team to look into this for you.
Thanks,
Joel
0
Joel
Employee
•
1.2K Messages
•
36.3K Points
5 years ago
Similar to the other conversation, I'd recommend giving this file another download and taking a look to see whether the problems are still present.
Sometimes there are problems with data generation, which is what you might be seeing.
Let me know!
Cheers,
Joel
4
jean_liu
5 Messages
•
264 Points
5 years ago
Thanks for the update and the new dataset looks good! From Phil's comments, though, it seems like a recurring problem that the IMDb team could perhaps pay more attention to.
Also, I hate to diverge from the topic, but on checking further into the data I noticed another interesting issue: many titles whose "tconst" ends in 9 appear to have that final "9" truncated.
Here are a few examples:
tt10925128 | tvEpisode | Episode dated 5 September 2019 | Episode dated 5 September 2019 | 0 | 2019 | \N | \N | News,Reality-TV,Talk-Show
tt1092512 | tvEpisode | Imêji | Imêji | 0 | 2007 | \N | 24 | Animation,Comedy
tt10925130 | tvEpisode | Episode #1.10 | Episode #1.10 | 0 | 2014 | \N | \N | \N
tt10925118 | tvEpisode | Episode #1.7 | Episode #1.7 | 0 | 2014 | \N | \N | \N
tt1092511 | tvEpisode | Ikinari wa kawarenai | Ikinari wa kawarenai | 0 | 2007 | \N | 24 | Animation,Comedy
tt10925120 | tvEpisode | Episode dated 12 May 2010 | Episode dated 12 May 2010 | 0 | 2010 | \N | \N | News
Thanks,
Jean
4
philip_persson_9klwyrk6z2308
3 Messages
•
192 Points
5 years ago
0