jean_liu's profile

5 Messages

 • 

264 Points

Sunday, July 12th, 2020

Closed

Solved

IMDb title.basics.tsv.gz dataset missing genres

The IMDb title.basics.tsv.gz dataset is missing genres: the genres column shows only \N.
https://www.imdb.com/interfaces/

8.8K Messages

 • 

179.5K Points

5 years ago

  
JEAN LIU
Joined community on July 12, 2020 - new today
- - -  
Data missing from published IMDB database
https://getsatisfaction.com/imdb/topics/data-missing-from-published-imdb-database
  
Please note that regarding the database files published here:
https://datasets.imdbws.com/
The genre data is missing from the title.basics.tsv.gz file.
  
Philip Persson
Joined on July 11, 2020
Posted July 11 2020
.

3 Messages

 • 

192 Points

5 years ago

This reply was created from a merged topic originally titled Data missing from published IMDB database.

Please note that regarding the database files published here: https://datasets.imdbws.com/

The genre data is missing from the title.basics.tsv.gz file.

Employee

 • 

1.2K Messages

 • 

36.3K Points

5 years ago

Hi there,

Thanks for the post.

I've cut a ticket for the relevant team to look into this for you.

Thanks,

Joel 

Employee

 • 

1.2K Messages

 • 

36.3K Points

5 years ago

Hey again Jean,

Similar to the other conversation, I'd recommend downloading this file again and checking whether the problem is still present.

Sometimes there are problems with data generation, which is what you might be seeing.

Let me know!

Cheers,
Joel 

Employee

 • 

1.2K Messages

 • 

36.3K Points

Hey Phil,

Thanks for the post, and sorry to see the other thread was closed without an answer. I'm not sure why that happened, so I'll follow up with the agent.

I can understand that the repetitive answer about these datasets sounds fairly dismissive. I've passed your feedback on to the dataset team to look into.

Cheers,
Joel 

2 Messages

 • 

70 Points

Hi, I'd just like to point out that the basic dataset appears to be missing the genre data yet again. If anyone could figure out a permanent solution to this problem, that would be most appreciated. The basic datasets are limited enough as it is; not being able to access genre would really limit the kinds of interesting data we could generate.

Thanks in advance

2 Messages

 • 

70 Points

Sincere apologies, it appears that genre is unavailable only for certain movies/shows; I can find it for most entries just fine! :)
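For anyone who wants to check whether genres are missing everywhere or only for some titles, here is a minimal Python sketch. It assumes the file is tab-separated with a header row that includes a `genres` column and that `\N` marks an absent value, as in the rows quoted in this thread. The function name `genre_coverage` and the two-row sample are made up for illustration.

```python
import csv
import io

MISSING = r"\N"  # marker the IMDb TSV files use for an absent value

def genre_coverage(lines):
    """Return (rows_with_genres, rows_without_genres) for a title.basics-style TSV."""
    # QUOTE_NONE: treat quote characters inside titles as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    present = missing = 0
    for row in reader:
        if row["genres"] == MISSING:
            missing += 1
        else:
            present += 1
    return present, missing

# Tiny in-memory sample (assumed header, two rows quoted in this thread).
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt1092512\ttvEpisode\tImêji\tImêji\t0\t2007\t\\N\t24\tAnimation,Comedy\n"
    "tt10925130\ttvEpisode\tEpisode #1.10\tEpisode #1.10\t0\t2014\t\\N\t\\N\t\\N\n"
)
print(genre_coverage(sample))
```

To run it against a downloaded file, the same function should accept the handle returned by `gzip.open("title.basics.tsv.gz", "rt", encoding="utf-8", newline="")`. If the second number equals the total row count, the whole column is empty; if it is only a fraction, the gaps are per-title.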

5 Messages

 • 

264 Points

5 years ago

Hey Joel,

Thanks for the update and the new dataset looks good! From Phil's comments though it seems like a recurring problem that the IMDb team can perhaps pay more attention to.

Also, I hate to diverge from the topic, but while checking further into the data I noticed another interesting issue: many titles whose "tconst" should end in 9 appear to have that final "9" truncated.

Here are a few examples:
  • The correct tconst for Imeji should be tt10925129 not tt1092512, and
  • The correct tconst for Ikinari wa kawarenai should be tt10925119 not tt1092511

tt10925128	tvEpisode	Episode dated 5 September 2019	Episode dated 5 September 2019	0	2019	\N	\N	News,Reality-TV,Talk-Show
tt1092512	tvEpisode	Imêji	Imêji	0	2007	\N	24	Animation,Comedy
tt10925130	tvEpisode	Episode #1.10	Episode #1.10	0	2014	\N	\N	\N

tt10925118	tvEpisode	Episode #1.7	Episode #1.7	0	2014	\N	\N	\N
tt1092511	tvEpisode	Ikinari wa kawarenai	Ikinari wa kawarenai	0	2007	\N	24	Animation,Comedy
tt10925120	tvEpisode	Episode dated 12 May 2010	Episode dated 12 May 2010	0	2010	\N	\N	News


Thanks,
Jean

Employee

 • 

1.2K Messages

 • 

36.3K Points

Hey again Jean,

Thanks for getting back in touch.

I think this is a misunderstanding of how the data is presented.

If we look at Imeji on IMDb, you can see its tconst is actually tt1092512 and not tt10925129.

Similarly, if we look at Ikinari wa kawarenai on IMDb, the tconst is tt1092511 and not tt10925119.

I'm guessing this assumption is (understandably) because of its position in the file (i.e. between tt10925128 and tt10925130), but that's actually just because the consts are ordered as text, not as numbers.

I hope this helps!

Cheers,
Joel 

Champion

 • 

20.4K Messages

 • 

487.1K Points

Joel,

Put another way, the tconst values are character strings, not numbers. Character strings are sorted in alphabetic (i.e. lexicographic) order, not in numeric order.
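A small Python sketch of the difference, using the six ids quoted earlier in this thread. The helper `numeric_key` is a hypothetical name, and it assumes every id is the prefix "tt" followed only by digits. Python's default string comparison is one particular text ordering; the published file may be collated slightly differently, so the exact neighbours can vary, but the effect is the same: short ids land among longer ids that share their prefix.

```python
ids = [
    "tt10925128", "tt1092512", "tt10925130",
    "tt10925118", "tt1092511", "tt10925120",
]

def numeric_key(tconst):
    """Sort key: the integer after the 'tt' prefix (assumes 'tt' + digits)."""
    return int(tconst[2:])

# Text order: compared character by character, so the 9-character id
# tt1092512 ends up next to the 10-character ids beginning with tt1092512.
text_order = sorted(ids)

# Numeric order: the two shorter (smaller) ids come first.
numeric_order = sorted(ids, key=numeric_key)

print(text_order)
print(numeric_order)
```

If numeric order is needed for your own processing, sorting with a key like this after loading is likely simpler than relying on the order of rows in the file.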

5 Messages

 • 

264 Points

Got it, that makes sense. Thanks Joel and Dan! :)

3 Messages

 • 

192 Points

5 years ago

The genres are there now!  Thank you!  :-)