3 Messages

 • 

90 Points

Thu, Nov 25, 2021 12:01 PM

In Progress

Downloadable datasets miss some data

I'm trying to use the datasets downloadable at https://www.imdb.com/interfaces/. The problem is that some person codes contained in title.principals.tsv.gz have no match in name.basics.tsv.gz, which is supposed to associate each code with a name. For example, code nm6592173 does not appear in name.basics.tsv.gz, but it is part of a crew:

❯ zgrep nm6592173 title.principals.tsv.gz
tt3826724 6 nm6592173 actor \N ["Soldado 3"]

Does anybody have a hint as to why this happens, or know whom I should contact to report the problem?

Champion

 • 

4.4K Messages

 • 

111.9K Points

6 minutes ago

Looks like nm6592173 has been merged into nm0739580, since 

https://www.imdb.com/name/nm0739580/

is what comes up when I put nm6592173 into the IMDb search bar.

Did you pull both files at roughly the same time?

3 Messages

 • 

90 Points

6 minutes ago

I downloaded them within seconds of each other a few hours ago, and they're supposed to be refreshed every day.

I downloaded both files again. There are 3199 identifiers in the same situation. I guess one has to live with the fact that these files are not 100% correct and implement workarounds.
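For anyone who wants to reproduce the count, here is a minimal sketch of the check, assuming both files keep their header row and the `nconst` column name used in the published datasets. The function names `read_column` and `missing_name_ids` are made up for this example, and holding every identifier in a set needs a fair amount of memory for the full name file.

```python
import csv
import gzip


def read_column(path, column):
    """Yield the values of one named column from a gzipped, tab-separated file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: fields such as ["Soldado 3"] contain literal double quotes.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row[column]


def missing_name_ids(principals_path, names_path):
    """Return the nconst values used in principals that have no row in names."""
    known = set(read_column(names_path, "nconst"))
    used = set(read_column(principals_path, "nconst"))
    return used - known
```

Calling `missing_name_ids("title.principals.tsv.gz", "name.basics.tsv.gz")` on files downloaded at the same time returns the set of orphaned codes, which can then be skipped or resolved by hand.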

Employee

 • 

13.4K Messages

 • 

271.2K Points

6 minutes ago

Hi vigna -

I will relay this issue to the appropriate tech team; however, I want to ensure I fully understand the datasets issue. As I understand it, name codes associated with pages that have been merged are not showing up in "name.basics.tsv.gz". Is that correct?

So, for example, as nm6592173 is tied to a merge, this code is not showing in "name.basics.tsv.gz", correct? Is the name code nm0739580 showing up?

3 Messages

 • 

90 Points

6 minutes ago

Yes. nm6592173 does not show up, but nm0739580 does.

In title.principals.tsv.gz there are about 7000 codes without an associated name. You can find them at http://vigna.di.unimi.it/missing.txt

There are also these three lines, which definitely seem to be a bug:

nm12078543 [link=nm0296484] \N \N  \N
nm12078544 [link=nm1070816] \N \N  \N
nm12078546 [link=nm11049918] \N \N  \N
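A possible workaround for rows like these, sketched under two assumptions that are not confirmed anywhere in this thread: that the lines come from name.basics.tsv.gz with the `[link=...]` text sitting in the name column, and that the marker points at the record the code was merged into. The helper `link_target` is a hypothetical name for this example.

```python
import re

# Matches a name field that consists only of a marker like [link=nm0296484].
LINK_PATTERN = re.compile(r"^\[link=(nm\d+)\]$")


def link_target(primary_name):
    """Return the linked code if the name field is a [link=...] marker, else None."""
    match = LINK_PATTERN.match(primary_name.strip())
    return match.group(1) if match else None
```

With this, `link_target("[link=nm0296484]")` gives `"nm0296484"`, while an ordinary name gives `None`, so such rows can be flagged or followed to the linked code instead of being treated as real names.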

Employee

 • 

13.4K Messages

 • 

271.2K Points

Hi @vigna -

My apologies for the delayed reply. I have filed a ticket for the appropriate team to review. As soon as I have an update, I will let you know here.