stewt's profile
Employee

15 Messages

920 Points

Thursday, December 20th, 2018 12:36 PM

IMDb Data – Now easily available to contributors

Today (20 Dec 2018) we are pleased to announce that the IMDb datasets are easier to access and are now available directly from imdb.com. Using the new interface, contributors can bulk-access subsets of IMDb title and name data for personal and non-commercial use. Each dataset file is in gzipped, tab-separated-values (TSV) format.

To access the datasets and for more information you can go here: https://contribute.imdb.com/czone

Stewart

10.7K Messages

225.4K Points

6 years ago

Thanks!

10.7K Messages

225.4K Points

However, I do hope that the extended datasets can be made available to a slightly broader span of frequent contributors.

Employee

15 Messages

920 Points

6 years ago

Hello,

The Extended Datasets are available to those who have 1000+ approved contributions in the last 360 days; otherwise, the Basic Datasets are available with just one approved contribution in the last 60 days.

I hope this helps.
Stewart

2.4K Messages

81.2K Points

Thanks a lot Stewart.

85 Messages

1.9K Points

Thank you for the response!

85 Messages

1.9K Points

@stewt How do I see how many contributions I've made in the last 360 days?

Also, is a "contribution" each submission I make, even if there are multiple items being updated in the submission? Or is it each separate item?

Thank you.

10.7K Messages

225.4K Points

A contribution is each approved item within a submission.

85 Messages

1.9K Points

@jeorj_euler Thank you

7 Messages

362 Points

6 years ago

OK, it's nice to finally see that more data is available (to some people).
I see no way to qualify: I have a wide range of interests and I'm not from the industry, so getting 1000 updates in 360 days will never be possible. (I have made maybe 5 updates in the last 15 years, and for ~19 years I have reported several issues with broken or incomplete exports of the LIST files via the normal support pages.)

The Java Movie Database (JMDB) application...
What should qualify me to access the data is the fact that I'm the author of the Java Movie Database (JMDB), which has been available for 19 years (the first versions only to a limited set of beta testers): http://www.jmdb.de/
Contact details can be found on the above website.

I know of no other free application that has been processing the (public) IMDb data for that long. The only thing that comes close is the IMDbPY project.
While the application was originally created by two people, I have been the only one left for the past 15 years.

The JMDB application can import and process the old LIST files (which are still the basis for the search inside the application) as well as the TSV files (which are not yet used for searching inside the application).

The reason I didn't update the code to use the TSV content inside the application, beyond adding support for importing the data, is basically that the available content is so limited that it is actually useless.
Sorry, but there is no other (nicer) way to sum it up when it comes to the TSV file format.
This is what people already complained about in the original thread (https://getsatisfaction.com/imdb/topics/imdb-data-now-available-in-amazon-s3).

So I would really like to see what the "complete" data actually looks like, for the following reasons:
  1. Extend the TSV data import that JMDB supports to the full dataset (not limited to the "free" data available now, which is useless).
  2. Make the new TSV content searchable, if it's actually worth the effort. It is not worth it for details limited to 10 entries per movie (title.principals) and the top 4 movies/titles per name (name.basics), whereas the old LIST format offered the full list of people involved in a title.
  3. Maybe offer some functionality that lets the professional "contributors" use the application to send updates to the IMDb website (is an API available?).

As this has been under the radar here, some other uses of the IMDb data in the past...
I also want to share that the IMDb data has been, and still is (at least the frozen LIST files), used together with JMDB:
  1. by students and postgraduates (postdocs) at universities for data analysis, etc. (including in master's theses), as can be seen e.g. here (there are more, like TU Berlin, Germany): University of Oslo (UiO), Norway - https://www.uio.no/studier/emner/matnat/ifi/INF3100/v16/undervisningsmateriale/filmdatabasen-og-post... and
  2. in computer science lessons at schools, as can be seen here: https://www.swisseduc.ch/informatik/120-lektionen/principles/recollection/sql-imdb.html
The imported IMDb data has also been used to create covers or other internal notes on titles that have been recorded from TV.

Finally some technical issues with the format...
There are also some technical issues, starting with the fact that inside each of the *.tsv.gz archives the name of the file is "data.tsv".
Normally the name of the file inside the archive should equal the name of the compressed file minus the ".gz" extension (example: "title.crew.tsv.gz" --> "title.crew.tsv").
It has been broken since the beginning. If you extract all the files you have downloaded into the same directory, you end up with only one file: the one extracted from the "last" archive processed.
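
A possible workaround on the user side, as a minimal Python sketch (not part of JMDB; `extract_all` is a hypothetical helper name): take the output name from the archive's own file name instead of from the name stored inside the archive, so the extracted files no longer overwrite each other.

```python
import gzip
import shutil
from pathlib import Path


def extract_all(archive_dir, out_dir):
    """Decompress every *.tsv.gz in archive_dir into out_dir.

    The output name is derived from the archive's own file name
    (title.crew.tsv.gz -> title.crew.tsv), not from the name stored
    in the gzip header, so the results cannot collide on "data.tsv".
    Returns the list of file names that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(Path(archive_dir).glob("*.tsv.gz")):
        # Strip only the trailing ".gz" to get the target file name.
        target = out_path / archive.name[: -len(".gz")]
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(target.name)
    return written
```

For example, `extract_all("downloads", "extracted")` would place title.crew.tsv, name.basics.tsv and so on side by side; both directory names are placeholders.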

When the data is imported with JMDB from the compressed gzip files and you try to add foreign key constraints to the filled tables, it does not work in some cases. For example, title.crew.tsv contains person references that have no matching entry in name.basics.tsv, so the relational database tells you that your data is basically incomplete. There is more of this.
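
To see which references are affected before adding the constraints, the check can be sketched in Python roughly as follows. This is an illustration, not JMDB code: the helper names are hypothetical, and the column names (`nconst`, `tconst`, `directors`, `writers`) and the `\N` marker for missing values are assumptions about the file layout, so adjust them to the files you actually have.

```python
import csv
import gzip


def read_rows(path):
    """Yield one dict per data row of a gzipped TSV file with a header line."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        # QUOTE_NONE: treat quote characters in the data as plain text.
        yield from csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)


def dangling_person_refs(crew_path, names_path):
    """Map each person id referenced in the crew file but absent from the
    names file to the set of title ids that reference it."""
    # Assumed column name: "nconst" is the person id in name.basics.
    known = {row["nconst"] for row in read_rows(names_path)}
    missing = {}
    for row in read_rows(crew_path):
        # Assumed layout: comma-separated person ids per column.
        for column in ("directors", "writers"):
            value = row.get(column) or ""
            if value in ("", r"\N"):  # assumed marker for "no value"
                continue
            for person in value.split(","):
                if person not in known:
                    missing.setdefault(person, set()).add(row["tconst"])
    return missing
```

An empty result from `dangling_person_refs("title.crew.tsv.gz", "name.basics.tsv.gz")` would mean a foreign key from the crew table to the name table should hold; any entries are the rows that make the constraint fail.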

Kind regards,
Juergen Ulbts


1 Message

122 Points

Let me support Juergen's note.

I am an academic who has used his incredibly useful JMDB tools to manipulate and organize the IMDb LIST files as part of my research. His efforts have made using those IMDb flat files much easier for people like myself whose programming skills are sub-par, and as such I think they advance the mission and visibility of IMDb.

I am pretty sure the target audience of the contributor files would include individuals like myself who are interested in having comprehensive files (in my case for statistical analysis). I imagine industry folks and those more interested in following certain titles would use either the regular webpage or (in the case of industry folks) IMDb Pro.

So my hope is that these additional files are made available to volunteers like Juergen who provide the tools that many of us use to access the raw files. I think his software is a meaningful contribution to IMDb, and it would make sense, both from the perspective of advancing IMDb and as a reward for pro-social behavior, to expand the definition of "contributor" to include tool-making.

Best,
Koleman

17 Messages

852 Points

5 years ago

What's in the extended data? The stuff I used the most from the old files (connections and keywords) isn't in the basic data, making it a lot less useful to me. I'm a contributor, but not on a mass scale. But a large share of my contributions have been keywords, and I'm rather unsatisfied that I can't download that information to make checking things easier.

10.7K Messages

225.4K Points

The extended datasets include the following:
  • name
    • name.basics
    • name.filmography
    • name.jobs
    • name.quotes
    • name.trivia
  • title
    • title.akas
    • title.basics
    • title.cast
    • title.certificates
    • title.crew
    • title.episode
    • title.principals
    • title.ratings
    • title.releases
    • title.trivia
The data for connections and keywords, as near as I can tell, were taken away from everybody who is not IMDb staff.

10.7K Messages

225.4K Points

By the way, Phil G already provided this answer back in January of 2019. We would appreciate it if people would read the whole thread before commenting.