
Thu, May 29, 2014 9:28 PM

API/Bulk Data Access

Hi!

We’re in the process of reviewing how we make our data available to the outside world with the goal of making it easier for anyone to innovate and answer interesting questions with the data. If you use our current ftp solution to get data [http://www.imdb.com/interfaces] or are thinking about it, we’d love to get your feedback on the current process for accessing data and what we could do to make it easier for you to use in the future. We have some specific questions below, but would be just as happy hearing about how you access and use IMDb data to make a better overall experience.

1. What works/doesn’t work for you with the current model?
2. Do you access entertainment data from other sources in addition to IMDb?
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
5. Does how you plan on using the data impact how you want to have it delivered?
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
7. Are our T&Cs easy for you to understand and follow?


Thanks for your time and feedback!

Regards,

Aaron
IMDb.com

Responses

2 Messages

 • 

72 Points

6 years ago

1. I REALLY like the ability to download flat files - they're a totally manageable size, so people should be able to build their own databases and so on if they need features you don't provide.

I agree with the comments to move toward some kind of standard (CSV, JSON, XML, whatever)... the files are tricky to parse in their current formats.

2. Yes and no... I've linked with Rotten Tomatoes data before, and I'm currently looking into linking with MovieLens.

3. I guess it depends on what "primary key" refers to (movies? users? actors? roles? all of the above?)... for my purposes, I'd probably just generate my own IDs anyway, so I don't think it would matter too much. But I'm sure there is some use case where they could be a big help.

4. An API would certainly be nice, but I don't think it's really that necessary - the files aren't that big, so people should be able to build their own query tools. And I'd definitely not want an API if it meant we could no longer download flat files.

5. Definitely! This is a standard rule across all data analysis - your tasks guide the structure you choose. Even with an API, people are probably going to end up doing a lot of custom reshaping anyway.

6. JSON is absolutely sufficient.

7. This page is pretty straightforward: http://www.imdb.com/help/show_leaf?usedatasoftware

Maybe some wording could be clarified, or maybe examples would help. E.g. does "individual personal use" mean that I can use IMDB as a test dataset locally on my machine for a research project, as long as I don't expose the IMDB data to the public? What if my research project is commercial - but I'm selling a system, not the data, and only using the data to demonstrate the system?

This is a really crazy corner case, but information about how to cite IMDB in academic publications would also be useful.

4 Messages

 • 

122 Points

1. I find document formats such as JSON or XML a poor fit for really large data sizes, assuming the whole file would be one big document.

However, I could go with a format where each line is a separate JSON object: actor, movie, role, etc.

That way it would be easy to parse, you could even do it in parallel, and you could handle it with Unix command-line tools (a parsing sketch is at the end of this post).

3. IDs (and/or primary keys) are a must for linking objects of different types to each other. However, I assume IMDB left these out of the free data on purpose so as not to make the data "too useful". Having the correct IDs would allow you to link your custom app to the IMDB website, for example.

---

I don't think the file sizes as they are now are problematic, and storing the data gzipped is fine too, since you can really easily unpack it just by adding a GzipInputStream (in Java) into the mix. I hope it's as easy in other languages as well.

Having big files would also enable shipping updates as .diff files, so that one wouldn't have to download the big files all over again each time there are changes.
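
To make the line-per-object idea concrete, here is a minimal sketch in Python rather than Java; the file name movies.jsonl.gz and the "title"/"year" fields are purely hypothetical, just to show how little code streaming such a file needs:

import gzip
import json

# Stream a hypothetical gzipped, newline-delimited JSON file (one object per line)
# without ever loading the whole thing into memory.
def iter_records(path):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# "movies.jsonl.gz" and the "year"/"title" fields are made up for the example.
for record in iter_records("movies.jsonl.gz"):
    if record.get("year") == 1994:
        print(record.get("title"))

Because each line is independent, you can split the file and run the same loop in parallel, or pre-filter with zcat/grep before the JSON parsing even starts.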

2 Messages

 • 

72 Points

Yeah, JSON is trickier to treat as a stream than CSV... but generally we really only care about that if stuff is "really big"... and IMDB isn't "really big." Put another way, if IMDB doesn't have difficulty dumping it as JSON, we're not going to have much trouble parsing it as JSON.

I'm not saying IDs are a bad idea, of course - they would probably help with a lot of things. Including URLs as IDs in the data dump, for example, could really help when it comes to linking external sources (I know at least MovieLens already uses IMDB URLs).

All I meant to say was that IDs aren't important if they're going to take time and effort - if they're easy, then, yes, of course, include them! But if they have to go through and figure out things like "Duchovny, David" == "David Duchovny" in order to give us good IDs, I'd rather they just gave us the messy data (sans IDs) in nicer formats (CSV, JSON, whatever) sooner rather than later.
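
To illustrate what I mean by "messy", here is a tiny sketch of that kind of name cleanup; the "Last, First" form and the optional Roman-numeral suffix handling are assumptions on my part, not a description of how IMDB actually builds its IDs:

import re

# Normalize "Duchovny, David" (optionally with a "(I)"-style suffix) to "David Duchovny"
# so records from different sources can at least be matched on the same string key.
def normalize_name(raw):
    name = re.sub(r"\s*\([IVXLC]+\)\s*$", "", raw.strip())  # drop a Roman-numeral suffix, if any
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return name

assert normalize_name("Duchovny, David") == "David Duchovny"
assert normalize_name("Duchovny, David (I)") == "David Duchovny"

It works until two different people share a name - which is exactly the problem real IDs would solve, and exactly the work I wouldn't want the nicer formats to wait for.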

2 Messages

 • 

72 Points

JSON combined with a NoSQL database makes it far easier.
You both should try it: instead of parsing, you "just" load the data and it's ready to analyze and use. These unstructured, complex-looking record sets are hellish in something as strict as an SQL server - every change in the record layout costs you a hard, time-consuming migration.

I'm not saying an RDBMS is worse or better than any NoSQL DB, nor the other way around. I use both a lot - whatever makes my life easier; that's what automating crap is for, right? :)
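
A minimal sketch of that "just load it" workflow, assuming a hypothetical newline-delimited JSON export (titles.jsonl) and a local MongoDB instance; the field names are made up:

import json
from pymongo import MongoClient

# Load a newline-delimited JSON dump straight into MongoDB,
# then query it without defining any schema up front.
client = MongoClient("mongodb://localhost:27017")
titles = client["imdb"]["titles"]

with open("titles.jsonl", encoding="utf-8") as fh:
    titles.insert_many(json.loads(line) for line in fh if line.strip())

# "year", "title" and "genres" are guesses at the export layout, not real field names.
for doc in titles.find({"year": 1994}).limit(5):
    print(doc.get("title"), doc.get("genres"))

If the record layout changes next month, the loader doesn't care - only the queries that touch the changed fields do.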

Champion

 • 

1.9K Messages

 • 

92.6K Points

6 years ago

3. IDs (and/or primary keys) are a must for linking objects of different types to each other. However, I assume IMDB left these out of the free data on purpose so as not to make the data "too useful". Having the correct IDs would allow you to link your custom app to the IMDB website, for example.
Just a historical note.

At the time the lists were created, IMDb did not actually have ID constants; for many years the Primary Name and Primary Title were used as the identifiers for People and projects respectively. This is where the use of Roman numerals began, as it is essential that each primary key be unique.

The various key constants are a relatively recent addition, and IMDb has not modified the lists to include them (or any other additions).

1 Message

 • 

60 Points

5 years ago

1. There is no coherent link across the files. For example, each entity has an identifier - movies have ttXXXX - but if I look at countries.list, the movies there have no ID, so one has to match by name, which is inefficient (see the matching sketch at the end of this reply).

2. Not yet, but I might need to unless IMDB can provide an easier API and syndicated content.

3. I don't think a single data set is the answer, though maintaining IDs across data sets would be useful. It would be good to have a more API-like model.

4. A more queryable API that provides access to the data would definitely be more useful.

5. Yes.

6. JSON is sufficient.

7. No.
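
Here is a rough sketch of the name-matching workaround from point 1, assuming the files have already been parsed into lists of dicts (the "title" and "country" keys are my own placeholders); with real IDs shared across the files this lookup table wouldn't be needed:

# Join two already-parsed datasets on the title string - the only shared key in the
# current dumps. "movies" and "countries" are whatever your .list parser produces.
def attach_countries(movies, countries):
    by_title = {m["title"]: m for m in movies}      # one pass to build a title index
    for row in countries:
        movie = by_title.get(row["title"])          # exact-string match; fragile without real IDs
        if movie is not None:
            movie.setdefault("countries", []).append(row["country"])
    return movies

movies = [{"title": "Sister Act (1992)"}]
countries = [{"title": "Sister Act (1992)", "country": "USA"}]
print(attach_countries(movies, countries))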

2 Messages

 • 

130 Points

5 years ago

OK, I finally decided to roll my sleeves up and do it myself. This script takes some or all of the .list.gz files and converts them to JSON:

https://github.com/oxplot/imdb2json

1 Message

 • 

60 Points

5 years ago

1. What works/doesn’t work for you with the current model?
Getting started from scratch is the painful part - downloading all the necessary files (which can be quite large), processing them, then importing them into a DB or a large data structure like a dataframe before you can even start querying.

2. Do you access entertainment data from other sources in addition to IMDb?
Yes, I've also used OMDb.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
It could be useful in that the processing and importation steps I mentioned above would be minimized/removed altogether, but I can imagine the downloads would be huge, and depending on how it's structured you may be getting tables that you don't want/need.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Absolutely. Depending on the project I'm working on, I may only need a rather limited set of data per query, and a well-structured and flexible API would be fantastic. The rate cap would need to be reasonable (perhaps a free tier at X requests/sec or whatever, then paid tiers above that), and one would need unlimited access to the data (i.e., if a query returns 5000 results, have the ability to acquire all 5000, perhaps 100 or so at a time - see the pagination sketch at the end of this reply).

Personally, I'm much more comfortable working with JSON (preferred) or XML than I am with SQL, so an API would be greatly beneficial to me. On the downside, it would make things like natural language analysis more difficult if the current model were to be discontinued, so I think there's definitely a market for both means of access.

5. Does how you plan on using the data impact how you want to have it delivered?
Yes, see above. I can imagine use cases where an API would be much more effective, and others where having the full dataset immediately accessible would be beneficial.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
Ideally, the requester should be able to specify which format they'd like the data to be delivered in. It should definitely be RESTful so developers don't need to completely rewrite existing code designed for other services.

7. Are our T&Cs easy for you to understand and follow?
As far as I've experienced, yes.
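
As a sketch of the pagination I have in mind for point 4, against a purely hypothetical endpoint (the URL, the "q"/"offset"/"limit" parameters, and the "results"/"total" response fields are all assumptions, not an existing IMDb API):

import requests

# Page through a query result 100 records at a time until everything has been fetched.
def fetch_all(query, page_size=100):
    results, offset = [], 0
    while True:
        resp = requests.get(
            "https://api.example.com/titles",          # hypothetical endpoint
            params={"q": query, "offset": offset, "limit": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json()
        results.extend(page["results"])
        offset += page_size
        if offset >= page["total"]:
            return results

A free tier could simply rate-limit calls to something like this, while paid tiers raise the cap.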

1 Message

 • 

60 Points

5 years ago

A straightforward way of getting the data into SQL Server would be great. It would make it much easier to write complex queries against the database.
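
A hand-rolled version, for comparison, looks roughly like the sketch below; the connection string, driver name, and table layout are placeholder assumptions, and the rows are assumed to have been parsed out of the .list files already:

import pyodbc

# Bulk-insert already-parsed rows into SQL Server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=imdb;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.execute(
    "IF OBJECT_ID('titles') IS NULL "
    "CREATE TABLE titles (title NVARCHAR(400), release_year INT)"
)

rows = [("The Shawshank Redemption", 1994)]   # placeholder for rows parsed from the files
cur.fast_executemany = True                   # noticeably speeds up bulk inserts in pyodbc
cur.executemany("INSERT INTO titles (title, release_year) VALUES (?, ?)", rows)
conn.commit()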

2 Messages

 • 

72 Points

5 years ago

If you would use a NoSQL database instead of insisting on SQL servers, you wouldn't have such a hard time.
This data is awesome and is made for MongoDB or the like.

An API would be sweet as an extra tool, but I'd be OK with direct access to your mongod, with read-only auth :) Just saying, hehe :)

1 Message

 • 

62 Points

5 years ago

I'm just starting to look at the situation - we haven't used IMDB prior to this. We're using Drupal, for which there are some IMDB-related modules, but I haven't tested them. Since I don't have enough information yet to answer your questions reliably, I'll first describe how I would like to incorporate IMDB data into our existing movie page listings (see https://integratedspaceanalytics.com/cms/movies).

First, we are only interested in movies related to space, space exploration, etc. These might be fiction, documentary, and so on. We already have a substantial database of relevant titles, with pictures and summary information, along with user-provided data. I would like to add selected data from IMDB (not sure what yet). The IMDB section of the page would be linked directly to the IMDB page, so people can get further information if they want.

Now, to the questions:

1. What works/doesn’t work for you with the current model?
I don't have any answer for this yet.

2. Do you access entertainment data from other sources in addition to IMDb?
Our existing database has been generated internally, and from our users with some data manually collected from Wikipedia.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
I think we could work either way. Pulling the data once per day puts less load on our servers as well as yours.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
We haven't used the existing FTP data yet.

5. Does how you plan on using the data impact how you want to have it delivered?
Not at this time.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
I believe that either JSON or TTL ('turtle' RDF) would be OK.

7. Are our T&Cs easy for you to understand and follow?
I haven't read them yet! :D My expectation would be that in addition to the visible citation to IMDB, we would certainly intend to link to the IMDB site, either to the relevant page in IMDB or, if you prefer, to the main IMDB page. We strongly believe in accurate, reliable reference information, including the date of retrieval. We generally also cache data we collect, to maintain referential integrity and in case a remote service is unavailable. We would assume/hope that we could continue to publish that cached data under the same constraints as the original. We certainly appreciate the hard work that IMDB has put into supplying the data, and want to ensure that our users are aware of our sources.

If desired, and if you do make an API, we would also consider supporting a return data channel, for things such as potential corrections to your data, reviews, or votes.

3 Messages

 • 

142 Points

4 years ago

Hello!

Thanks for the initiative.

I can see that some of the requests have focused on the particular format of your data. I suppose this has its importance, but I don't care as long as it's machine-readable. Coming from a somewhat small non-US country (Denmark), what I do care about is the availability of data related to non-US usage.

For example, your FTP archive does contain some foreign-language AKA titles, but I could only find German and Italian ones.  Meanwhile, the standard IMDb web interface shows AKA titles for lots of countries.

In general, my request is to include more foreign data in the public archives. :)

--
Niels

Champion

 • 

825 Messages

 • 

46.1K Points

Hi, Niels:

Ignore the files german-aka-titles.list and italian-aka-titles.list and simply download aka-titles.list. It includes all the AKA titles (Italian and German too...).

Agradable

3 Messages

 • 

142 Points

Hi!

Yes, I noticed that list as well.  However, it does not actually contain all the AKA titles present in the IMDb web interface.  All non-English AKA titles that I managed to find in that list are AKA titles of the same language as the film, typically working titles and such.

The list does not seem to contain any translated AKA titles.  For example, the film Sister Act http://www.imdb.com/title/tt0105417/ has many AKA titles on http://www.imdb.com/title/tt0105417/releaseinfo#akas but only German and Italian ones in the aka-titles.list text file.

I'm guessing this might have something to do with not including too much unverified data in the public archives, or maybe just keeping the archives simple.  Still, it's useful data and would be nice to have included. :)

Champion

 • 

825 Messages

 • 

46.1K Points

it does not actually contain all the AKA titles present in the IMDb web interface
You're right! I hadn't noticed it until now. I've checked an old version of aka-titles.list (dated 20 Nov 2015), and it too contained only those two AKAs for Sister Act, so this has been a problem for a long time.
I'm guessing this might have something to do with not including too much unverified data in the public archives, or maybe just keeping the archives simple
As happened once with the movie-links.list file, I hope this is just a situation IMDB isn't aware of, and that it'll be fixed as soon as they read your post above. aka-titles.list is one of the smallest archives in the repository (the current file contains 465,441 AKA titles; for comparison, actors.list includes over 13,000,000 credits), so I don't think simplicity is the issue.

Thank you very much for reporting this problem!

Agradable

3 Messages

 • 

142 Points

Thanks for following up so quickly!

1 Message

 • 

60 Points

Is there an update on the missing AKA titles? So far nothing has changed.

1 Message

 • 

60 Points

4 years ago

Hello, I have a couple of questions:
1. Is there a way to get the exact vote breakdown for each movie? For example, like this one: http://www.imdb.com/title/tt0111161/ratings?ref_=tt_ov_rt
I know there's a distribution column in the ratings.list file, but it's not very precise (a sketch of how I currently decode it is below).
2. Is there a way to turn the files into an SQL database?
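
For reference, here is a rough sketch of how I read that distribution column, assuming the 10-character histogram convention (one character per rating 1-10; '.' means no votes, '0'-'9' keep only the tens digit of the percentage, '*' means essentially all votes); that coarse bucketing is why the breakdown is imprecise:

# Decode a ratings.list distribution string into rough per-rating vote shares.
# The 10-character convention described above is an assumption about the format.
def decode_distribution(dist):
    buckets = {}
    for rating, ch in enumerate(dist, start=1):
        if ch == ".":
            buckets[rating] = "0%"
        elif ch == "*":
            buckets[rating] = "~100%"
        else:
            low = int(ch) * 10
            buckets[rating] = f"{low}-{low + 9}%"   # only a decile range survives
    return buckets

print(decode_distribution("000000.125"))   # made-up example string, not a real line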