3 Messages • 390 Points • Thu, May 29, 2014 9:28 PM

API/Bulk Data Access

Hi!

We’re in the process of reviewing how we make our data available to the outside world, with the goal of making it easier for anyone to innovate and answer interesting questions with the data. If you use our current FTP solution to get data [http://www.imdb.com/interfaces], or are thinking about it, we’d love to get your feedback on the current process for accessing data and what we could do to make it easier for you in the future. We have some specific questions below, but we’d be just as happy to hear about how you access and use IMDb data so we can build a better overall experience.

1. What works/doesn’t work for you with the current model?
2. Do you access entertainment data from other sources in addition to IMDb?
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
5. Does how you plan on using the data impact how you want to have it delivered?
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
7. Are our T&Cs easy for you to understand and follow?


Thanks for your time and feedback!

Regards,

Aaron
IMDb.com

Responses

1 Message • 230 Points • 6 years ago

I think it's great that you have a flat-file listing of all the data; however, I want to read it with a Perl program, and parsing your format is not a simple task. It would be much simpler if the fields were delimited by a '|' or something similar, with a code indicating movie, TV show, etc. I am looking at the actor/actress files specifically; I haven't looked at the others yet. I don't suppose you already have Perl scripts available that would read the file and insert the data into MySQL?
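In the absence of an official script, here is a minimal sketch (in Python rather than Perl, and not an IMDb-provided tool) of the kind of converter being asked for. It assumes the usual list layout where a performer's name appears only on the first line of their block and continuation lines begin with whitespace; header/footer handling and encoding quirks are left out for brevity:

```python
import sys

# Sketch: flatten an actors/actresses list into pipe-delimited rows.
# Assumes "Name<TAB>First Title" on the first line of each block and
# whitespace-indented continuation lines carrying one title each.
current_name = None
for raw in sys.stdin:
    line = raw.rstrip("\n")
    if not line.strip():
        continue                      # skip blank separator lines
    if line[0] not in " \t":          # new performer block
        current_name, _, title = line.partition("\t")
    else:                             # continuation line: title only
        title = line
    if current_name and title.strip():
        print(f"{current_name}|{title.strip()}")
```

The resulting file could then be bulk-loaded into MySQL with something like `LOAD DATA INFILE ... FIELDS TERMINATED BY '|'`.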

4 Messages • 122 Points

I just wrote a simple Java program that does just that; it doesn't put the data into a DB, but it outputs a '|'-separated list that can then be MapReduce-processed.

I could clean it up a bit and put it up on GitHub etc. if anyone else is interested?

1 Message • 64 Points

Now this is something I would really like to see. If you would be so kind as to share it with us... much appreciated.

2 Messages • 70 Points • 6 years ago

I've looked at all the dump files, but I can't find how to figure out whether a title is a film or a TV show. The way I do it right now is by using the running-times.list file: if the running time is over 60 minutes, I say it's a film; otherwise it's a TV show. Is there a better way?

Champion • 1.9K Messages • 92.6K Points

The type is defined by the formatting of the title.

A TV series is wrapped in quotes, e.g. "Fresh Point" (1997).
A TV episode has a more complex title, e.g. "Fresh Point" (1997) {Are You Ready for Sex? (#1.79)}

Note that this means all TV series and episodes sort to the top of the list.

See Submission Guide: Title Formats for the way to recognize the other types.
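A quick sketch of that rule in code (this mirrors the description above; it is not an official parser, and the category names are invented):

```python
import re

# Sketch: classify a raw title from the list files by its formatting alone.
def classify_title(title: str) -> str:
    if title.startswith('"'):
        # Episodes carry an extra {...} segment after the series title.
        return "tv_episode" if re.search(r"\{.*\}", title) else "tv_series"
    return "other"  # films, videos, etc. (see the Submission Guide)

assert classify_title('"Fresh Point" (1997)') == "tv_series"
assert classify_title('"Fresh Point" (1997) {Are You Ready for Sex? (#1.79)}') == "tv_episode"
```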

2 Messages • 70 Points

Wow, thank you! I was not aware of this guideline; really helpful.

1 Message • 60 Points • 6 years ago

As members of /dev/fort, it's really useful for us that you publish full dumps of all the data: we regularly go to locations without an internet connection to build new things, and we wouldn't be able to use IMDb data without access to a full, local mirror of it.

Champion • 1.9K Messages • 92.6K Points • 6 years ago

1) I usually use a database (MS Access) to manipulate data.

a) For this to work well I need to have all the data on each record. In some of the lists the primary key is missing from most of the records: e.g. the actress list omits the name on all but the first record for each person. This would require writing a routine to step through the created table, inserting the name in each blank field.

b) The supplied records need to have defined fields, possibly as a .csv or .tsv file. The use of a unique character such as the pipe (|) would be acceptable, although one of the standard formats would be easier to use.

c) The inclusion of the unique keys (tconst, nconst, etc.) would be a tremendous help. While the full name can be used as a key, the consts would be easier to use and faster to process.

2) I occasionally get data from other sources (such as tv.com), but IMDb is my primary source.

3) A single full database would be good to have, but it would also mean downloading a large data set even if I were only interested in one aspect.

4) A good API might be even better than the FTP dumps, provided it is reasonably easy to use and the returned dataset meets the above criteria (all data on each record, unique keys, defined fields).

5) As noted above, I want to load this into a database; therefore a flat file with defined fields is the best option for me.

6) I have not worked with JSON and currently have no software to process such a dataset; as noted above, a .tsv file would be the most useful to me.

4 Messages • 302 Points • 6 years ago

1. Once a parser is built for the data files, parsing basically just works, with one major exception: there are no primary keys, so there's a lot of babysitting the data over time.
2. Sure. FreeDB, Wikipedia, Rotten Tomatoes, Amazon, among other sources.
3. Primary keys, did you say?
4. I like being able to pull all the data down at once and make minimal calls to the server, but an API in addition would not be unwelcome.
5. Sure.
6. JSON would be fine.
7. The T&Cs could use some updating/examples/clarification for this mashup/integration/Web 2.0 era.

1 Message • 62 Points

I agree

1 Message • 62 Points

Still agreed with this comment.

2 Messages • 130 Points • 6 years ago

I think the current format leaves a lot to be desired.

1. What works/doesn’t work for you with the current model?
Mostly nothing is working:
  • The format is undocumented, difficult to parse, and differs for each file
  • I checked e.g. the actors list, but there appears to be no way to get the IMDb URL associated with an actor. The same goes for movies, etc.
  • If I want to get some information from the DB, finding the right file for it is mostly guesswork. It should be documented which file contains what.
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
Way more useful:
  • Having primary keys is always better. If the data has that, it does not even have to be in a single file.
  • Having everything in-place makes parsing easier.
  • Please use the IMDB URL (e.g. http://www.imdb.com/name/nm0000158/ for Tom Hanks) as the primary key.
  • I am not sure about the importance of a single large file; I think at least people (actors, directors, etc.) and works (movies, series, etc.) should be in separate files.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
They are complementary. APIs are good for occasionally issuing a few queries; bulk data models are good for working with large amounts of data. Having both would be best.

5. Does how you plan on using the data impact how you want to have it delivered?
See above.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
Please: JSON. Everybody can read JSON; it is the obvious choice. The current format mess is not really helpful.

2 Messages • 130 Points

Oh, and one more thing: UTF-8, please!

2 Messages • 130 Points

I agree. JSON is a must. If carefully done, a Unix-friendly flat file is also great (i.e. one record per line, strictly delimited).
Still, JSON takes priority, because it has structure and can be extended without breaking existing parsers. Using JSON also eliminates encoding issues (strings must use UTF-8).
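The two goals can even be combined: JSON Lines keeps one complete JSON object per line, so it stays both structured and Unix-friendly. A small sketch of reading such a dump (the file name and field names are invented for illustration):

```python
import json

# Sketch: read a hypothetical JSON Lines dump, one record per line.
with open("titles.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)   # each line is a complete JSON object
        print(record["tconst"], record["title"])
```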

2 Messages • 70 Points

Do you really want a long string as a primary key? That doesn't seem like a good idea, especially if you want to join between two datasets. Joining on large strings is slower than it needs to be.
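If URL keys were all that was offered, one workaround would be to assign integer surrogate keys on import and join on those instead. A sketch (names invented):

```python
# Sketch: map long string keys to compact integer surrogates before joining.
surrogates: dict[str, int] = {}

def surrogate(key: str) -> int:
    # Assign the next integer the first time a key is seen.
    return surrogates.setdefault(key, len(surrogates))

person_id = surrogate("http://www.imdb.com/name/nm0000158/")
```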

2 Messages • 70 Points

I forgot about the weird encoding the data files use. UTF-8 would definitely be a very good upgrade. In my applications, one of the first things I do is convert the strings to UTF-8.

2 Messages • 72 Points

I just ran into this issue. Anyone care to share what encoding they're using?

Edit: ISO-8859-1 (or -2); the chardet tool on Linux is brilliant :)
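For reference, a minimal re-encoding sketch, assuming the detection above is right (the file names are placeholders):

```python
# Sketch: re-encode a list file from ISO-8859-1 (Latin-1) to UTF-8.
# Adjust the source encoding if chardet reports something else.
with open("actresses.list", encoding="iso-8859-1") as src, \
     open("actresses.utf8.list", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```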

5 Messages • 210 Points

Not URLs, but IMDb IDs would be nice as primary keys, e.g. "nm0000158". And please document their structure (characters, numbers, max length, etc.).
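If the structure matches the visible examples (a two-letter prefix followed by digits), validation could be as simple as the sketch below; the prefix set and digit count are assumptions, not a documented spec:

```python
import re

# Sketch: validate IDs of the shape seen in this thread ("nm" + digits for
# names, "tt" + digits for titles). The digit count is an assumption.
ID_RE = re.compile(r"^(nm|tt)\d{7,8}$")

assert ID_RE.match("nm0000158")
assert not ID_RE.match("http://www.imdb.com/name/nm0000158/")
```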

2 Messages • 80 Points • 6 years ago

I will follow up with answers to the questionnaire soon. Meanwhile, I want to ask a very basic question: the introduction to the Alternative Interfaces mentions that a 'subset' of the data is provided. What exactly is not being provided?

Are we not getting all of the individual entities, or are we getting all of the IMDb entities but with incomplete data for some or all of them, or both? Thanks.

1 Message • 62 Points • 6 years ago

I'm using your data for a local movie database. For the movie thumbnails I'm using a different server.
Your data is very hard to parse, but I think there was documentation somewhere on the FTP.
First of all, I'm missing a unique ID. I would be glad if you chose to use your own ID (the one shown in the URL).
Also, it would be much easier to parse and use your data if you used a standardized format (JSON would be OK, or an SQL dump would be fine, too).
I have a very fast internet connection, so I personally do not care whether you provide one very big file with all the data or several smaller files. But for most users, I think multiple files with different content types would be better.
An API would not improve the current situation for me. Sometimes I'm looking for a particular movie and run complex search queries, because I only remember a few small details about it. I don't think that would be possible with an API, and a local search would be faster anyway.
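To illustrate the kind of "small details" query a local mirror makes easy, here is a sketch against a hypothetical SQLite schema (table and column names are invented):

```python
import sqlite3

# Sketch: fuzzy, multi-condition search against a hypothetical local mirror.
con = sqlite3.connect("imdb_local.db")
con.execute("""CREATE TABLE IF NOT EXISTS movies
               (title TEXT, year INTEGER, runtime_minutes INTEGER)""")
rows = con.execute(
    """SELECT title, year FROM movies
       WHERE year BETWEEN 1985 AND 1995
         AND runtime_minutes > 120
         AND title LIKE '%island%'"""
).fetchall()
```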
Thanks for providing the data. I'm looking forward to your changes.

1 Message • 62 Points • 6 years ago

1. The format of the current model is a bit tedious to work with:

i) The instructions and data are mixed, which means getting to the computer-readable part of a list requires manual (and brittle) code to skip the header (see the sketch after this list). The archive should include a README file with the instructions, and a separate machine-readable file with just the raw data.
ii) Parsing the data is a pain; TSV would be a much preferable format.

2. No
3. Probably more useful, assuming it was easy to manipulate, e.g. in SQLite. Currently, a lot of effort is spent on parsing the dataset.
4. Potentially more useful, as long as it was possible to get the full dataset via the API as well.
5. -
6. Yes
7. Yes
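A sketch of the brittle header-skipping mentioned in 1.i (the divider-line heuristic is an assumption; the lists are not consistent about it):

```python
# Sketch: yield only the data portion of a *.list file, treating the first
# line made up of dashes (and whitespace) as the end of the prose header.
def data_lines(path: str, encoding: str = "iso-8859-1"):
    in_data = False
    with open(path, encoding=encoding) as f:
        for line in f:
            stripped = line.strip()
            if in_data:
                yield line.rstrip("\n")
            elif stripped and set(stripped) <= {"-", "\t", " "}:
                in_data = True
```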

1 Message • 60 Points • 6 years ago

1. What works/doesn’t work for you with the current model?
Using a common delimiter would make so much more sense for your flat-file dumps. Having a variable number of fields with a variety of delimiters makes my head hurt and my code slow.
2. Do you access entertainment data from other sources in addition to IMDb?
No.
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
I personally find using third-party (not internal) primary keys to be brittle.
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Right now I'm using the IMDB for instruction and performance testing. If IMDB were to provide a JSON+REST API, it would allow me to teach more data access techniques with the same dataset. Right now I simply don't have an application for an IMDB API.
5. Does how you plan on using the data impact how you want to have it delivered?
For real-time applications, having access to a live IMDb API would be awesome. Caching could be done under a policy of my own (possibly influenced by IMDb) instead of simply waiting for the FTP dump to be updated. Depending on load and usage, a live API has the potential to use less bandwidth, for me and for IMDb, than the FTP dump.
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
I very much prefer JSON, as most languages I use have (near-)standard libraries for parsing it. It's less verbose (read: smaller) and easier to read than XML. If I were maintaining such a large, plain-text data set, I'd still consider keeping a delimited (CSV or equivalent) version if the size difference were noticeable.
7. Are our T&Cs easy for you to understand and follow?
I've never had a problem with your T&Cs.