IMDb Data – Now available in Amazon S3

This is an announcement for customers of the IMDb bulk data available via FTP.

We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.

For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.

In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:

Data refresh frequency is now daily (previously weekly).
IMDb title and name identifiers are included in all files for ease of matching and linking back to IMDb.
The files are in tab-separated values (TSV) format.
The datasets have been trimmed to the essential ones that help with matching and linking to an IMDb title or name.
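
As a rough illustration of what reading one of these TSV files might look like, here is a minimal Python sketch. It assumes the first line is a header row and that missing values are written as `\N` (as in the title.basics sample quoted later in this thread). The function name `read_imdb_tsv` and the inline sample are hypothetical.

```python
import csv
import io

NULL_MARKER = r"\N"  # assumption: missing values appear as a literal backslash-N


def read_imdb_tsv(handle):
    """Yield one dict per row, mapping the null marker to None."""
    # QUOTE_NONE: treat quote characters inside titles as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == NULL_MARKER else value)
               for key, value in row.items()}


# Hypothetical sample in the shape of title.basics (columns shortened).
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\n"
)
rows = list(read_imdb_tsv(io.StringIO(sample)))
print(rows[0]["tconst"], rows[0]["endYear"])
```

For a downloaded `.tsv.gz` file, the same function should work on a handle opened with `gzip.open(path, "rt", encoding="utf-8", newline="")`, assuming the files are UTF-8 encoded.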

As part of housekeeping on the FTP site, the data files will no longer be updated. The list data files will continue to be available at two locations (see below) until February 28, 2017. We strongly encourage FTP site users to switch to the S3 solution as soon as possible to ensure their applications continue to work without interruption.

ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata

ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata

If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license. The license grants you access to our content via an XML web service, plus the right to use the content in your product or service. If that interests you, please email licensing@imdb.com.

If you have any questions or concerns, please share your feedback in this thread.

Thank you for your continued support.

Responses

chuckkahn

2 Messages

•

90 Points

6 years ago

I do not approve of this merging of every topic into the "IMDb Data - Now available in Amazon S3" topic. Why can't we have separate sub-topics? Un-merge please!

0

govert_schipper

2 Messages

•

130 Points

6 years ago

Hi, I want to use the datasets (for personal use) and just discovered this thread. Where can I report issues with the datasets? What I found today in the name.basics file is a name ID (nconst) with length 10, while all the others have a length of 9. It is nm21954871 (Bela Lenkehegyi). Because of this I got a duplicate while importing the file into an SQL table.
Furthermore, I found that the description on the main page for the title.principals file is no longer consistent with the actual data. There is no column named 'principalCast'; instead, several other columns exist. Of course you can find this out by looking at the first line of the data file, but it took me a while to work it out.
Thanks for providing the datasets. I hope this comment will make them even better.

3

chris_h_7i63q04tc55pm

Employee

•

66 Messages

•

3K Points

Hello Govert,
You will be seeing more nconsts with a length of 10 characters over time, so you need to account for both 9- and 10-character IDs.

Thank you for pointing out the discrepancy in the descriptions of the file formats. I will file a ticket with the relevant team to have this corrected.
Best regards,
Chris.

6 years ago

0

jeorj_euler

10.6K Messages

•

224.9K Points

At what point in time did the nconst values finally extend beyond (the not allocated) nm9999999? I didn't even notice.

— Jeorj Euler, an IMDb regular registrant

6 years ago

govert_schipper

2 Messages

•

130 Points

Hi Jeorj, I could find only one occurrence of a name ID with length 10, but the maximum value for 9 characters is imminent, so be prepared for more to pop up.

6 years ago

jeorj_euler

10.6K Messages

•

224.9K Points

6 years ago

Hi, Govert Schipper. If you want an "nconst" regular expression, then the regex you're looking for is

/^nm([1-9]+\d*)?\d{7}$/

or plainly...

/^nm\d{7,}$/
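
A minimal Python sketch of an identifier check that accepts both 9- and 10-character values. The pattern (`nm` followed by seven or more digits) is an assumption based on the IDs mentioned in this thread, not an official specification, and `is_nconst` is a hypothetical helper name.

```python
import re

# Assumption: "nm" followed by at least seven digits, with no upper bound.
NCONST_RE = re.compile(r"nm\d{7,}")


def is_nconst(value):
    """Return True if value looks like an IMDb name identifier."""
    return NCONST_RE.fullmatch(value) is not None


print(is_nconst("nm9999999"))   # 9 characters
print(is_nconst("nm21954871"))  # 10 characters, the ID reported above
print(is_nconst("nm123"))       # too short
```

When importing into an SQL table, a variable-length text column for the ID is probably safer than a fixed width of 9 characters.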

0

tim_griffin_6g0kubginep9z

2 Messages

•

142 Points

6 years ago

We have been using the FTP data in our undergraduate database course at the University of Cambridge, UK. See https://www.cl.cam.ac.uk/teaching/1718/Databases/materials.html. I was very sad to see that this data is no longer available. Yes, they were a pain to parse, but that was a real issue. However, it was one of the richest large data sets on the web that students did not require domain-specific training in order to understand. I guess I'll be using the dump I snagged in August of 2017 for the next couple of years until I find another data source for the course. BTW, the new data file title.principals.tsv.gz has been empty every time I have looked. Perhaps this will contain the data I need? For example, we want actors in movies so we can do the "Kevin Bacon number" type queries.
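
For readers unfamiliar with "Kevin Bacon number" queries, here is a small Python sketch of the idea: a breadth-first search over the graph linking people to the titles they appear in. It assumes the principals data can be reduced to (tconst, nconst) pairs; the function name and the toy identifiers are made up for illustration.

```python
from collections import defaultdict, deque


def bacon_numbers(credits, start):
    """Breadth-first search over the person/title graph.

    credits: iterable of (tconst, nconst) pairs.
    Returns {nconst: number of shared-title hops from start}.
    """
    cast_of = defaultdict(set)    # tconst -> people credited on it
    titles_of = defaultdict(set)  # nconst -> titles they are credited on
    for tconst, nconst in credits:
        cast_of[tconst].add(nconst)
        titles_of[nconst].add(tconst)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for tconst in titles_of[person]:
            for costar in cast_of[tconst]:
                if costar not in distance:
                    distance[costar] = distance[person] + 1
                    queue.append(costar)
    return distance


# Made-up identifiers, for illustration only.
toy_credits = [
    ("tt1", "nmA"), ("tt1", "nmB"),
    ("tt2", "nmB"), ("tt2", "nmC"),
    ("tt3", "nmD"),
]
print(bacon_numbers(toy_credits, "nmA"))
```

People with no chain of shared titles back to the starting person (here `nmD`) simply do not appear in the result.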

2

chris_h_7i63q04tc55pm

Employee

•

66 Messages

•

3K Points

Thank you for reporting the missing data from the principals file. I have reported it to the tech team for fixing.

Best regards,
Chris.

6 years ago

chris_h_7i63q04tc55pm

Employee

•

66 Messages

•

3K Points

title.principals.tsv.gz now has the data populated again.

Best regards,
Chris.

6 years ago

robert_mcgirr_h1vwu6aq0r7dx

6 Messages

•

172 Points

6 years ago

I am seeing title IDs in the Title file that produce 404 errors when one tries to hit the page; tt7430188 is one example. Is there a maximum time within which we can expect the dataset files to be synced with IMDb? I also noticed last week that some episode IDs were in the Title file but did not make it to the Episode file for a few days. I am just looking for info on what we should expect in terms of file update latency, before we start wondering if our ingestion process may be at fault.

Thanks

0

robert_mcgirr_h1vwu6aq0r7dx

6 Messages

•

172 Points

6 years ago

The title.basics file today contains 2,232,860 duplicate IDs.
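
One way to check a file for repeated identifiers is sketched below in Python. It assumes a tab-separated file with a header row and a `tconst` column; `duplicate_ids` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def duplicate_ids(handle, id_column="tconst"):
    """Return {id: count} for identifiers that appear on more than one row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row[id_column] for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample with one repeated identifier.
sample = (
    "tconst\tprimaryTitle\n"
    "tt0000001\tCarmencita\n"
    "tt0000002\tExample Title\n"
    "tt0000001\tCarmencita\n"
)
print(duplicate_ids(io.StringIO(sample)))
```

Running a check like this before loading the data makes it easier to tell a problem in the published file apart from a problem in the ingestion process.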

0

k247

3 Messages

•

180 Points

5 years ago

For a couple of days now, all the genres seem to be gone from title.basics.tsv.
All entries have been replaced with \N. As far as I can see, not a single movie has a genre anymore.

old file
tconst titleType       primaryTitle    originalTitle   isAdult startYear       endYear runtimeMinutes genres
tt0000001       short   Carmencita      Carmencita      0       1894    \N      1       Documentary,Short
...

current file
tconst titleType       primaryTitle    originalTitle   isAdult startYear       endYear runtimeMinutes genres
tt0000001       short   Carmencita      Carmencita      0       1894    \N      1       \N
...

Did this happen by accident, or is this data no longer available at all?
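
A quick sanity check for this kind of regression could look like the Python sketch below, which measures how many rows carry the `\N` marker in a given column. The helper name `null_fraction` is hypothetical, and the sample simply reuses the old and current rows quoted above with the columns shortened.

```python
import csv
import io

NULL_MARKER = r"\N"  # missing-value marker seen in the samples above


def null_fraction(handle, column="genres"):
    """Return the share of rows whose column holds the null marker."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = nulls = 0
    for row in reader:
        total += 1
        nulls += row[column] == NULL_MARKER
    return nulls / total if total else 0.0


# One row with genres (old file) and one without (current file).
sample = (
    "tconst\tgenres\n"
    "tt0000001\tDocumentary,Short\n"
    "tt0000001\t\\N\n"
)
print(null_fraction(io.StringIO(sample)))
```

A value close to 1.0 for a column that used to be populated would suggest a problem in the published file rather than in the importer.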

1

robert_mcgirr_h1vwu6aq0r7dx

6 Messages

•

172 Points

Is any information available on this? Should we stop looking for genres in the dataset now, or are they coming back at some point?

5 years ago

robert_mcgirr_h1vwu6aq0r7dx

6 Messages

•

172 Points

5 years ago

Title.basics file today is rife with duplicates.

0

rob_fx6qwf85pughq

3 Messages

•

90 Points

5 years ago

The title.basics file has had a large number of duplicate rows for the past 2 days.

0

ryang_1868206

4 Messages

•

302 Points

5 years ago

It does not look like the S3 datasets repository has been updated since Dec. 4.

0

rob_fx6qwf85pughq

3 Messages

•

90 Points

5 years ago

I cannot find any information on the issue with the datasets. The files have not been updated in several days, and there are still duplicate rows in the title.basics file.
Any info on the state of the files and the expected resolution would be greatly appreciated.

0

brian_ha0kykb3z5haw

1 Message

•

100 Points

4 years ago

Is there any way we can get plot summaries added to the dataset? Also, it looks like the data is no longer on S3 but on datasets.imdbws.com instead, right? Thank you for keeping this data available!

1

Marco

2.7K Messages

•

82.3K Points

I really agree with you Brian. Seven months ago, IMDb's update was that they didn't have an update: https://getsatisfaction.com/imdb/topics/attn-staff-duplicate-verbatim-plot-summaries?topic-reply-lis...

I'm afraid my bump there from three weeks ago won't result in much...

4 years ago

boris_bim2yqa1r4zzp

17 Messages

•

380 Points

4 years ago

Hi,

Is there any chance you could provide a production companies list like the one previously available on both other servers?

ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/production-companies.list.gz

Or at least the data from 12.2017 onward.

Thanks,

Z

1

jeorj_euler

10.6K Messages

•

224.9K Points

I'm hoping Rida sees this.

— Jeorj Euler, an IMDb regular registrant

4 years ago

0
