IMDb Data – Now available in Amazon S3
This is an announcement for customers of the IMDb bulk data available via FTP.
We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.
For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.
In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:
- Data refresh frequency is now daily (previously weekly).
- IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
- The files are in tab-separated values (TSV) format.
- The datasets now include only the essential data needed for matching and linking to an IMDb title or name.
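A minimal sketch of how the TSV files and identifiers described above might be consumed. The column names (`tconst`, `titleType`, `startYear`), the `\N` marker for missing values, and the sample rows are assumptions made for illustration; check www.imdb.com/interfaces for the actual layout.

```python
import csv
import io

# Hypothetical sample in the shape of a title file. Column names and the
# "\N" missing-value marker are assumptions, not taken from the announcement.
SAMPLE = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tshort\tExample One\t1894\n"
    "tt0000002\tshort\tExample Two\t\\N\n"
)


def load_titles(handle):
    """Index rows of a tab-separated title file by their title identifier."""
    # QUOTE_NONE treats double quotes as ordinary characters in the data.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    titles = {}
    for row in reader:
        # Map the assumed missing-value marker to None.
        titles[row["tconst"]] = {
            key: (None if value == "\\N" else value) for key, value in row.items()
        }
    return titles


def imdb_url(tconst):
    """Build a link back to the IMDb title page for an identifier."""
    return f"http://www.imdb.com/title/{tconst}/"


titles = load_titles(io.StringIO(SAMPLE))
for tconst, record in titles.items():
    print(imdb_url(tconst), record["primaryTitle"], record["startYear"])
```

Keying the rows by identifier is what makes matching against another catalog a simple dictionary lookup.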
ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata
ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata
If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license. The license grants you access to our content via an XML web service, plus the right to use the content in your product or service. If that interests you, please email licensing@imdb.com.
If you have any questions or concerns, please share your feedback in this thread.
Thank you for your continued support.
Official Response
Col_Needham
Employee
•
7.4K Messages
•
180.3K Points
7 years ago
On the S3 access issues, we now have a working prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. Please watch for an announcement as we convert this into production code. The only thing needed will be an ordinary IMDb user account attached to a valid email address. We still intend to also make the data available via S3 for those people who find the AWS access tools more convenient and can stay within the free tier of AWS.
On the general data availability, we are adding the AKA titles to the basic data set accessible to everyone. Longer term, we are looking at the possibility of daily diff files for at least some of the data in the basic set.
On the point about contributors, we are looking at extending the range of data available via the http solution based on your contribution history and volume. For top contributors and those people using the data to help us clean it via bulk corrections, this is likely to extend far beyond the current set of data even on the FTP site. It is not our intention to deny access to the data to those people who have genuinely helped to build it over the years and who want to continue to improve IMDb. We also aim to be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case-by-case basis. This latter part may take some time to become a fully formed solution, so please bear with us.
The background to all of this is that there is a huge multi-year technology migration project which is nearing completion at IMDb. We have too many complicated old systems around which have been slowing the overall pace of development (I add a bit more detail to this on https://getsatisfaction.com/imdb/topics/why-doesnt-imdb-staff-ever-consult-with-the-contributor-base...). The move to the new technology has been providing the opportunity to look at the way we operate different parts of the IMDb service. One of the oldest software systems is the one which publishes the FTP data, and we will soon no longer even be able to generate the .list files once the final pieces of the old IMDb system are decommissioned; at least not without re-writing all of the publication software to connect to the new system and produce an extremely difficult-to-manipulate text file format which was designed 27 years ago and has not changed in 21 years. Instead, we decided that it would be better to publish the data via a modern system (S3 and soon over https) in a modern format which can be more easily parsed.
The other problem with FTP is that we have no idea how many people are using the data and for what purpose, nor do we know what additional things they may want from the data. From feedback over the years, we knew some of your requirements already, notably (a) access to the title and name constant data, (b) an easier-to-parse format, (c) information to help in matching other catalogs to IMDb, and (d) more frequent updates. We found ourselves having to guess the remaining requirements until we decided the best way forward was to move the data to a new location within the FTP sites, post an announcement on Get Satisfaction (this thread) and then wait to gather feedback before replying and figuring out what steps to take next (this reply).
We hope this helps. We have plenty to work on in the meantime, and we will follow up as we deliver parts of the above.
Col
Founder & CEO, IMDb.com.
6
Official Response
sv_6654070
15 Messages
•
820 Points
7 years ago
Please stay tuned for more updates. Thanks!
5
Official Response
chris_h_7i63q04tc55pm
Employee
•
66 Messages
•
3K Points
7 years ago
Earlier in this thread, Col referred to a prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. This system will require an ordinary IMDb user account attached to a valid email address. However, it is not quite ready for production yet, so to help address some of the concerns raised about the 'Requester Pays' access via S3, today we activated an https entry point to provide access to the basic datasets. The https location is https://datasets.imdbws.com/. The page http://www.imdb.com/interfaces/ has been updated with this information.
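For readers who want to script against the https entry point, here is a rough sketch using only the Python standard library. The file name `title.basics.tsv.gz` and the assumption that files are served gzip-compressed are not confirmed in this thread, and the helper names are hypothetical.

```python
import gzip
import urllib.request

# Entry point quoted in the post above.
BASE_URL = "https://datasets.imdbws.com/"


def dataset_url(name):
    """Return the download URL for a dataset file (the name is an assumption)."""
    return BASE_URL + name


def iter_rows(binary_stream):
    """Yield tab-split rows from a gzip-compressed TSV byte stream."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8") as text:
        for line in text:
            yield line.rstrip("\n").split("\t")


def fetch_rows(name):
    """Stream rows over HTTPS without writing the file to disk."""
    with urllib.request.urlopen(dataset_url(name)) as response:
        yield from iter_rows(response)


# Usage (performs a network request, so it is left commented out):
# for row in fetch_rows("title.basics.tsv.gz"):
#     print(row)
#     break
```

Decompressing while streaming avoids holding the whole file in memory, which matters for a bulk dataset that refreshes daily.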
We are finalizing the extended datasets and access model and I will post an update about that as soon as it is ready.
The final build of the data that gets published to the FTP mirrors occurred yesterday, so those mirrors contain the final FTP snapshot. While the data on the FTP servers will not be updated going forward, we will not remove the data for at least the next few weeks, so people who need that data can still download it.
5
matisszz
3 Messages
•
90 Points
7 years ago
I'm trying to get the full and up-to-date dataset for alternative movie titles. I first tried the archive from the ftp servers - aka-titles.list.gz - but some titles appear to not have all the information.
Example:
http://www.imdb.com/title/tt0090557/releaseinfo?ref_=tt_dt_dt#akas
That has 17 alt. titles, while aka-titles.list contains only 3.
I figured the new S3 datasets would be better, but those don't even have alt. titles available, just the most basic movie information.
Any idea how this could be resolved?
0
0
sv_6654070
15 Messages
•
820 Points
7 years ago
To help us better understand your use case, please share details on how you use or plan to use the alternate titles data.
0
0
marcel_korpel
5 Messages
•
210 Points
7 years ago
So, in short, it seems that there is less data (that is easier to parse, I assume), and you actually have to pay and jump through hoops setting up (an) account(s)* with AWS and S3 to be able to access the bulk data using an API I have to learn to use. If I am correct, you have to pay for the bandwidth, so are there even diff files provided to lessen that burden, like on the FTP sites?
All in all, I am sad to say this doesn't sound like an improvement to me.
* Using a credit card, which is not that common in several countries in the world (in the Netherlands, for instance, debit cards are far more common).
1
gardner_von_holt
16 Messages
•
792 Points
7 years ago
While I accept that things change over the years, the main thing that I would desire to have restored is complete actor and actress credits for future movies.
I have never used the data for anything other than my personal movie collection, and despite being a software developer have never offered this software publicly to date.
I have invested significantly over the years in developing my internal movie database oriented software to enhance my enjoyment of movies, and have likewise for many years bought most of my movies from amazon.com, .de, or .co.uk. If I gave in on this I would be reducing years of development of my movie tracking to wasted effort.
You can measure how long I have used the data and IMDb by my userid (gvh), one I could no longer acquire today, and by the fact that I have had an account with Amazon for about 20 years.
Additionally, I believe Amazon made representations when you purchased IMDB to continue to make it available to the public, and have for many many years provided this data for non-commercial purposes. I have invested very significantly in this software and am frustrated to see access withdrawn to data that has always been there.
What would make me happy? Any of the following:
* Ability to download full data for movies you purchased from any Amazon business, or view on Amazon Prime movies (I get data for those movies I pay for). I'm totally OK with limiting my access to those movies I have a commercial relationship with Amazon about.
* Some API (rate limited would be OK) to return the full cast for an individual movie
This would allow me to add new entries for newly purchased movies, a vast majority of which I buy from Amazon. And limiting me to a handful of queries a day or week would be acceptable. (I get 5 queries a day plus 5 queries for each movie I purchase at Amazon, for example)
* Full data limited to recently released movies and DVDs (so that when I buy a movie I can add new data). I rarely purchase new movies that are back catalog. Almost everything is purchased when the DVD goes on sale, so any clever way of limiting me to newly released movies would solve much of this need.
* I'll even pay a token amount per set of queries if you want some way to guard against being spammed with accounts. Or you can charge my AWS account if that's possible.
S3 as a data source and tabbed text are fine for me, as would be any sort of API or method you could offer. I'll write whatever software is necessary to access the data and, as I said, this is only necessary for new titles, so I'm totally OK with rate limiting my access (given that there is some way to test and develop with sample data or something in a non-rate-limited way).
Thanks, and I can be reached at the email associated with this account for further discussion
1
andrew_gallant
6 Messages
•
212 Points
7 years ago
I did run across one small technical issue. In `title.basics.tsv`, there are a few records that appear to be malformed. For example:
In this record, the `primaryTitle` and `originalTitle` fields appear to begin with a double quote, but there is no corresponding closing quote. The actual name of the title does start with a double quote, so I think the correct format would be:
Since the CSV/TSV format escapes quotes by doubling them.
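To illustrate the quoting issue, here is a small sketch with Python's `csv` module. The record below is hypothetical (it is not the record from the file), and only the first few columns are shown. It compares default quote handling, quote handling switched off, and the doubled-quote escaping suggested above.

```python
import csv
import io

# Hypothetical record: both title fields begin with a literal double quote
# that is never closed, as described for the malformed rows.
RAW = 'tt9999999\tshort\t"Hello\t"Hello\t0\n'
# The same record with the quote escaped by doubling inside a quoted field.
ESCAPED = 'tt9999999\tshort\t"""Hello"\t"""Hello"\t0\n'


def parse(text, **options):
    """Parse a single tab-separated record with the given csv options."""
    return next(csv.reader(io.StringIO(text), delimiter="\t", **options))


naive = parse(RAW)                            # default quoting: fields get merged
literal = parse(RAW, quoting=csv.QUOTE_NONE)  # quotes treated as plain characters
escaped = parse(ESCAPED)                      # default quoting, escaped input

print(len(naive), len(literal), len(escaped))
print(literal == escaped)
```

With default settings the parser treats the leading quote as the start of a quoted field and swallows the following tab, so the record comes back with fewer fields than it has columns. Reading the raw file with `quoting=csv.QUOTE_NONE` is a workaround on the consumer side; doubling the quotes would be the fix on the producer side.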
Thanks!
1
ron3
386 Messages
•
9.4K Points
7 years ago
For example, mis-spellings in release date attributes. I take it this data is disappearing forever, given the very limited imdb-datasets items shown on the interfaces page.
Perhaps it might be better to list what data is remaining available, and what data is going away?
Thanks.
3
gp_hm83t2lf4wkw8
1 Message
•
80 Points
7 years ago
Admittedly, most cases of hacked accounts were the result of unintentionally published private keys on GitHub. Still, for me, using such a service would give me sleepless nights, as I could never be sure I would not wake up with a multi-thousand dollar debt.
This attitude is irresponsible on Amazon's part, and it also means that IMDb no longer has a reasonable publicly available data set.
1