sv_6654070's profile

15 Messages

 • 

820 Points

Thu, Jun 29, 2017 8:49 PM

IMDb Data – Now available in Amazon S3

This is an announcement for customers of the IMDb bulk data available via FTP.

We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.

For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.


In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:


  • Data refresh frequency is now daily (previously weekly).
  • IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
  • The files are in tab separated values (TSV) format.
  • The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
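
For readers switching over, here is a minimal sketch of reading one of the gzipped TSV files with the Python standard library. The column names `titleId` and `title` and the sample rows are hypothetical placeholders, not the actual layout; the real columns are described at www.imdb.com/interfaces.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path):
    """Yield one dict per data row of a gzipped, tab-separated file with a header row."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters inside titles as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader


def demo():
    """Write a tiny sample file (hypothetical columns) and read it back."""
    sample = (
        "titleId\ttitle\n"
        'tt0000001\tExample "Quoted" Title\n'
        "tt0000002\tAnother Title\n"
    )
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.tsv.gz")
        with gzip.open(path, mode="wt", encoding="utf-8", newline="") as handle:
            handle.write(sample)
        # Consume the generator before the temporary folder is removed.
        return list(read_tsv_gz(path))
```

Calling `read_tsv_gz` on a downloaded file streams rows one at a time, so even large files do not need to fit in memory.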

As part of housekeeping the FTP site, the data files there will no longer be updated. The list data files will continue to be available at two locations (see below) until February 28, 2017. We strongly encourage FTP site users to switch to the S3 solution as soon as possible to ensure their applications continue to work without interruption.

ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata

ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata

 If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license.  The license grants you access to our content via an XML web service, plus the right to use the content in your product or service.  If that interests you, please email licensing@imdb.com.

 If you have any questions or concerns, please share your feedback in this thread.

 Thank you for your continued support.

Responses

15 Messages

 • 

820 Points

4 y ago

Thank you for your feedback regarding the launch of IMDb datasets in S3.

We will be sure to take all the feedback into consideration moving forward as we consider further updates to this product offering.

2 Messages

 • 

170 Points

4 y ago

For my use case, which is a personal TV listings application, I am missing aka-titles.list, ratings.list and language.list.

I use aka-titles.list to match TV listing data to IMDB info, because the TV listing data often uses an alternative name for movies.

For ratings.list, unless I am missing something, the top 250 information is not available in the new format, which I use to highlight those movies in the TV listings as well as to track my progress in watching those movies.

I use language.list to show the original language of movies as opposed to the language being broadcast.
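
As a rough illustration of the alternative-title matching described above, here is a minimal sketch. It assumes the AKA data has already been reduced to (identifier, alternative title) pairs; the function names, identifiers and titles are made up for the example, and real listings would likely need fuzzier matching than this exact, case-insensitive lookup.

```python
def normalize(title):
    """Lower-case a title and collapse runs of whitespace so near-identical spellings compare equal."""
    return " ".join(title.casefold().split())


def build_aka_index(aka_rows):
    """Map each normalized alternative title to the set of identifiers that use it."""
    index = {}
    for title_id, aka_title in aka_rows:
        index.setdefault(normalize(aka_title), set()).add(title_id)
    return index


def match_listing(index, listing_title):
    """Return the sorted identifiers whose alternative titles match a TV listing title."""
    return sorted(index.get(normalize(listing_title), set()))


# Hypothetical sample data, for illustration only.
AKA_ROWS = [
    ("tt0000001", "The Movie"),
    ("tt0000001", "La Película"),
    ("tt0000002", "the  movie"),
]
```

With this data, `match_listing(build_aka_index(AKA_ROWS), "THE MOVIE")` returns both identifiers, which is the ambiguous case an application still has to resolve (for example by year or channel).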

2 Messages

 • 

174 Points

4 y ago

For my personal TV application the aka-titles.list is really a requirement. Consider that outside the USA you really need to use the translated titles, as the English and original titles are just unknown to most people.

The keyword.list data is also required, as it allows movies to be categorized.

Also, please continue to share the data files via FTP/HTTP. Most people won't start using and paying for S3 for their personal projects just to obtain such data.

You are just forcing people to start scraping imdb.com to obtain the same information that before was easily and freely obtainable from FTP/HTTP.

16 Messages

 • 

792 Points

4 y ago

It is sad to see the end of what Col Needham started.

I wonder what he would think of this direction for his project; it certainly flies in the face of the entire nature of IMDb's history and values.

Those who continue to need this sort of access might want to check out https://www.themoviedb.org

1 Message

 • 

220 Points

4 y ago

This seems to negate the bidirectional, symbiotic nature of IMDb. It seems to purposefully ignore that IMDb wouldn't be the "same" if it weren't for both the contributors and for ... IMDb itself. One without the other(s) and the IMDb data (what "IMDb" is at its core) of today would be of many times "poorer" quality. It is true that there is a fundamental asymmetry in how IMDb works, but that has never been the problem, more like its strength: the contributors who contribute the most certainly aren't the ones who profit from or consume the most data; contributors give time, IMDb time and all the (necessary) rest.

This "move" from "IMDb" seems unnecessary because there are other models that would work in reducing operational costs and data consumption greatly without the need to subject the bona fide data consumers to this convoluted, Rube Goldberg-esque, nonsensical process: the first that comes to mind, without putting too much thought into it, and with trivial impact (for both interested parties), would be to maintain the FTP process but, instead of being anonymous, tie it to a previously assigned user:password (maybe the same as the IMDb login credentials?) and to a specific time-based quota (X files/bytes could be downloaded daily/monthly/yearly/whatever per FTP/IMDb user). That would separate the "abusers" from the "little guy" (it's seemingly ironic how the ones that would be most damaged by the "new" process would be the ones that "consume" the least and, probably, contribute the most...; this "move" equates to shooting birds with a cannon, not the right tool for the job: if the aim is to deter and fight abuse [I infer, as no "plausible" motive was forwarded, the other alternative being some misguided corporate bottom-line measure that simply didn't account for all the factors in their equation -- meaning, there's more to be lost by possibly alienating contributors than to be gained] why does it "burn" all the rest -- incidental users: the vast majority, I surmise -- at the same time? So unfair and a whole lot of "bad-Karma points"...)

Plus, this "move" from "IMDb" seems also shortsighted because it fails to understand the fundamental reason why IMDb is, well, IMDb. It wouldn't be the same quality-wise (meaning data quality) and quantity-wise without the army(s) of busy bees (AKA contributors) constantly mending/tending/pruning/trimming/grafting, 24/7, every square inch of their "digital property". Try to convert that (if you can) to the army(s) of employees that would have to be paid (plus benefits, plus ...) to work fewer hours and with lesser effectiveness (a labor of love isn't really labor, and it can't really be matched). That would be the scenario of having IMDb without happy/motivated contributors (which is where "moves" just like this one would lead, eventually, down the road).

And if IMDb considers that this isn't a problem, because it's "too big to fail" right now and can do as it pleases just ... because, I'll leave you with this 'food for thought': if (hopefully not) for some unfortunate reason IMDb were to cease to exist today, and all the data were lost, a new IMDb would rise tomorrow (let's call it, for argument's sake, NewIMDb), the (existing) contributors would flock to it and in a short amount of time something resembling (Old)IMDb would be there, standing proud and tall -- (Old)IMDb nothing but a fading memory... If the exact opposite were to happen (meaning all the contributors stopped, for some unwanted, unforeseen reason, supporting, visiting and contributing every minute to the collective hive that is IMDb), the same (i.e., IMDb) would slowly -- but surely -- wither and "die" (vanish).

The longevity (and quality) of IMDb is due to its openness and two-way street nature, not in spite of it.

Final disclaimer: I love IMDb and want it to outlive the planet itself, but I couldn't just watch it do this to itself and stand idle. I also know this isn't a "conversation" (a conversation doesn't start: "we are doing this, what do you think?"; a conversation starts with: "what do you think of the possibility of changing from this to that?") but I hope smarter (smartest) heads will prevail. For IMDb's sake and success. Including in that all of IMDb's ecosystem. Thank you.



PS: this is what I would also have written if, all of a sudden, Wikipedia started erecting some kind of fences/barriers to some type of its content. NewWikipedia would be fast in its place, without missing a beat...

3 Messages

 • 

114 Points

4 y ago

This reply was created from a merged topic originally titled Alternative interfaces old format is not available anymore?.

Hi guys!

I sometimes used to download files like actors.list, movies.list and so on from Alternative Interfaces. I used them mostly for machine learning educational purposes. But a couple of weeks ago I discovered that these files are not available anymore. Only an interface connecting to an AWS database is offered. That is a paid service, and honestly I'm not very familiar with all this AWS stuff. Just plain text files were more than enough for me.

Is there any way to get up-to-date versions of these files, or are they deprecated and gone forever?

15 Messages

 • 

820 Points

4 y ago

We greatly appreciate our customers sharing their use cases and concerns with the new IMDb Datasets. After careful consideration, we will be adding TitleAKAs to the datasets in S3. This will be available in the coming weeks in the S3 bucket: imdb-datasets. We will update this thread and the www.imdb.com/interfaces page once we have more details.

Thank you for your continued support.

3 Messages

 • 

114 Points

4 y ago

Thank you so much!

3 Messages

 • 

304 Points

4 y ago

I am horrendously disappointed by this change. I actually legitimately cried when my IMDB database updater threw a 404.
I get that you guys want money and to promote S3, but it's kind of a dirty tactic to force me to pay for something when I have spent almost two years developing and creating my own little in-house 'Netflix' service for my personal collection of movies and TV, around all of the data that you guys so graciously provided us with before.
My system was already set up to prevent strain on bandwidth on your end: only once a month would it check for updates and then re-download updates/diffs, because I use it all...
I use the titles and all the akas; keywords/taglines/writers/quotes/ratings-reasons/locations/genres/composers/actresses/actors/directors/countries/distributors are all intricately tied into my searching; running-times, ratings, release dates, and various other files are all meshed into a distributed MySQL database allowing me near real-time data draw.
To know I am going to HAVE to re-build, re-structure, and re-engineer the entire way my system is built around your data truly means I will probably have to scrap and abandon the entire project.
Albeit it was just for my own personal fun it saddens me to see you guys make a change that breaks two years of my fun time work.

1 Message

 • 

62 Points

4 y ago

Hello,

Will the distributors data still be available through S3?

2 Messages

 • 

110 Points

4 y ago

Hi,

May I ask that you distribute the locations.list file please? I'm using it to get localized movie information of places around the world. 

Also, an example using the AWS CLI would be much appreciated. I'm having issues accessing the new endpoint. Am I doing something wrong?

aws s3 cp s3://imdb-datasets/documents/v1/current/name.basics.tsv.gz name.basics.tsv.gz
A client error (403) occurred when calling the HeadObject operation: Forbidden
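
One possible cause, offered as an assumption rather than something confirmed in this thread: if the imdb-datasets bucket is configured as Requester Pays, S3 rejects requests that do not acknowledge the charge, and the AWS CLI reports that as a 403. The sketch below only builds the same `aws s3 cp` command with the CLI's `--request-payer requester` option added; `build_cp_command` is a hypothetical helper, and valid AWS credentials are still required to run the resulting command.

```python
import shlex


def build_cp_command(key, dest, bucket="imdb-datasets"):
    """Return the argument list for an `aws s3 cp` call that acknowledges requester charges."""
    return [
        "aws", "s3", "cp",
        f"s3://{bucket}/{key}",
        dest,
        # Assumption: the bucket is Requester Pays, so the request must opt in to paying.
        "--request-payer", "requester",
    ]


if __name__ == "__main__":
    command = build_cp_command(
        "documents/v1/current/name.basics.tsv.gz",
        "name.basics.tsv.gz",
    )
    # Print the command for copy/paste; pass the list to subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```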

1 Message

 • 

160 Points

4 y ago

Have you considered also making the new files available via FTP on a less frequent basis?

For example, you can have daily updates on S3 and monthly updates on FTP.

This way you can both promote S3 and avoid disrupting personal and open source projects. S3 is just not an option for a lot of us.

Having keywords info would also be very good.

Thanks

8K Messages

 • 

183.5K Points

4 y ago

This does not improve anything for bulk data users, and thus far, it serves to discontinue the availability of certain information, as pointed out earlier by others submitting remarks to this GS topic.

Back in 2014 or so, Netflix (which uses S3) discontinued public access to its API altogether and even limited existing access to a select portion of established users. We shouldn't even be surprised by this move. Things like this seem to be a theme of the 2010s. (For that, Amazon does not deserve all of the blame.)

Oh, and if there is to be the remark "don't knock it until you've tried it", I'm none too pleased that I have to present credit card information or any information about my identity and "billing address" (often not separate from mailing address) at all in order to "try" it, even with the time-limited free access (as newcomers may not necessarily be ready to make the best time-limited use of all the things that Amazon Web Services offer). I'm sure countless other users feel the same way as I do. Hopefully, there will be some S3 users who will broker out the information stored in IMDb's "data sets" S3 bucket, on agreeable terms, and hopefully diff files will be made available (to spare S3 and its Internet Service Providers of their bandwidth burdens).

3 Messages

 • 

246 Points

4 y ago

This is a braindead idea. From someone who doesn’t understand (or doesn’t want to understand) why IMDb came to be the number one site for movie & TV data.

I’ll explain.

Countless Contributors offered their time, for free (and time is more valuable than money, once it’s gone, it’s gone, one cannot recover it, unlike money), during thousands of hours, spanning decades to make sure IMDb’s data was correct and up-to-date. Imagine how many employees, man-hours, health benefits and all the rest that IMDb saved since the start... Just imagine...

Then, once the data is there -- good, verified, and perfect -- they sell it. Multiple times. Many, many times. More and more as time goes by. There are many ways they can profit from it (direct and indirect). Even when the access is free there are valuable tie-ins (you can buy the movie from Amazon for instance, movie theaters, merchandising) where money can be made. All of this is fine. Really. It’s a business. It is data, and so it can be “resold” (because it is better than the data other companies can offer) over and over, until infinity. It’s a good consequence of being just data and not something physical (that you cannot sell more than once).

Summarizing, until now: many, many people give away their time, for free, to make good data for IMDb, which in turn makes ample use of it, as many times as it wants, to make money.

Now picture this. You know those people who give away their time to make something for you, for free, that you can make money from later, as many times as you want? Here’s an idea, why not make them pay too? Right? Neat idea, isn’t it? They gave their time away so they should give away their money as well, right? Right.

A lousy analogy would be an airline making the pilots and crew buy tickets to be able to board the plane they are supposed to fly to its destination. They make the same trip, right? Why shouldn’t they pay, right? It makes perfect sense, right? Right.

The point is: it’s not that they will lose money if they don’t charge Contributors: that can be done in many, many other ways. It’s not that they need to do this. It’s that they want to do this.

I want to leave here some rhetorical questions that boggle the mind.

1 – Why should a Contributor (any Contributor) keep contributing after this? Why should anyone want to? Why should they not contribute their time where it is really appreciated?

2 – Will you, from now on, start to pay the Contributors for their contribution? Because every coin has 2 sides. Because you cannot have it both ways, you cannot have your cake and eat it too: if you charge Contributors it means you say the information (which they provided in the first place) is valuable and so it should be remunerated. Or you don’t charge them for it, because you also got it from them for free. As I said, it can’t be both. Because we all have this thing called little grey cells (borrowing from Poirot here)...
And if you retort: “It’s too difficult to pay, because it’s complicated accounting, because... “ and so on and so on... I’ll give you back your own “solution”. The “unnatural” solution you are trying to force-feed to people right now, but this one actually makes perfect sense. Also a simple solution. For every, say, 10 contributions, give each Contributor an Amazon Gift Card with a token value of, say, 1 Dollar/Euro/Pound. There. No need to thank me.

3 – Why did you think you could do this and everyone would be fine with it? Why do you think it’s okay to ask for money from the ones that help you most, and that they would be fine with it and everything would be fine afterwards? And don’t give me the “added value” line because it doesn’t pass the smell test: you cannot add value if you take away loads and loads of data. Just can’t.

4 – Why is this so rushed and quiet and through the summer (I bet many aren’t even aware of this)?

As I said, rhetorical questions.

Braindead, I say.


4 Messages

 • 

310 Points

4 y ago

I have no issue paying to download IMDb datasets for my own personal use.

What I do have a problem with is moving to a paid model that does not support the same 40+, frequently updated, datasets that have been provided for free for so many years. I think IMDb need to re-address this decision as it will affect people like myself. 

I download and insert all datasets into a MySQL database. This has allowed me to develop specific applications and just data mine (discovering new uses for the data). I've taught myself SQL from data-mining IMDb data.
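
To make the workflow above concrete, here is a minimal sketch of loading rows into a SQL table and querying them. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the `titles` table, its columns and the sample rows are hypothetical rather than the actual dataset schema.

```python
import sqlite3


def load_rows(conn, rows):
    """Create the (hypothetical) titles table if needed and upsert (title_id, title) pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS titles (title_id TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR REPLACE lets a refreshed file overwrite rows loaded earlier.
    conn.executemany(
        "INSERT OR REPLACE INTO titles (title_id, title) VALUES (?, ?)", rows
    )
    conn.commit()


def titles_matching(conn, fragment):
    """Return (title_id, title) rows whose title contains the given fragment."""
    cursor = conn.execute(
        "SELECT title_id, title FROM titles WHERE title LIKE ? ORDER BY title_id",
        (f"%{fragment}%",),
    )
    return cursor.fetchall()


def demo():
    """Load two made-up rows into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    load_rows(conn, [("tt0000001", "Alpha"), ("tt0000002", "Beta Film")])
    return titles_matching(conn, "Film")
```

Parameterized queries (`?` placeholders) keep titles containing quotes or other special characters from breaking the SQL.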