sv_6654070's profile

15 Messages

 • 

820 Points

Thu, Jun 29, 2017 8:49 PM

IMDb Data – Now available in Amazon S3

This is an announcement for customers of the IMDb bulk data available via FTP.

We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.

For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.


In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:


  • Data refresh frequency is now daily (previously weekly).
  • IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
  • The files are in tab separated values (TSV) format.
  • The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
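For readers moving parsers over from the .list files, here is a minimal sketch of reading one of the gzip-compressed TSV files row by row. It assumes the file starts with a header row naming its columns; the helper name `read_tsv_gz` is hypothetical, and the actual column names should be taken from www.imdb.com/interfaces.

```python
import csv
import gzip
from typing import Dict, Iterator


def read_tsv_gz(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row of a gzip-compressed TSV file.

    Assumption: the first line of the file is a header row with column names.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text, so fields are
        # split on tabs only and titles containing quotes survive intact.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Because the function is a generator, a large file can be streamed into a database without loading it into memory all at once.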

As part of housekeeping the FTP site, the data files will no longer be updated. The list data files will continue to be available at two locations (see below) until February 28, 2017. We strongly encourage FTP site users to switch to the S3 solution as soon as possible to ensure their applications continue to work without interruption.

ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata

ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata

 If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license.  The license grants you access to our content via an XML web service, plus the right to use the content in your product or service.  If that interests you, please email licensing@imdb.com.

 If you have any questions or concerns, please share your feedback in this thread.

 Thank you for your continued support.

Responses

15 Messages

 • 

820 Points

4 y ago

Thank you for your feedback regarding the launch of IMDb datasets in S3.

We will be sure to take all the feedback into consideration moving forward as we consider further updates to this product offering.

16 Messages

 • 

792 Points

So in English, "we are going ahead with our plans to cut access, despite the negative feedback" and "Nothing changes in the short term, data will be cut off"

Do I have that right?

2 Messages

 • 

170 Points

4 y ago

For my use case, which is a personal TV listings application, I am missing aka-titles.list, ratings.list and language.list.

I use aka-titles.list to match TV listing data to IMDB info, because the TV listing data often uses an alternative name for movies.

For ratings.list, unless I am missing something, the top 250 information is not available in the new format, which I use to highlight those movies in the TV listings as well as to track my progress in watching those movies.

I use language.list to show the original language of movies as opposed to the language being broadcast.
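The alternative-title matching described in this post could be sketched roughly as follows. This is only an illustration: it assumes (title identifier, alternative title) pairs have already been extracted from whatever AKA data is available, and the function names and the normalization rules are hypothetical choices, not anything IMDb prescribes.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalnum() else " " for ch in without_accents.casefold()
    )
    return " ".join(cleaned.split())


def build_aka_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each normalized alternative title to the title ids that use it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title_id, aka in pairs:
        index[normalize_title(aka)].add(title_id)
    return dict(index)


def match_listing(listing_title: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Return candidate title ids for a TV listing title (may be empty)."""
    return index.get(normalize_title(listing_title), set())
```

A lookup can return several identifiers when different titles share a name, so a real application would probably narrow the candidates further, for example by release year.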

2 Messages

 • 

174 Points

4 y ago

For my personal TV application the aka-titles.list is really a requirement. Consider that outside the USA you really need to use the translated titles, as the English and original titles are just unknown to most people.

The keyword.list data is also required, as it allows movies to be categorized.

Also please continue to share the data files over FTP/HTTP. Most people won't start using and paying for S3 for their personal projects just to obtain such data.

You are just forcing people to start scraping imdb.com to obtain the same information that before was easily and freely obtainable from FTP/HTTP.

16 Messages

 • 

792 Points

4 y ago

It is sad to see the end of what Col Needham started.

I wonder what he would think of this direction for his project; it certainly flies in the face of the entire nature of IMDB's history and values.

Those that continue to need this sort of access might want to check out https://www.themoviedb.org

2 Messages

 • 

174 Points

Even more frustrating is that the reason for doing this is not understandable.

Really, why force people to use S3 instead of a normal FTP/HTTP download? And why remove so much useful info?

It would be interesting to know if Col Needham is really aware of this...

2.4K Messages

 • 

81.2K Points

I am afraid so, as he replied to a post above...

16 Messages

 • 

792 Points

I missed that, thanks. I'll note that he didn't really say anything supporting the move, mostly just how to do something.  That he didn't announce this, and hasn't voiced any support for it to date, speaks volumes.  
After so long it seems a shame to move to another service; IMDB used to be synonymous with the best source of movie data.  Guess it's now a marketing thing.  
What a waste of decades of people contributing to make it better, and now it's something to sell Amazon goods and services.

1 Message

 • 

220 Points

4 y ago

This seems to negate the bidirectional, symbiotic nature of IMDb. It seems to purposefully ignore that IMDb wouldn't be the "same" if it weren't for both the contributors and for ... IMDb itself.  One without the other(s) and the IMDb data (what "IMDb" is at its core) of today would be many times "poorer" in quality. It is true that there is a fundamental asymmetry in how IMDb works, but that has never been the problem, more like its strength: the contributors who contribute the most certainly aren't the ones who benefit from or consume the most data; contributors give time, IMDb gives time and all the (necessary) rest.

This "move" from "IMDb" seems unnecessary because there are other models that would work in greatly reducing operational costs and data consumption without the need to subject the bona fide data consumers to this convoluted, Rube Goldberg-esque, nonsensical process. The first that comes to mind, without putting too much thought into it, and with trivial impact (for both interested parties), would be to maintain the FTP process but, instead of being anonymous, tie it to a previously assigned user:password (maybe the same as the IMDb login credentials?) and to a specific time-based quota (X files/bytes could be downloaded daily/monthly/yearly/whatever per FTP/IMDb user). That would separate the "abusers" from the "little guy" (it's ironic how the ones that would be most damaged by the "new" process would be the ones that "consume" the least and, probably, contribute the most...; this "move" equates to shooting birds with a cannon, not the right tool for the job: if the aim is to deter and fight abuse [I infer, as no "plausible" motive was forwarded, the other alternative being some misguided corporate bottom-line measure that simply didn't account for all the factors in their equation -- meaning, there's more to be lost by possibly alienating contributors than to be gained] why does it "burn" all the rest -- incidental users: the vast majority, I surmise -- at the same time? So unfair and a whole lot of "bad-Karma points"...)

Plus, this "move" from "IMDb" seems also shortsighted because it fails to understand the fundamental reason why IMDb is, well, IMDb. It wouldn't be the same quality-wise (meaning data quality) and quantity-wise without the army(s) of busy bees (AKA contributors) constantly mending/tending/pruning/trimming/grafting, 24/7, every square inch of their "digital property". Try to convert that (if you can) to the army(s) of employees that would have to be paid (plus benefits, plus ...) to work fewer hours and with lesser effectiveness (a labor of love isn't really labor, and it can't really be matched). That would be the scenario of having IMDb without happy/motivated contributors (which is where "moves" just like this one would lead, eventually, down the road).

And if IMDb considers that this isn't a problem, because it's "too big to fail" right now and can do as it pleases just ... because, I'll leave you with this 'food for thought': if (hopefully not) for some unfortunate reason IMDb would cease to exist today, and all the data would be lost, a new IMDb would rise tomorrow (let's call it, for argument's sake, NewIMDb), the (existing) contributors would flock over to it and in a short amount of time something resembling (Old)IMDb would be there, standing proud and tall -- (Old)IMDb nothing but a fading memory... If the exact opposite were to happen (meaning, all the contributors ceased, for some unwanted, unforeseen reason, supporting, visiting and contributing every minute to the collective hive that is IMDb), the same (i.e., IMDb) would slowly -- but surely -- wither and "die" (vanish).

The longevity (and quality) of IMDb is due to its openness and two-way street nature, not in spite of it.

Final disclaimer: I love IMDb and want it to outlive the planet itself, but I couldn't just watch it do this to itself and stand idle. I also know this isn't a "conversation" (a conversation doesn't start: "we are doing this, what do you think?"; a conversation starts with: "what do you think of the possibility of changing from this to that?") but I hope smarter (smartest) heads will prevail. For IMDb's sake and success. Including in that all of IMDb's ecosystem. Thank you.



PS: this is what I would also have written if, all of a sudden, Wikipedia started erecting some kind of fences/barriers to some type of its content. NewWikipedia would be fast in its place, without missing a beat...

6 Messages

 • 

278 Points

The main difference between IMDb and other user-contributed data sources like Wikipedia is that IMDb has a large userbase of industry professionals who are required to update their projects and remove false information that crops up, at least for new and upcoming releases.  So any competitor to IMDb would have to find a replacement for that data.

I have a big investment in code to parse the current .list files, so like many people, I would prefer that the format remain as it is.  I'd even be willing to pay an amount proportional to my personal, non-commercial use to access the data - perhaps $40 per year for weekly updates via FTP.

2.4K Messages

 • 

81.2K Points

Two reactions to this:
1. I find the new data sets are much easier to integrate since they are built around the title and name codes (tt0000000 and nm0000000) as stable keys, whereas in the previous sets the IMDb title was the (very heavy) key, and moreover subject to change over time, typo corrections, language change... (And I have also heavily invested myself in parsing scripts for the former .lists)

2. The role of IMDB for US industry professionals has narrowed the initial universal scope of the database, which now only sticks to credits (for what it is worth...). E.g. the uncredited actors, the fired directors, etc. have been removed and are only cited in trivia. This leaves room for less industry-driven competitors
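To illustrate the first point about stable keys, here is a small sketch of joining two sets of rows on a shared identifier column instead of on the title string. The helper name `join_on_id` is hypothetical, and the column name passed as `key` is an assumption about how the identifier column is labelled in the files.

```python
from typing import Dict, Iterable, Iterator


def join_on_id(
    left_rows: Iterable[Dict[str, str]],
    right_rows: Iterable[Dict[str, str]],
    key: str,
) -> Iterator[Dict[str, str]]:
    """Inner-join two row collections on a shared identifier column."""
    # Index the right-hand rows by identifier for constant-time lookups.
    right_by_id = {row[key]: row for row in right_rows}
    for row in left_rows:
        match = right_by_id.get(row[key])
        if match is not None:
            # Merge the two rows; right-hand values win on a name clash.
            yield {**row, **match}
```

Since the join never looks at the title text, a corrected typo or a changed display language in a title does not break the link between files.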

3 Messages

 • 

114 Points

4 y ago

This reply was created from a merged topic originally titled Alternative interfaces old format is not available anymore?.

Hi guys!

I used to occasionally download files like actors.list, movies.list and so on from Alternative Interfaces. I used them mostly for machine learning educational purposes. But a couple of weeks ago I discovered that these files are not available anymore. Only an interface that connects to an AWS database is offered, which is paid, and honestly speaking I'm not very familiar with all this AWS stuff. Plain text files were more than enough for me.

Is there any way to get these files up to date, or are they deprecated and gone forever?

15 Messages

 • 

820 Points

The .list files continue to be available for download and use from the FTP sites. But they will be retired on September 10th. We strongly encourage customers to switch to the S3 solution for uninterrupted access.

Here is the link to the FTP sites:
ftp://ftp.funet.fi/pub/mirrors/ftp.im...
ftp://ftp.fu-berlin.de/pub/misc/movie...

15 Messages

 • 

820 Points

4 y ago

We greatly appreciate our customers sharing their use cases and concerns with the new IMDb Datasets. After careful consideration, we will be adding TitleAKAs to the datasets in S3. This will be available in the coming weeks in the S3 bucket: imdb-datasets. We will update this thread and the www.imdb.com/interfaces page once we have more details.

Thank you for your continued support.

36 Messages

 • 

1.2K Points

I appreciate that the AKAs were added. I would like it if you can keep adding more data until we have everything that we used to have before. In particular I'd like to have attributes for director roles listed. After that I'd love to have the full list of genres for each film, company credits, production countries, and languages spoken as some of the next items added.

3 Messages

 • 

114 Points

4 y ago

Thank you so much!

3 Messages

 • 

304 Points

4 y ago

I am horrendously disappointed by this change. I actually legitimately cried when my IMDB database updater threw a 404.
I get that you guys want money and want to promote S3, but it's kind of a dirty tactic to force me to pay for something when I have spent almost two years developing and creating my own little in-house 'Netflix' service for my personal collection of movies and TV, around all of the data that you guys so graciously provided us with before.
My system was already set up to prevent strain on bandwidth on your end: only once a month would it check for updates and then re-download updates/diffs, because I use it all...
I use the titles and all the akas; keywords/taglines/writers/quotes/ratings-reasons/locations/genres/composers/actresses/actors/directors/countries/distributors are all intricately tied into my searching; and running-times, ratings, release dates, and various other files are all meshed into a distributed MySQL database allowing me near real-time data draw.
To know I am going to HAVE to re-build, re-structure, and re-engineer the entire way my system is built around your data truly means I will probably have to scrap and abandon the entire project.
Albeit it was just for my own personal fun, it saddens me to see you guys make a change that breaks two years of my fun-time work.

16 Messages

 • 

792 Points

Very well said.
Given that the clients are now paying the cost of transmitting the data, it is totally unclear why imdb is so focused on withholding data.  
This feels non-technical in nature, the use of this alternate data cannot possibly be hurting the revenue model for IMDB, and with "requestor pays" there is no cost to transmit the data.
It is all very curious, and I wish imdb(amazon) would more openly explain why they are doing this.

3 Messages

 • 

304 Points

I agree as well. I am very curious as to what motivation they have to make this change. I mean, it's been that way for 20 years now, so why the sudden change? I get and am totally all for improving/innovation, but to entirely cut people out seems backwards.

1 Message

 • 

62 Points

4 y ago

Hello,

Will the distributors data still be available through S3?

2 Messages

 • 

110 Points

4 y ago

Hi,

May I ask that you distribute the locations.list file please? I'm using it to get localized movie information of places around the world. 

Also, an example using the AWS CLI would be much appreciated. I'm having issues accessing the new endpoint. Am I doing something wrong?

aws s3 cp s3://imdb-datasets/documents/v1/current/name.basics.tsv.gz name.basics.tsv.gz
A client error (403) occurred when calling the HeadObject operation: Forbidden

6 Messages

 • 

212 Points

The files in s3 are configured so that the requester is forced to pay, which requires sending an extra config knob to AWS to acknowledge that you're willing to pay for the request. The `aws` CLI tool doesn't completely support this knob. It's available when using `aws s3 ls` but not `aws s3 cp`. For example, this works (assuming your AWS credentials are set up properly):

aws s3 ls s3://imdb-datasets/documents/v1/current/ --request-payer requester

While `aws s3 cp` doesn't support the requester pays option, the `s3cmd` tool does. So to copy all of the current IMDb datasets, you can do this:

mkdir imdb
s3cmd sync s3://imdb-datasets/documents/v1/current ./imdb --requester-pays

And that's pretty much it!
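For anyone scripting the download rather than using a CLI tool, the same requester-pays acknowledgement can be passed from Python. This is a hedged sketch, not an official recipe: it assumes the third-party `boto3` package is installed and AWS credentials are configured, it takes the bucket name and key prefix from this thread (they may not remain valid), and the helper names are hypothetical.

```python
from typing import Dict

BUCKET = "imdb-datasets"          # bucket name as given in this thread
PREFIX = "documents/v1/current/"  # key prefix as given in this thread


def object_key(file_name: str) -> str:
    """Build the S3 object key for a dataset file under the current prefix."""
    return PREFIX + file_name


def requester_pays_args() -> Dict[str, str]:
    """Extra arguments acknowledging that the requester pays for the transfer."""
    return {"RequestPayer": "requester"}


def download_dataset(file_name: str, destination: str) -> None:
    """Download one dataset file, accepting the requester-pays charge."""
    import boto3  # third-party dependency, imported only when actually used

    s3 = boto3.client("s3")
    s3.download_file(
        BUCKET,
        object_key(file_name),
        destination,
        ExtraArgs=requester_pays_args(),
    )
```

A call such as `download_dataset("name.basics.tsv.gz", "name.basics.tsv.gz")` would then fetch the file from the failing `aws s3 cp` example above. Newer releases of the `aws` CLI may also accept `--request-payer` on `cp`; `aws s3 cp help` shows what an installed version supports.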

16 Messages

 • 

792 Points

Some of the 3rd party ftp clients also do not natively support sending this flag.  I have approached the makers of the market leading Mac client "transmit" to add this feature, but for now it remains unsupported there as well.

6 Messages

 • 

212 Points

Note that this is for s3, which isn't FTP.

16 Messages

 • 

792 Points

Oops, my bad, I should have been clearer. Transmit supports the s3: protocol as well as many others (Google Cloud, OneDrive, etc).  This category of tools originated as FTP clients, but they all support many protocols now :)

2 Messages

 • 

110 Points

Thanks a ton for the help Andrew!

1 Message

 • 

160 Points

4 y ago

Have you considered also making the new files available over FTP on a less frequent basis?

For example, you can have daily updates on S3 and monthly updates on FTP.

In this way you can both promote S3 and, at the same time, not disrupt personal and open source projects. S3 is just not an option for a lot of us.

Having keywords info would also be very good.

Thanks

8.1K Messages

 • 

184.5K Points

4 y ago

This does not improve anything for bulk data users, and thus far, it serves to discontinue the availability of certain information, as pointed out earlier by others submitting remarks to this GS topic.

Back in 2014 or so, Netflix (which uses S3) discontinued public access to its API altogether and even limited existing access to a select portion of established users. We shouldn't even be surprised by this move. Things like this seem to be a theme of the 2010s. (For that, Amazon does not deserve all of the blame.)

Oh, and if there is to be the remark "don't knock it until you've tried it", I'm none too pleased that I have to present credit card information or any information about my identity and "billing address" (often not separate from mailing address) at all in order to "try" it, even with the time-limited free access (as newcomers may not necessarily be ready to make the best time-limited use of all the things that Amazon Web Services offers). I'm sure countless other users feel the same way as I do. Hopefully, there will be some S3 users who will broker out the information stored in IMDb's "data sets" S3 bucket, on agreeable terms, and hopefully diff files will be made available (to spare S3 and its Internet Service Providers their bandwidth burdens).

3 Messages

 • 

246 Points

4 y ago

This is a braindead idea. From someone who doesn’t understand (or doesn’t want to understand) why IMDb came to be the number one site for movie & TV data.

I’ll explain.

Countless Contributors offered their time, for free (and time is more valuable than money, once it’s gone, it’s gone, one cannot recover it, unlike money), during thousands of hours, spanning decades to make sure IMDb’s data was correct and up-to-date. Imagine how many employees, man-hours, health benefits and all the rest that IMDb saved since the start... Just imagine...

Then, once the data is there -- good, verified, and perfect -- they sell it. Multiple times. Many, many times. More and more as time goes by. There are many ways they can profit from it (direct and indirect). Even when the access is free there are valuable tie-ins (you can buy the movie from Amazon for instance, movie theaters, merchandising) where money can be made. All of this is fine. Really. It’s a business. It is data, and it can be “resold” (because it is better than the data other companies can offer) over and over, until infinity. It’s a good consequence of being just data and not something physical (that you cannot sell more than once).

Summarizing, until now: many, many people give away their time, for free, to make good data for IMDb, which in turn makes ample use of it, as many times as it wants, to make money.

Now picture this. You know those people, who give away their time to make something for free for you that you can make money from later, as many times as you want? Here’s an idea, why not make them pay too? Right? Neat idea, isn’t it? They gave their time away so they should give away their money as well, right? Right.

A lousy analogy would be an airline company making the pilots and crew buy tickets to be able to board the plane they are supposed to take to its destination. They make the same trip, right? Why shouldn’t they pay, right? It makes perfect sense, right? Right.

The point is: it’s not that they will lose money if they don’t charge Contributors: that can be done in many, many other ways. It’s not that they need to do this. It’s that they want to do this.

I want to leave here some rhetorical questions that boggle the mind.

1 – Why should a Contributor (any Contributor) keep contributing after this? Why should anyone want to? Why should they not contribute their time where it is really appreciated?

2 – Will you, from now on, start to pay the Contributors for their contribution? Because every coin has 2 sides.   Because you cannot have it both ways, you cannot have your cake and eat it too: if you charge Contributors it means you say the information (which they provided in the first place) is valuable and so it should be remunerated. Or you don’t charge them for it, because you also got it from them for free. As I said, it can’t be both. Because we all have this thing called little grey cells (borrowing from Poirot here)...
And if you retort: “It’s too difficult to pay, because it’s complicated accounting, because...” and so on and so on... I’ll give you back your own “solution”. The “unnatural” solution you are trying to force-feed to people right now, but this one actually makes perfect sense. Also a simple solution. Give each Contributor, for every, say, 10 contributions, an Amazon Gift Card with a token value of, say, 1 Dollar/Euro/Pound. There. No need to thank me.

3 – Why did you think you could do this and everyone would be fine with it? Why do you think it’s okay to ask for money from the ones that help you most and they would be fine with it and everything would be fine afterwards? And don’t give me the “added value” line because it doesn’t pass the smell test: you cannot add value if you take away loads and loads of data. Just can’t.

4 – Why is this so rushed and quiet and through the summer (I bet many aren’t even aware of this)?

As I said, rhetorical questions.

Braindead, I say.


2.4K Messages

 • 

81.2K Points

I wish that Amazon's answer to your questions (which I support, along with your position) is as effective and swift as their customer service... But this is probably the critical point: in a previous answer above, the Amazon "official" speaks of customers, when the people reacting here to this move are contributors, probably as thin as my little finger compared to a Saturday night crowd at the movies... It is all about balance of powers. Considering the weight IMDb has gained in the industry (at least on the US side), I am not sure that all the new data IMDb gets every week come from disinterested contributors like us.

I have another question for Amazon: what would be the cost of maintaining the FTP files, say on a monthly basis instead of a weekly one? I would be ready to pay a (very reasonable) fee to keep access to this data. Especially if it is discounted when I contribute to the data feed and updates, as proposed by Valen above. This would seem fair in your merchandising and monetization of what used to be a fantastic collaborative and open and free project, don't you think?

V.

3 Messages

 • 

246 Points

Actually no. People do not understand that Contributors are the real "secret sauce" behind IMDb’s success. They (Contributors) really sell themselves short. Data will "rot" very easily, if it goes without constant verification and supervision. Think of it as a green lawn. Without proper and constant maintenance it would turn into a "jungle" of sorts in a short amount of time.

Somebody mentioned Wikipedia and it is a bit like that too. Without constant attention Wikipedia would quickly become unusable and worthless. And this change is even more braindead because now more than ever they will need to keep their Contributors happy because data input will grow more and more as time passes. And this is the time that they thought would be a good time to show Contributors (which they'll need more and more of as time goes by) this (symbolic) huge middle finger. As plain common sense, one does not (should not) antagonize the ones that one depends on.

Another way of looking at it (which is also one of my favorites) is that IMDb is the longest running, most successful case of Crowd Funding ever (it's still going at it), anticipating that movement by many many years. And the reason it worked so well and so successfully, until now, is because of 2 things, IMO: 1. Instead of money (which people tend to have more difficulty parting with, curiously enough) it asked "only" for time (which people donate more easily even if it is more valuable); and 2. The "rewards" were also very simple and easily understandable, "you give us your time and we won't cut you off when you want to make use of the data, for your own private pleasure". There’s nothing simpler than that! And so, what do they want to do now? 1. "Give us your time AND your money too"; and 2. "Rewards? No. None of that. Here’s a reduced version of the data we chose for you IF you pay for it, like any other customer. But no rewards.". Genius!

You rightly pointed out that data is more and more coming from other interested parties. But that is only a small part of it. All data has errors, inconsistencies, typos, or it can be just plain wrong. "Dumping" data is just the first, small step. It's the eyes of the thousands (more than that) that little by little turn it into quality data. Elsewhere there was a comparison with bees but I see it more as an ant colony where each ant (Contributor) does just a little bit every time but that is essential for the success of the whole colony. And one ant is dispensable but if you take all the ants from the colony it'll disappear very soon. These thousands and thousands of micro-corrections forwarded by Contributors resulted in the IMDb data of today. And don't get me wrong, there are errors there now. There will always be errors there. Just not the same ones, and their importance and scale will be smaller and smaller with time. Provided happy Contributors still do their work, as always.

If I can misquote from "Soylent Green": "IMDb is people!". In this case, Contributors. That is the reason it has the better (best) data. And you can quote me on that.

4 Messages

 • 

310 Points

4 y ago

I have no issue paying to download IMDb datasets for my own personal use.

What I do have a problem with is moving to a paid model that does not support the same 40+ frequently updated datasets that have been provided for free for so many years. I think IMDb needs to re-address this decision as it will affect people like myself. 

I download and insert all datasets into a MySQL database. This has allowed me to develop specific applications and just data mine (discovering new uses for the data). I've taught myself SQL from data-mining IMDb data.

1 Message

 • 

104 Points

This matches what I do with it, except additionally I am moving toward mining the data for non-profit academic research. I agree the most disappointing thing is the extremely limited data this seems to be turning into.

On a curiosity note, I would love to connect with you and see what you are doing with the data.