Thu, Jul 16, 2020 3:29 PM

Tons of data corruption at Box Office Mojo

There appears to be mass corruption in the data set at Box Office Mojo. I am part of the editing team at Wikipedia and have been emailing the BOM team for several months now; while they are fairly responsive and fix the data within a few days, new errors keep appearing. It is so bad that the data set is becoming unusable. The errors are too numerous to list, so I will just list the three main types with several examples:

1) In the case of films that have had re-releases, many lifetime grosses count part of the original gross twice. One example is Black Panther (https://www.boxofficemojo.com/title/tt1825683/). Its lifetime gross is now listed at over $2 billion because it is counting the original US gross twice: once in the original release figure and again in the 2020 reissue figure. The 2012 and 2020 reissue grosses for The Dark Knight are also double counting some of the 2008 grosses: https://www.boxofficemojo.com/title/tt0468569/. Other examples can be found at https://en.wikipedia.org/wiki/Talk:List_of_highest-grossing_films#Rereleased_films'_gross. This problem is also affecting Harry Potter, Lord of the Rings and many other films that have had re-releases.

2) The second type of error is films listed under the wrong year. For example, the 2011 chart (https://www.boxofficemojo.com/year/world/2011/) has Hotel Transylvania listed at #19 despite the fact it was released in 2012. Likewise, Mission: Impossible III is listed on the 2005 chart (https://www.boxofficemojo.com/year/world/2005/) despite being released in 2006. Dead Poets Society is listed on the 1988 chart (https://www.boxofficemojo.com/year/world/1988/) despite coming out in 1989. There are dozens more cases.

3) The third type of error is corrupted weekend grosses. An example of this would be the 10th weekend gross for Like Water for Chocolate, which BOM has in 1st place with a weekend gross of $19 million (https://www.boxofficemojo.com/weekend/1993W17/). Funnily enough, The Numbers has Indecent Proposal at #1: https://www.the-numbers.com/box-office-chart/weekend/1993/04/23. In fact, just compare your chart for Memorial Day weekend 1993 (https://www.boxofficemojo.com/weekend/1993W22/occasion/us_memorialday_weekend/) to the one at The Numbers (https://www.the-numbers.com/box-office-chart/weekend/1993/05/28). From #7 onwards they are basically different lists. And take it from me, The Numbers is correct for this particular weekend. This is a particularly grievous problem. Nearly every historical weekend chart on Box Office Mojo is corrupted in some way. The problem here seems to be algorithmic. Taking Sommersby (https://www.boxofficemojo.com/release/rl2960164353/weekend/) as an example, there seems to be some kind of algorithm that "fixes" the final logged weekend gross so that it matches the total gross, e.g. the December 31, 2003 weekend has Sommersby grossing $7,494,058; this is not correct, but the figure represents the difference between the previous total of $42,120,277 and the lifetime total of $50,081,992. This seems to occur on every historical weekend chart for each last "logged" weekend. It is quite likely the film continued playing for some time after this weekend, but BOM simply doesn't have those weekends logged.
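To make the pattern in point 3 concrete, here is a minimal sketch of how one might flag it in a release's weekend history. It is only an illustration: the `WeekendRow` type, the function name and the figures are hypothetical, and it is not a description of BOM's actual code. It simply checks whether the last logged weekend gross equals the lifetime total minus the running total from the previous logged weekend.

```python
from dataclasses import dataclass


@dataclass
class WeekendRow:
    weekend_gross: int  # gross reported for the weekend itself
    gross_to_date: int  # running total reported as of that weekend


def final_weekend_looks_plugged(rows, lifetime_total, tolerance=0):
    """Return True if the last logged weekend gross equals the gap between
    the lifetime total and the previous weekend's running total."""
    if len(rows) < 2:
        return False  # nothing to compare against
    previous, last = rows[-2], rows[-1]
    remainder = lifetime_total - previous.gross_to_date
    return abs(last.weekend_gross - remainder) <= tolerance


# Hypothetical figures, not taken from any real chart.
suspicious = [
    WeekendRow(1_000_000, 8_000_000),
    WeekendRow(600_000, 9_000_000),
    WeekendRow(3_000_000, 12_000_000),  # jumps to absorb the whole remainder
]
plausible = [
    WeekendRow(1_000_000, 8_000_000),
    WeekendRow(600_000, 9_000_000),
    WeekendRow(400_000, 9_500_000),
]

print(final_weekend_looks_plugged(suspicious, lifetime_total=12_000_000))
print(final_weekend_looks_plugged(plausible, lifetime_total=12_000_000))
```

A check like this only flags candidates; a flagged weekend would still need to be compared against an independent source such as The Numbers.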

Again, I must stress these are only examples and the number of errors is vast. The corruption has been the subject of much discussion at Wikipedia and we are on the brink of proscribing Box Office Mojo as a reliable source; see: 
This is not a criticism of the BOM team per se; generally they are helpful and responsive and fix errors when they are pointed out, but the errors are so systemic it feels a bit like "whack-a-mole". The whole data set appears to be corrupted. Since you now charge people to access this data, IMDb has a duty to undertake some form of fact-checking and ensure it is correct. At the moment BOM data is not fit for purpose. For the record, I think Box Office Mojo is a wonderful resource, and that is the reason I am raising the issues here.

Best regards,

Betty Logan (on behalf of the Film project at Wikipedia)

Responses

Employee

3 months ago

Thanks for bringing this to our attention. We see the issues and are tackling this as an urgent priority.

Employee

3 months ago

Hi Betty -

Thanks again so much for bringing this issue to our attention. As a result of your detailed report, we were able to identify an error in our code, which our engineers are in the process of correcting. Once we have tested the change, we will deploy it to the site and update the data.

Employee

3 months ago

We have now made corrections for all the individual issues raised in points #1 and #2 of the original post, and have begun an investigation to find other potential cases of similar problems. We have also deployed a code change which should address the widespread issues reported in point #3. We’re grateful to you for bringing these issues to our attention, and welcome further reports of any other issues you might find.

2 Messages

 • 

156 Points

2 months ago

We have now established an error logging page at Wikipedia which catalogs grosses that have been double-counted. You can view it here: 

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Film/Film_finance_task_force#Box_Office_Mojo

We will continually update the list as we find errors. If you could have somebody check the list every couple of weeks it would be greatly appreciated.


Employee

2 months ago

Thank you for this list, Betty! We have addressed all the issues that it currently shows.

Note that in some cases we have reassigned releases into different release groups, which means that they will no longer appear in the pages linked from that table (and in some cases, those links no longer work). To verify the fixes it's best to look at the overall picture from a title page.

The cause of these problems is that distributors are inconsistent when reporting data for re-releases; sometimes they start counting gross-to-date from zero, and sometimes they start reporting it from the last known value, even if it was decades ago. Sometimes they use the re-release date and sometimes they use the original release date, and what they choose to do varies by distributor/area/week. Box Office Mojo keeps track of grosses/GTD starting from zero for each individual re-release in each area, which many distributors don't do themselves, so it's often a case of us having to deduce concrete figures via heuristics based on limited data, intent, and history.
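As a rough illustration of the kind of deduction described above (not Box Office Mojo's actual heuristics), the sketch below rebases a distributor's reported gross-to-date series for a re-release so that it starts from zero. The function name, parameters and figures are all hypothetical, and the rule is deliberately naive: it assumes that a first reported value at or above the prior lifetime total means the distributor carried on counting from the last known value.

```python
def normalize_rerelease_gtd(reported_gtd, prior_lifetime_total):
    """Rebase a re-release's reported gross-to-date series to start from zero.

    Naive assumption: if the first reported value is at least the prior
    lifetime total, the distributor continued from the old running total,
    so that total is subtracted from every value. Otherwise the series is
    taken to have started from zero and is returned unchanged.
    """
    if not reported_gtd:
        return []
    continued = reported_gtd[0] >= prior_lifetime_total
    offset = prior_lifetime_total if continued else 0
    return [value - offset for value in reported_gtd]


# Hypothetical figures for one film in one area.
prior_total = 50_000_000

# Distributor A carries on from the old lifetime total.
print(normalize_rerelease_gtd([50_200_000, 50_350_000], prior_total))

# Distributor B starts the re-release from zero.
print(normalize_rerelease_gtd([200_000, 350_000], prior_total))
```

Both calls yield the same re-release series, which is the point of the rebasing. A rule this simple would misfire if a re-release starting from zero out-grossed the original run in its first report, which is one reason real cases need more context than the numbers alone.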

The situation is acute at the moment because with COVID-19 there's a glut of re-releases around the world dominating the charts, and the industry is scrambling with ad-hoc auditing and reporting procedures. We've got extra processes and reports in place now and will keep an eye on the Wikipedia page for future cases that you find, but until the industry stabilizes we're likely to see more new cases on top of the internal backlog we're working our way through. Thanks again!