Marco's profile

2.7K Messages

 • 

83K Points

Tuesday, May 29th, 2018 2:21 PM

13

Add a process to clean-up empty name pages on IMDb

I just came across the name page for a Grace Anne Cochran (https://www.imdb.com/name/nm5507911/). It's not a new name page, there are no credits listed, there is no other information listed and a check via the Edit Page also didn't show any credits. Therefore, I think this page should be deleted, but maybe there's more than meets my eye.

Champion

 • 

19.4K Messages

 • 

477.1K Points

6 years ago

Marco,

It looks like this name page has been on IMDb since at least 2014. There is one Internet Archive snapshot, which also shows that the page is blank.
https://web.archive.org/web/20170218081242/https://www.imdb.com/name/nm5507911/

2.7K Messages

 • 

83K Points

Thanks for digging around, Dan.

Employee

 • 

7.3K Messages

 • 

179.2K Points

6 years ago

Thanks, we have re-titled this post and converted it into an idea for future consideration. 

This type of clean-up is best handled by an automated process which will fix the issue once and for all. 

If there are other cases, please report them here and it will add weight to the necessity to create this process.  Unfortunately it is simply not scalable to take one-off requests for every empty name, sorry.  For an example of the scale, in 5+ years 36,080 threads have been posted to Get Satisfaction -- we estimate that out of over 8 million names on IMDb, more than 36K are empty (< 0.5% BTW)

2.7K Messages

 • 

83K Points

Thanks for the response Col. If I see any other cases, I'll report them on this thread. Perhaps another contributor can compile a list of (almost) all empty name pages.
we estimate that out of over 8 million names on IMDb, more than 36K are empty (< 0.5% BTW)
Just out of curiosity, how did you get that number?

Employee

 • 

7.3K Messages

 • 

179.2K Points

It's an educated guess taking into account the number of empty names removed the last time such a clean-up was performed, the time elapsed since then, the growth in submission rates, and changes in the back-end technology which affect the frequency of their creation. It could be way out in either direction, but the point about one-off reports not being scalable remains true nonetheless.

The only thing for which I did not account is the variable mass of the whales and the water, see https://www.imdb.com/title/tt0092007/quotes/qt0444227  :-)

2.7K Messages

 • 

83K Points

Thanks for the explanation Col, and I of course understand the one-off reports aren't the best solution for this problem.
Yeah, the variable mass of the whales and the water is always tricky to account for :)

8.5K Messages

 • 

176.2K Points

Hi! Col Needham, Official Rep

Unfortunately it is simply not scalable to take one-off requests for every empty name, sorry

- - -

What about Titles with No credits listed here
a few samples...

https://getsatisfaction.com/imdb/topics/-what-is-a-clip-jk8v1aue6aik
[ What is a 'Clip:' ? ]  ( no credits )

and many titles - "The requested URL was not found on our server"

Posted May 31
by ACT_1

How many of the 8,509,000 Titles are empty ( not found)  pages ?
https://www.imdb.com/title/tt8509000/reference

https://www.imdb.com/name/nm9884600/ Names now

about 266,164,000 messages were posted on the Old Boards Aug 6 2002 ... Feb 20 2017

88,854,000 User now ... well, some are gone with the wind!
https://www.imdb.com/user/ur88854000

https://www.imdb.com/pressroom/stats/
Titles: 4,734,693
People: 8,702,001

Fri Dec 22 2017 - Titles | People
 This data is (temporarily) in the press room
 until we can rebuild the page on the new technology, sorry.
 It will be updated somewhere between monthly and quarterly in the meantime
 Col Needham, Official Rep

Add to top of  Titles and Name Pages ??
Page Started  _____
Page Updated _____


2.7K Messages

 • 

83K Points

6 years ago

Davita van der Velde (https://www.imdb.com/name/nm7936268/) also doesn't have any credits or other information on her page. It's not a new name page either.
The same goes for the (wrongly formatted) name page of Thomas Van der Vorst (https://www.imdb.com/name/nm6027204/).

Champion

 • 

19.4K Messages

 • 

477.1K Points

6 years ago

We could make use of the IMDb Datasets, subsets of IMDb data available for personal and non-commercial use (https://www.imdb.com/interfaces/?ref_=helpms_ih_gi_siteindex), to determine which people in the IMDb database have no roles. Specifically, this data set:
name.basics.tsv.gz – Contains the following information for names:
  • nconst (string) – alphanumeric unique identifier of the name/person
  • primaryName (string) – name by which the person is most often credited
  • birthYear – in YYYY format
  • deathYear – in YYYY format if applicable, else ‘\N’
  • primaryProfession (array of strings) – the top-3 professions of the person
  • knownForTitles (array of tconsts) – titles the person is known for
Any person who does not have any knownForTitles would be a candidate for removal. The list of name pages without credits would still need to be curated to ensure that valid pages are not removed.
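
For anyone who wants to try this without SAS, here is a minimal Python sketch of the filter described above. It assumes the file has been downloaded locally as name.basics.tsv.gz, that it is tab-separated with a header row, and that an empty knownForTitles field is stored as '\N' (the description only states this for deathYear, so treating it as the general "missing" marker is an assumption). The function names and the sample rows are made up for illustration.

```python
import csv
import gzip
import io

MISSING = r"\N"  # assumed placeholder for absent values in the dataset


def names_without_known_for(lines):
    """Yield (nconst, primaryName) for rows whose knownForTitles is empty."""
    # QUOTE_NONE: treat quote characters in names as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["knownForTitles"] in ("", MISSING):
            yield row["nconst"], row["primaryName"]


def candidates_from_file(path="name.basics.tsv.gz"):
    """Stream candidates from the gzipped file without unpacking it to disk."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from names_without_known_for(handle)


# Tiny in-memory sample with made-up rows, so the sketch runs without the real file.
SAMPLE = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm9999901\tExample Person A\t1950\t\\N\tactor\ttt9999901,tt9999902\n"
    "nm9999902\tExample Person B\t\\N\t\\N\t\\N\t\\N\n"
)
sample_candidates = list(names_without_known_for(io.StringIO(SAMPLE)))
print(sample_candidates)
```

Reading the file line by line keeps memory use small, so the size of the unzipped file should not be an obstacle. The output is only a list of candidates and would still need the manual curation mentioned above.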

I suspect that ljdoncel (Champion) could readily produce this report since he has access to SAS (Statistical Analysis System) and computing facilities that can handle the size of the data set. 

Champion

 • 

19.4K Messages

 • 

477.1K Points

Here are the first 50 lines of data from the file:
The unzipped raw data file is huge, 509 MBytes

8.5K Messages

 • 

176.2K Points

https://www.imdb.com/name/nm0002778/ Scott Wetsel...

Old Numbers there

added June 1 or 2, 2018
Dominic Hyam
https://www.imdb.com/name/nm9879000/

Central Tonight (TV Series)
30th May 2018 Late News (2018) ... Himself - Defender, Coventry City


Employee

 • 

7.3K Messages

 • 

179.2K Points

Any person who does not have any knownForTitles would be a candidate for removal.
While they would be a candidate, there are many people who belong in the database yet lack any credits and therefore have no "Known For" data. For example, agents / managers / lawyers etc with clients and similarly other people employed by companies listed in IMDb who are never credited on-screen.

Champion

 • 

1.1K Messages

 • 

51.5K Points

Hi, everybody!
...there are many people who belong in the database yet lack any credits and therefore have no "Known For" data
I agree with Col. Just for the record, from name.basics.tsv.gz (as of 1 Jun 2018):
  • Total names: 8,640,407
       Names without birth or death year listed: 8,212,689 (95.0%)
       Names without any profession listed: 1,508,868 (17.4%)
       Names without any "Known for" title listed: 992,150 (11.5%)
       Names without any profession and "Known for" title: 752,266 (8.7%)
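
Counts like these can be reproduced with a short script. The following is only a sketch under stated assumptions, not the method actually used for the figures above: it assumes a tab-separated file with a header row, treats '\N' or an empty string as "not listed", and reads "without birth or death year" as meaning that both years are missing. The function names, dictionary keys and sample rows are made up for illustration.

```python
import csv
import io

MISSING = r"\N"  # assumed placeholder for absent values


def is_blank(value):
    return value in ("", MISSING)


def summarize(lines):
    """Count names lacking years, profession and/or 'Known for' titles."""
    counts = {
        "total": 0,
        "no_years": 0,
        "no_profession": 0,
        "no_known_for": 0,
        "no_profession_and_known_for": 0,
    }
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        no_prof = is_blank(row["primaryProfession"])
        no_known = is_blank(row["knownForTitles"])
        counts["total"] += 1
        # Assumption: "no years" means neither birthYear nor deathYear is listed.
        counts["no_years"] += is_blank(row["birthYear"]) and is_blank(row["deathYear"])
        counts["no_profession"] += no_prof
        counts["no_known_for"] += no_known
        counts["no_profession_and_known_for"] += no_prof and no_known
    return counts


def percent(part, total):
    """Share of the total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1) if total else 0.0


# Made-up rows so the sketch runs without the real file.
SAMPLE = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm9999901\tExample A\t1950\t\\N\tactor\ttt9999901\n"
    "nm9999902\tExample B\t\\N\t\\N\t\\N\t\\N\n"
    "nm9999903\tExample C\t\\N\t\\N\tproducer\t\\N\n"
    "nm9999904\tExample D\t\\N\t\\N\t\\N\ttt9999901\n"
)
summary = summarize(io.StringIO(SAMPLE))
for key, value in summary.items():
    print(key, value, f"({percent(value, summary['total'])}%)")
```

To run it on the real data, pass in a text-mode handle from gzip.open instead of the in-memory sample.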

Any person who does not have any knownForTitles would be a candidate for removal. The list of name pages without credits would still need to be curated to ensure that valid pages are not removed.
But Dan is also right that this approach has a certain potential for identifying empty pages because (and that's why I've highlighted the word "yet" in Col's statement), most candidates for removal are people added to the database during the last months/years, as can be seen in the following barcode-like graphic (higher nmconsts are more likely to be all-blank entries in the tsv file):
So, curating/exploring candidates with the lowest nmconsts could be a good source of actual empty pages.
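
As a rough sketch of that ordering step in Python: given the identifiers of blank-looking entries, pick the ones with the lowest numbers first. It assumes identifiers have the form "nm" followed by digits and that a lower number means an older page; the helper names are made up, and the sample list simply reuses the three pages reported earlier in this thread.

```python
import heapq


def nconst_number(nconst):
    """Numeric part of an identifier such as 'nm5507911' (assumed 'nm' + digits)."""
    return int(nconst[2:])


def lowest_candidates(candidates, limit=100):
    """Return up to `limit` identifiers with the lowest numbers, in ascending order."""
    return heapq.nsmallest(limit, candidates, key=nconst_number)


# The three name pages reported as empty earlier in this thread.
reported = ["nm7936268", "nm5507911", "nm6027204"]
print(lowest_candidates(reported, limit=2))
```

heapq.nsmallest avoids sorting the whole candidate list when only the first few hundred entries are needed for manual review.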

However, there are some other limiting factors for this method; e.g. Irene Albaladejo is an "empty page" according to the tsv file, but only because "thanks" credits aren't acknowledged as a "profession" (false positive).

(Sorry for not being more specific; I've been very busy recently with work and barely have time for other things. Indeed, I'm trying to finish a report on extreme values in the database and tsv files that involves many calculations and, at best, I can dedicate only 2-3 hours a week... I hope to be a bit less overloaded soon...)


@Dan: My computing facilities is a simple 3-year-old & less-than-900€ PC...

Champion

 • 

19.4K Messages

 • 

477.1K Points

Thank you ljdoncel!
I understand that you are extremely busy. I would much rather have you save lives as a cardiologist.
@Dan: My computing facilities is a simple 3-year-old & less-than-900€ PC...
I stand corrected.
I wish I could afford a license for SAS, SPSS or JMP. I guess I will have to install R and learn how to use R to parse and summarize IMDb's raw data files.

8.5K Messages

 • 

176.2K Points

Dan Dassow, Champion : 

Here are the first 50 lines of data from the file:
- - -

I wish I could afford a license for SAS, SPSS or JMP.
I guess I will have to install R
and learn how to use R to parse and summarize IMDb's raw data files.
by Dan Dassow
- - -


https://en.wikipedia.org/wiki/R_(programming_language)
R is a GNU package.

Looks like something I could read 30 years ago using
https://en.wikipedia.org/wiki/BASIC


June 4, 2018
https://en.wikipedia.org/wiki/Portal:Current_events/2018_June_4
Business and economy
Microsoft announces that it is acquiring code repository GitHub
for US$ 7.5 billion in stock, pending regulatory review

https://en.wikipedia.org/wiki/GitHub
GitHub Inc. is a web-based hosting service
for version control using Git.
It is mostly used for computer code

https://en.wikipedia.org/wiki/Git
Git  is a version control system for tracking changes in computer files
 and coordinating work on those files among multiple people
Git is free and open source software
distributed under the terms of the GNU General Public License version 2.


Champion

 • 

19.4K Messages

 • 

477.1K Points

ACT_1,
I could easily parse the files with BASIC, FORTRAN, C, C++, Java, Pascal, Perl, COBOL, SNOBOL and Excel. Summarizing with these languages would require additional programming. Statistical languages such as SAS, SPSS, JMP and R have embedded means to perform statistics on data sets.

8.5K Messages

 • 

176.2K Points

Dan Dassow, Champion

I know nothing about SAS, SPSS, JMP and R
I have not done any programming in Years
Looking at your Sample
Seems They could use Quotes and Commas format ? (Is this correct?)
"Number","Lastname", "Firstname", "Birthyear", "Deathyear",
"Profession", "Title_1", "Title_2", "Title_3", "Title_4"
may be easier to separate (with last names like 'de Haviland')
EZ to sort by Lastname + Firstname
But that would take up a lot more file space with

https://www.imdb.com/name/nm9886000/ Names now

oh, and .dbf file has fixed length fields : more bulky ?

I did set up a .dbf file for community theatre for Plays and Names 20? years ago
with DOS Alpha_Four later updated to
https://en.wikipedia.org/wiki/Alpha_Five_(database) (I did not use)
but not as much info that IMDb keeps

- - -

Waiting for a reply from Col Needham, Official Rep to my above message
about Titles with No credits listed here

2.7K Messages

 • 

83K Points

6 years ago

It's been five months and there are apparently tens of thousands of empty IMDb name pages, so I figured a bump can't hurt.

2.7K Messages

 • 

83K Points

It's been five months and there are apparently tens of thousands of empty IMDb name pages, so I figured a bump can't hurt.

Eleven months later, so another bump can't hurt.