
Write and run script to find non-existent images on Wikimedia wikis
Open, Needs Triage, Public

Description

Currently, on Commons and on the various projects, there are images whose description pages exist but whose files have disappeared. Because the metadata (MIME type, file size, directory, etc.) is stored in the database, it's not easy to find which images have this issue. (The lack of a definitive list of broken images also makes it very difficult to know whether this is a growing problem somehow related to WMF's servers.)

It would be nice if someone were to write a script that checked each image in a database to ensure that it exists and that it is not 0 bytes. Example (though this will likely be deleted by an admin at some point): http://commons.wikimedia.org/wiki/Image:Hatogayacity_Fire_Department.jpg
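
A minimal sketch of the requested check, assuming the files sit on a local file system in MediaWiki's hashed upload layout (`x/xy/Name`, derived from the MD5 of the file name). The helper names `hashed_path` and `check_file` are hypothetical, and a real run would take its file names from the image table rather than a hard-coded list.

```python
import hashlib
import os
import tempfile


def hashed_path(root, name):
    """Return root/x/xy/name, where xy are the first two hex digits of md5(name)."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[0], digest[:2], name)


def check_file(root, name):
    """Classify one file as 'ok', 'missing', or 'empty' (0 bytes)."""
    path = hashed_path(root, name)
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


def demo():
    """Build a throwaway upload directory and check three names against it."""
    with tempfile.TemporaryDirectory() as root:
        for name, data in [("Good.jpg", b"\xff\xd8data"), ("Empty.jpg", b"")]:
            path = hashed_path(root, name)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as fh:
                fh.write(data)
        # "Gone.jpg" is listed but was never written to disk.
        names = ["Good.jpg", "Empty.jpg", "Gone.jpg"]
        return {name: check_file(root, name) for name in names}


if __name__ == "__main__":
    print(demo())
```

The sketch only covers plain directories; a wiki that keeps its files in an object store would need a different existence check.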


Version: unspecified
Severity: enhancement

Details

Reference
bz15889

Event Timeline

bzimport raised the priority of this task from to Low.
bzimport set Reference to bz15889.
bzimport added a subscriber: Unknown Object (MLST).
MZMcBride created this task. Oct 7 2008, 9:17 PM

Is this still an issue that needs to be investigated?

Since you are a TS whiz, can this be done with the recent DB metadata improvements?

Bumping this. Still an issue?

Are we talking about:

  1. Running cleanupImages.php? Would probably need shell.
  2. Running a database query? It would be something like "SELECT page_title FROM page WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title) AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title) AND page_namespace = 6 AND page_is_redirect = 0 LIMIT 1000;"
  3. Fixing bug 32551?
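
For reference, the query in option 2 can be tried on toy data. The sketch below assumes a cut-down schema (only the columns the query touches) and uses an in-memory SQLite database attached under the name `commonswiki_p` as a stand-in for the shared repository; the sample rows are invented. Note that the query lists file description pages with no row in either image table, which is not the same thing as a file missing from storage.

```python
import sqlite3

# The query from option 2, reformatted but otherwise unchanged.
QUERY = """
SELECT page_title FROM page
WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title)
  AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image
                  WHERE img_name = page_title)
  AND page_namespace = 6 AND page_is_redirect = 0
LIMIT 1000;
"""


def demo():
    """Run the query against a tiny, made-up local wiki plus shared repo."""
    conn = sqlite3.connect(":memory:")
    # Second in-memory database standing in for the Commons replica.
    conn.execute("ATTACH DATABASE ':memory:' AS commonswiki_p")
    conn.execute(
        "CREATE TABLE page (page_title TEXT, page_namespace INTEGER, "
        "page_is_redirect INTEGER)"
    )
    conn.execute("CREATE TABLE image (img_name TEXT)")
    conn.execute("CREATE TABLE commonswiki_p.image (img_name TEXT)")
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [
            ("Local.jpg", 6, 0),     # has a local image row
            ("Shared.jpg", 6, 0),    # has a row in the shared repo only
            ("Orphan.jpg", 6, 0),    # description page with no image row
            ("Redirect.jpg", 6, 1),  # redirect, excluded by the query
            ("Article", 0, 0),       # not in the File namespace
        ],
    )
    conn.execute("INSERT INTO image VALUES ('Local.jpg')")
    conn.execute("INSERT INTO commonswiki_p.image VALUES ('Shared.jpg')")
    titles = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return titles


if __name__ == "__main__":
    print(demo())
```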

(In reply to TeleComNasSprVen from comment #4)

Are we talking about:

  1. Running cleanupImages.php? Would probably need shell.
  2. Running a database query? It would be something like "SELECT page_title FROM page WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title) AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title) AND page_namespace = 6 AND page_is_redirect = 0 LIMIT 1000;"
  3. Fixing bug 32551?

The bug summary is "Write and run script to find non-existent images on Wikimedia wikis". We're probably talking about that.

Restricted Application added subscribers: Matanya, Aklapper. Aug 9 2015, 9:30 AM
Peachey88 set Security to None.
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 5:51 PM
Restricted Application added a subscriber: JEumerus. Jan 26 2016, 12:45 PM
Meno25 removed a subscriber: Meno25. Feb 22 2016, 6:06 PM

Are we talking about:
[…]

  1. Fixing bug 32551?

T34551 has been resolved.

Dereckson assigned this task to MZMcBride.

[ Assigning this to original poster, to clarify the requested action and the goal per previous comment. Once done, feel free to deassign for further triaging. ]

Dereckson raised the priority of this task from Low to Needs Triage. Sep 9 2016, 2:54 AM

The idea here was to have a script iterate over all known images in a MediaWiki wiki and check that the files actually exist on the file system, are accessible, and are not corrupt. A maintenance script that reads every image's SHA1 or file size directly from the file system and compares it to the information stored in the image database table would resolve this task.
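
A rough sketch of that comparison, under stated assumptions: the rows are `(name, size, sha1)` tuples standing in for `img_name`, `img_size` and `img_sha1`, the stored hash is base-36 and zero-padded to 31 characters, and files live in a flat local directory. The function names `sha1_base36` and `verify` are hypothetical, and this is not how any existing maintenance script is implemented.

```python
import hashlib
import os
import tempfile

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data):
    """SHA-1 of data as a base-36 string, zero-padded to 31 characters."""
    number = int(hashlib.sha1(data).hexdigest(), 16)
    out = ""
    while number:
        number, remainder = divmod(number, 36)
        out = DIGITS[remainder] + out
    return out.rjust(31, "0")


def verify(rows, path_for):
    """Compare (name, size, sha1) rows against the files path_for() points to."""
    problems = []
    for name, size, sha1 in rows:
        try:
            # Reads the whole file into memory; fine for a sketch only.
            with open(path_for(name), "rb") as fh:
                data = fh.read()
        except OSError:
            problems.append((name, "missing"))
            continue
        if len(data) != size:
            problems.append((name, "size mismatch"))
        elif sha1_base36(data) != sha1:
            problems.append((name, "sha1 mismatch"))
    return problems


def demo():
    """Write three files, then check four rows that all claim the same content."""
    good = b"image bytes"
    files = {
        "Good.jpg": good,
        "Truncated.jpg": b"ima",         # shorter than recorded
        "Flipped.jpg": b"image bytez",   # same length, different content
    }
    with tempfile.TemporaryDirectory() as root:
        for name, data in files.items():
            with open(os.path.join(root, name), "wb") as fh:
                fh.write(data)
        names = ["Good.jpg", "Truncated.jpg", "Flipped.jpg", "Gone.jpg"]
        rows = [(name, len(good), sha1_base36(good)) for name in names]
        return verify(rows, lambda name: os.path.join(root, name))


if __name__ == "__main__":
    for name, problem in demo():
        print(name, problem)
```

Checking the size first is a cheap filter; the hash comparison only matters for files whose length still matches the recorded value.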

It's been many years since this task was filed. It may now be moot.

That could be combined with T32961.

maintenance/findMissingFiles.php already exists. I'd rather not combine T32961 with anything else.