
Write and run script to find non-existent images on Wikimedia wikis
Open, Needs Triage, Public

Description

Currently on Commons and on the various projects, there are images that have description pages but whose files have disappeared. Because the metadata (MIME type, file size, directory, etc.) is stored in the database, it is not easy to find which images have this issue. (The lack of a definitive list of broken images also makes it very difficult to know whether this is a growing problem somehow related to WMF's servers.)

It would be nice if someone were to write a script that checked each image in a database to ensure that it exists and that it is not 0 bytes. Example (though this will likely be deleted by an admin at some point): http://commons.wikimedia.org/wiki/Image:Hatogayacity_Fire_Department.jpg


Version: unspecified
Severity: enhancement

Details

Reference
bz15889

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:19 PM
bzimport set Reference to bz15889.
bzimport added a subscriber: Unknown Object (MLST).
MZMcBride created this task. Oct 7 2008, 9:17 PM

Is this still an issue that needs to be investigated?

Since you are a TS whiz, can this be done with the recent DB metadata improvements?

Bumping this. Still an issue?

Are we talking about:

  1. Running cleanupImages.php? Would probably need shell.
  2. Running a database query? It would be something like "SELECT page_title FROM page WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title) AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title) AND page_namespace = 6 AND page_is_redirect = 0 LIMIT 1000;"
  3. Fixing bug 32551?
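
To make option 2 concrete, here is a minimal sketch that runs the query above against a throwaway in-memory SQLite database. The table contents, the use of SQLite, and the attached `commonswiki_p` schema are stand-ins invented for illustration; the real replicas are separate databases with far more columns. Note what the query reports: File pages with no row in either image table, which is not quite the same thing as files missing from storage.

```python
import sqlite3

# The query from option 2, only reflowed across lines.
QUERY = """
SELECT page_title FROM page
WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title)
  AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title)
  AND page_namespace = 6
  AND page_is_redirect = 0
LIMIT 1000;
"""


def build_demo_db():
    """Create a toy local wiki database plus an attached toy Commons database."""
    conn = sqlite3.connect(":memory:")
    # Assumption: an attached in-memory database plays the role of commonswiki_p.
    conn.execute("ATTACH DATABASE ':memory:' AS commonswiki_p")
    conn.executescript(
        """
        CREATE TABLE page (page_title TEXT, page_namespace INTEGER, page_is_redirect INTEGER);
        CREATE TABLE image (img_name TEXT);
        CREATE TABLE commonswiki_p.image (img_name TEXT);
        INSERT INTO page VALUES
            ('Local.jpg', 6, 0),     -- has a local image row
            ('Shared.jpg', 6, 0),    -- has a Commons image row
            ('Orphan.jpg', 6, 0),    -- description page only
            ('Redirect.jpg', 6, 1),  -- redirect, ignored by the query
            ('Article', 0, 0);       -- not in the File namespace
        INSERT INTO image VALUES ('Local.jpg');
        INSERT INTO commonswiki_p.image VALUES ('Shared.jpg');
        """
    )
    return conn


def find_orphan_file_pages(conn):
    """Return titles of File pages that have no image row locally or on Commons."""
    return [row[0] for row in conn.execute(QUERY)]


if __name__ == "__main__":
    print(find_orphan_file_pages(build_demo_db()))
```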

(In reply to TeleComNasSprVen from comment #4)

Are we talking about:

  1. Running cleanupImages.php? Would probably need shell.
  2. Running a database query? It would be something like "SELECT page_title FROM page WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title) AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title) AND page_namespace = 6 AND page_is_redirect = 0 LIMIT 1000;"
  3. Fixing bug 32551?

The bug summary is "Write and run script to find non-existent images on Wikimedia wikis". We're probably talking about that.

Restricted Application added subscribers: Matanya, Aklapper. Aug 9 2015, 9:30 AM
Peachey88 set Security to None.
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 5:51 PM
Restricted Application added a subscriber: JEumerus. Jan 26 2016, 12:45 PM
Meno25 removed a subscriber: Meno25. Feb 22 2016, 6:06 PM

Are we talking about:
[…]

  3. Fixing bug 32551?

T34551 has been resolved.

Dereckson assigned this task to MZMcBride. Edited Sep 9 2016, 2:53 AM

[ Assigning this to original poster, to clarify the requested action and the goal per previous comment. Once done, feel free to deassign for further triaging. ]

Dereckson raised the priority of this task from Low to Needs Triage. Sep 9 2016, 2:54 AM

The idea here was to have a script iterate over all known images in a MediaWiki wiki and check that the files actually exist on the file system and are accessible and are not corrupt. A maintenance script that checks every image's SHA1 or file size from the file system directly and compares to the information we have stored in the image database table would resolve this task.
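
As a rough illustration of that idea, here is a minimal Python sketch, not a MediaWiki maintenance script. It assumes the database rows have already been fetched as `(name, size, sha1)` tuples, that files sit in an MD5-hashed directory layout (`a/ab/Name.jpg`), and that the stored SHA-1 is encoded in base 36 and padded to 31 characters. Those three points, and every function name below, are assumptions made for the example; a wiki backed by object storage rather than a plain file system would need a different lookup.

```python
import hashlib
import os
import tempfile

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data):
    """SHA-1 of data in base 36, zero-padded to 31 chars (assumed storage format)."""
    n = int(hashlib.sha1(data).hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(31, "0")


def hashed_path(root, name):
    """Assumed on-disk layout: <root>/<h[0]>/<h[0:2]>/<name>, h = md5(name)."""
    h = hashlib.md5(name.encode("utf-8")).hexdigest()
    return os.path.join(root, h[0], h[:2], name)


def check_files(rows, root):
    """Compare (name, size, sha1) rows against the files found under root."""
    problems = []
    for name, size, sha1 in rows:
        path = hashed_path(root, name)
        if not os.path.isfile(path):
            problems.append((name, "missing"))
            continue
        # Reads the whole file into memory; fine for a sketch, not for huge files.
        with open(path, "rb") as f:
            data = f.read()
        if len(data) == 0:
            problems.append((name, "empty"))
        elif len(data) != size:
            problems.append((name, "size mismatch"))
        elif sha1_base36(data) != sha1:
            problems.append((name, "sha1 mismatch"))
    return problems


def demo():
    """Build a temporary file tree with one good and three broken entries."""
    with tempfile.TemporaryDirectory() as root:
        good = b"fake image bytes"
        rows = [
            ("Good.jpg", len(good), sha1_base36(good)),
            ("Gone.jpg", 10, sha1_base36(b"0123456789")),
            ("Empty.jpg", 5, sha1_base36(b"abcde")),
            ("Changed.jpg", 4, sha1_base36(b"abcd")),
        ]
        on_disk = [("Good.jpg", good), ("Empty.jpg", b""), ("Changed.jpg", b"abcX")]
        for name, data in on_disk:
            path = hashed_path(root, name)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
        return check_files(rows, root)


if __name__ == "__main__":
    for name, problem in demo():
        print(name, problem)
```

Checking the size before hashing keeps the cheap test first; the SHA-1 comparison only runs for files whose length already matches the database row.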

It's been many years since this task was filed. It may now be moot.

That could be combined with T32961.

maintenance/findMissingFiles.php already exists. I'd rather not combine T32961 with anything else.

Aklapper removed MZMcBride as the assignee of this task. Fri, Jun 19, 4:17 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)