
Write and run script to find non-existent images on Wikimedia wikis
Open, Low, Public, Feature

Description

Currently, on Commons and on the various projects, there are images that have description pages but whose files have disappeared. Because the metadata (MIME type, file size, directory, etc.) is stored in the database, it is not easy to find which images have this issue. (Without a definitive list of broken images, it is also very difficult to know whether this is a growing problem somehow related to WMF's servers.)

It would be nice if someone were to write a script that checked each image in a database to ensure that it exists and that it is not 0 bytes. Example (though this will likely be deleted by an admin at some point): http://commons.wikimedia.org/wiki/Image:Hatogayacity_Fire_Department.jpg


Version: unspecified
Severity: enhancement

Details

Reference
bz15889

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:19 PM
bzimport set Reference to bz15889.
bzimport added a subscriber: Unknown Object (MLST).

Is this still an issue that needs to be investigated?

Since you are a TS whiz, can this be done with the recent DB metadata improvements?

Bumping this. Still an issue?

Are we talking about:

  1. Running cleanupImages.php? Would probably need shell.
  2. Running a database query? It would be something like "SELECT page_title FROM page WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title) AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title) AND page_namespace = 6 AND page_is_redirect = 0 LIMIT 1000;"
  3. Fixing bug 32551?
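
To make option 2 concrete, here is a minimal sketch that runs the same NOT EXISTS query against a tiny in-memory SQLite database. Everything around the query is an assumption made for illustration: the three-column `page` table, the sample rows, the `commons_image` table standing in for `commonswiki_p.image`, the added `ORDER BY`, and the helper names `build_demo_db` and `find_pages_without_image_rows`. Note that a query of this shape lists description pages that have no row in either image table; on its own it would not detect a file whose database row exists but whose bytes are gone from storage.

```python
import sqlite3


def find_pages_without_image_rows(conn, limit=1000):
    """List File: pages (namespace 6) with no matching local or shared image row."""
    query = """
        SELECT page_title FROM page
        WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title)
          AND NOT EXISTS (SELECT img_name FROM commons_image WHERE img_name = page_title)
          AND page_namespace = 6
          AND page_is_redirect = 0
        ORDER BY page_title  -- added here only to make the demo output stable
        LIMIT ?
    """
    return [row[0] for row in conn.execute(query, (limit,))]


def build_demo_db():
    """Hypothetical miniature schema: only the columns the query touches."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_title TEXT, page_namespace INTEGER, page_is_redirect INTEGER);
        CREATE TABLE image (img_name TEXT);
        CREATE TABLE commons_image (img_name TEXT);  -- stands in for commonswiki_p.image
        INSERT INTO page VALUES
            ('Local.jpg', 6, 0),
            ('Shared.jpg', 6, 0),
            ('Orphan.jpg', 6, 0),
            ('Redirect.jpg', 6, 1),
            ('Article', 0, 0);
        INSERT INTO image VALUES ('Local.jpg');
        INSERT INTO commons_image VALUES ('Shared.jpg');
    """)
    return conn


if __name__ == "__main__":
    # Only the page with no image row anywhere is reported.
    print(find_pages_without_image_rows(build_demo_db()))
```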

(In reply to TeleComNasSprVen from comment #4)

Are we talking about:

  1. Running cleanupImages.php? Would probably need shell.
  2. Running a database query? It would be something like "SELECT page_title FROM page WHERE NOT EXISTS (SELECT img_name FROM image WHERE img_name = page_title) AND NOT EXISTS (SELECT img_name FROM commonswiki_p.image WHERE img_name = page_title) AND page_namespace = 6 AND page_is_redirect = 0 LIMIT 1000;"
  3. Fixing bug 32551?

The bug summary is "Write and run script to find non-existent images on Wikimedia wikis". We're probably talking about that.

Are we talking about:
[…]

  3. Fixing bug 32551?

T34551 has been resolved.

[ Assigning this to original poster, to clarify the requested action and the goal per previous comment. Once done, feel free to deassign for further triaging. ]

Dereckson raised the priority of this task from Low to Needs Triage. Sep 9 2016, 2:54 AM

The idea here was to have a script iterate over all known images in a MediaWiki wiki and check that the files actually exist on the file system and are accessible and are not corrupt. A maintenance script that checks every image's SHA1 or file size from the file system directly and compares to the information we have stored in the image database table would resolve this task.
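
As a rough illustration of that idea, the sketch below compares stored metadata against files on a plain local directory. It is not a MediaWiki maintenance script and does not use MediaWiki's file backend; the record layout (`name`, `rel_path`, `size`, `sha1_hex`) and the function names are hypothetical, and the hash is kept as a hex string purely for simplicity.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large images are not loaded into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_image(record, root):
    """Return the list of problems found for one record (empty if the file looks fine)."""
    path = os.path.join(root, record["rel_path"])
    if not os.path.isfile(path):
        return ["missing"]
    problems = []
    size = os.path.getsize(path)
    if size == 0:
        problems.append("zero bytes")
    if size != record["size"]:
        problems.append("size mismatch")
    if sha1_of_file(path) != record["sha1_hex"]:
        problems.append("sha1 mismatch")
    return problems


def find_broken_images(records, root):
    """Map image name -> problems, skipping records whose files check out."""
    report = {}
    for record in records:
        problems = check_image(record, root)
        if problems:
            report[record["name"]] = problems
    return report
```

In this sketch `records` would come from whatever query reads the image table, and `root` is the directory the files are expected under; both are assumptions of the example rather than details from this task.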

It's been many years since this task was filed. It may now be moot.

maintenance/findMissingFiles.php already exists. I'd rather not combine T32961 with anything else.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

maintenance/findMissingFiles.php already exists.

...which makes me wonder what else is still wanted in this task.

I ran this script on testwiki, to see how it works:

[urbanecm@mwmaint2001 ~]$ time mwscript findMissingFiles.php --wiki=testwiki
mwstore://local-multiwrite/local-public/archive/a/a3/20090409180603!AnyNonsense.png
mwstore://local-multiwrite/local-public/archive/4/47/20120830220223!Liberty_Bell_slot_machine_2012-07-30_13-52-19.jpg
mwstore://local-multiwrite/local-public/archive/5/51/20170823084259!SIPIJellyBeans.jpg
mwstore://local-multiwrite/local-public/archive/9/97/20200609104426!Test-by-nokib2.jpg
mwstore://local-multiwrite/local-public/archive/9/9c/20200716100612!Test_image_copyright.jpg

real    0m27.704s
user    0m7.204s
sys     0m1.756s
[urbanecm@mwmaint2001 ~]$

Can we confirm it yields the expected output before running it on more wikis?
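
One way to review output like the above before a wider run is to split each reported storage path into its parts and count how many point at archived versions versus current files. The following is only a sketch: it assumes one `mwstore://backend/container/path` entry per line, treats paths starting with `archive/` as old versions named `timestamp!name`, and uses hypothetical helper names (`parse_storage_path`, `summarize`).

```python
from collections import Counter

PREFIX = "mwstore://"


def parse_storage_path(line):
    """Split 'mwstore://backend/container/rel/path' into parts; None for other lines."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    parts = line[len(PREFIX):].split("/", 2)
    if len(parts) != 3:
        return None
    backend, container, rel_path = parts
    basename = rel_path.rsplit("/", 1)[-1]
    is_archive = rel_path.startswith("archive/")
    if is_archive and "!" in basename:
        # Archived versions are assumed to be named "<timestamp>!<original name>".
        timestamp, name = basename.split("!", 1)
    else:
        timestamp, name = None, basename
    return {
        "backend": backend,
        "container": container,
        "is_archive": is_archive,
        "timestamp": timestamp,
        "name": name,
    }


def summarize(lines):
    """Count archived vs. current entries, ignoring lines that are not storage paths."""
    counts = Counter()
    for line in lines:
        parsed = parse_storage_path(line)
        if parsed is not None:
            counts["archive" if parsed["is_archive"] else "current"] += 1
    return dict(counts)
```

Fed the five paths from the testwiki run, this would count all of them as archived versions, which may help decide whether the script is reporting what the task is actually after.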

Aklapper triaged this task as Low priority. Feb 4 2022, 10:51 AM
Aklapper changed the subtype of this task from "Task" to "Feature Request".