Clean up old images on wikitech-static
Open, Medium · Public

Description

The daily import script does not clean up images that were imported in the past but are no longer referenced by a current revision. Most of the time that's the desired behavior, but for wikitech-static, we only care about what current revisions need. A script to clean these up would be good.

Example 'extra' image (description page from commons):
https://wikitech.wikimedia.org/wiki/File:%D0%9F%D0%B0%D0%BC%D1%8F%D1%82%D0%BD%D0%B8%D0%BA_%D0%A7%D0%B0%D0%BF%D0%B0%D0%B5%D0%B2%D1%83_%D0%92.%D0%98._%D0%B2_%D0%A1%D0%B0%D0%BC%D0%B0%D1%80%D0%B5.jpg
Introduced by a revision created on Oct 15 2018:
https://wikitech.wikimedia.org/w/index.php?title=User:Atsirlin/page9&action=history
File on disk on wikitech-static:
-rw-r--r-- 1 www-data www-data 5203729 Oct 15 2018 Памятник_Чапаеву_В.И._в_Самаре.jpg

The current revision of User:Atsirlin/page9 is empty on both production and -static wikitech.

Event Timeline

ArielGlenn triaged this task as Medium priority. · Dec 4 2019, 12:43 PM
ArielGlenn created this task.
Restricted Application added a subscriber: Aklapper. · Dec 4 2019, 12:43 PM
ArielGlenn added a comment. · Edited · Dec 4 2019, 12:46 PM

We need the following:

  • identify all images that are not used on wikitech (production) except in File: pages
  • delete all of those on wikitech-static via the MediaWiki API, which would remove the File pages and archive the images
  • run the https://www.mediawiki.org/wiki/Manual:DeleteArchivedFiles.php script on wikitech-static to remove all those archived images

These steps could be run once a month or so to keep the static copy of the wiki squeaky-clean.
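
As a rough illustration of the deletion step, here is a minimal Python sketch. It assumes the caller supplies two hypothetical callables, `api_get` and `api_post`, that wrap an already authenticated session against the wikitech-static API and return decoded JSON; neither name comes from this task. The `action=delete` and `meta=tokens` parameters are standard MediaWiki API usage, but the default reason string and the dry-run behaviour are illustrative choices only.

```python
def fetch_csrf_token(api_get):
    """Ask the API for a CSRF token, which action=delete requires."""
    data = api_get({
        "action": "query",
        "meta": "tokens",
        "type": "csrf",
        "format": "json",
    })
    return data["query"]["tokens"]["csrftoken"]


def delete_files(names, api_get, api_post,
                 reason="No longer used on wikitech", dry_run=True):
    """Delete File: pages for the given image names.

    api_get / api_post are hypothetical helpers (params dict -> JSON dict)
    wrapping a logged-in session. With dry_run=True nothing is sent.
    """
    token = None if dry_run else fetch_csrf_token(api_get)
    results = []
    for name in names:
        title = "File:" + name
        if dry_run:
            # Only report what would be deleted, so the list can be reviewed.
            results.append((title, "dry-run"))
            continue
        reply = api_post({
            "action": "delete",
            "title": title,
            "reason": reason,
            "token": token,
            "format": "json",
        })
        results.append((title, "error" if "error" in reply else "deleted"))
    return results
```

Defaulting to a dry run fits the "check the list first" caution: the first few monthly runs could just print the candidates before anything is actually deleted.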

To get the list of images not used, we could:

  • collect all image names from the imagelinks table (column 'il_to')
  • normalize those image names
  • for each image in the image table (column 'img_name'), normalize the name and check whether it's in the above list; if not, add it to a list of candidates to be purged

We'd want to check the list at first and make sure we got the normalization right so that we're not deleting something we want.
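
The comparison described above could be sketched in Python roughly as follows. This assumes the 'il_to' and 'img_name' values have already been fetched from the database as strings, and that normalization means NFC form, underscores instead of spaces, and an upper-cased first letter; those rules are an assumption about MediaWiki's title handling and are exactly the part that would need checking. The function names are hypothetical, and the sketch does not deal with excluding usage on File: pages.

```python
import unicodedata


def normalize_title(name: str) -> str:
    """Normalize a file name to an assumed MediaWiki DB-key form."""
    # Assumption: titles are NFC, use underscores, and start upper-case.
    name = unicodedata.normalize("NFC", name.strip())
    # Collapse any run of spaces/underscores into a single underscore.
    name = "_".join(name.replace("_", " ").split())
    return name[:1].upper() + name[1:]


def purge_candidates(img_names, il_to_names):
    """Return normalized image-table names that no imagelinks row points to."""
    linked = {normalize_title(n) for n in il_to_names}
    present = {normalize_title(n) for n in img_names}
    return sorted(present - linked)
```

For example, `purge_candidates(["A.png", "b c.jpg"], ["a.png"])` would report only `B_c.jpg`, since `A.png` matches a link target after normalization.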

There might be some better or even already written tool for this; please, MediaWiki people, do your thing and point us to it!

Volans added a subscriber: CDanis. · Dec 4 2019, 1:37 PM

After a brief discussion on irc, there are a couple of suggestions for updating the content of Special:UnusedFiles (which could then be used via the api, we hope):

James_F: Maybe refreshLinks will be sufficient?
mainframe98: apergos: couldn't you use https://www.mediawiki.org/wiki/Manual:UpdateSpecialPages.php with the --only flag?

If you'd like to retrieve the values of that special page through the API, you can use the query+querypage API module: https://wikitech-static.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=querypage&qppage=Unusedimages

Note that, confusingly enough, the page is named Unusedimages in the API. It does refer to Special:UnusedFiles, which has an alias named UnusedImages.
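
A minimal Python sketch of reading that list through the API, following continuation until the results run out, might look like this. The `api.php` URL is an assumption (only the ApiSandbox link appears above), and `http_fetch` and `unused_file_titles` are hypothetical names. The query parameters are the ones from the sandbox link plus `qplimit`.

```python
import json
import urllib.parse
import urllib.request

# Assumed location of api.php on wikitech-static; adjust if it differs.
API_URL = "https://wikitech-static.wikimedia.org/w/api.php"


def http_fetch(params):
    """Perform one GET request against the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def unused_file_titles(fetch=http_fetch, limit=50):
    """Yield page titles listed by the Unusedimages query page."""
    params = {
        "action": "query",
        "format": "json",
        "list": "querypage",
        "qppage": "Unusedimages",
        "qplimit": limit,
    }
    while True:
        data = fetch(params)
        for row in data["query"]["querypage"]["results"]:
            yield row["title"]
        if "continue" not in data:
            break
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

Passing `fetch` in as a parameter keeps the paging logic testable without network access. The results are only as fresh as the cached special page, so the page would still need updating first, as suggested in the IRC discussion.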