
Request creation of commons-corruption-checker VPS project (2)
Closed, Resolved, Public


Project Name: commons-corruption-checker

Wikitech Usernames of requestors: TheSandDoctor

Purpose: Check existing images on Wikimedia Commons for corruption and monitor uploads indefinitely

Brief description: Please see T241635. I was distracted due to current affairs and off-wiki activities and missed the project's deletion. I plan to spend more time on it and am requesting that the VPS project be re-created. Copying from there:

This task will require the installation of Python and the Python packages mwclient, mwparserfromhell, Pywikibot, sseclient, and my own version of Pillow (PIL). I need this version because of the potential for very large images. This bot task has been approved on Commons. It works by downloading all of the images (only one at a time is downloaded and processed, and it is deleted afterwards), scanning them, and logging the result in a database. If corruption is detected, the uploader is notified. After 7 days (new uploads) or 30 days (existing catalogue) have passed, the image is re-downloaded and re-checked. If its hash matches the previously checked version (i.e. unchanged and still corrupt), it is tagged for speedy deletion and the uploader is notified of this action.
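To make the re-check rule above concrete, here is a minimal sketch of the decision step only. It is not the bot's actual code: the `CorruptionRecord` fields, the returned action names, and the choice of SHA-1 as the hash are all assumptions for illustration, and the scanning and notification steps are left out.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta

# Grace periods taken from the description above.
GRACE_NEW_UPLOAD = timedelta(days=7)
GRACE_EXISTING = timedelta(days=30)


@dataclass
class CorruptionRecord:
    """Hypothetical database row for a file that failed its first scan."""
    title: str
    hash_at_first_check: str
    first_checked: datetime
    is_new_upload: bool


def file_hash(data: bytes) -> str:
    # SHA-1 is an assumption here; any stable content hash would do.
    return hashlib.sha1(data).hexdigest()


def recheck_action(record: CorruptionRecord, current_bytes: bytes, now: datetime) -> str:
    """Decide what to do with a previously flagged file."""
    grace = GRACE_NEW_UPLOAD if record.is_new_upload else GRACE_EXISTING
    if now - record.first_checked < grace:
        return "wait"  # grace period has not elapsed yet
    if file_hash(current_bytes) == record.hash_at_first_check:
        return "tag_for_speedy_deletion"  # unchanged, so still corrupt
    return "rescan"  # file was replaced; scan the new version
```

A caller would load the record from the database, re-download the file, and act on the returned string.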

I am definitely open to adding collaborators on this task/project and would not have "closed" membership.

How soon you are hoping this can be fulfilled: as soon as possible

Event Timeline

TheSandDoctor renamed this task from Request creation of <PROJECT-NAME> VPS project to Request creation of commons-corruption-checker VPS project. Apr 4 2021, 6:02 AM
TheSandDoctor renamed this task from Request creation of commons-corruption-checker VPS project to Request creation of commons-corruption-checker VPS project (2).

This sounds like you might need some storage of significant size for temp space, right? Is that why you aren't aiming to put it on Toolforge, or is that because of the forked Pillow?

@Bstorm both. Re storage space: potentially yes, but only for the odd edge case where it runs across a gigapixel image (which is also why Pillow is forked, to add handling for that niche use case). To be space-conscious, each worker only downloads and works with one file at a time and deletes the temp copy once done. I tend to run a variable number of workers depending on the workload.

The program runs with a watcher/worker model: the watcher pushes all relevant details (mostly just the title and a hash to verify the download) onto a text-based Redis queue, and the workers then pop an entry, download the file, process it, do whatever the result requires, and delete the file. This is used because images are uploaded faster than they can be processed in real time, and it was a suggested implementation at the Commons approval.
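For readers unfamiliar with the pattern, the watcher/worker hand-off described above could look roughly like the sketch below. This is an illustration and not the bot's code: the queue key, the JSON payload shape, and the function names are hypothetical. To keep it runnable without a server, it uses a small in-memory stand-in exposing `rpush` and `lpop`; a redis-py client offers methods with the same names, so it should be usable in its place.

```python
import json
from collections import deque

QUEUE_KEY = "ccc:pending"  # hypothetical key name


class InMemoryQueue:
    """Stand-in for a Redis client; implements only rpush and lpop."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, value):
        self._lists.setdefault(key, deque()).append(value)

    def lpop(self, key):
        items = self._lists.get(key)
        return items.popleft() if items else None


def enqueue_upload(client, title, file_hash):
    """Watcher side: push the title and expected hash as one text entry."""
    client.rpush(QUEUE_KEY, json.dumps({"title": title, "hash": file_hash}))


def process_next(client, handle):
    """Worker side: pop one entry and hand it to `handle`.

    Returns False when the queue is empty, True otherwise. `handle` is
    where download, scan, logging, and temp-file cleanup would happen.
    """
    raw = client.lpop(QUEUE_KEY)
    if raw is None:
        return False
    job = json.loads(raw)
    handle(job["title"], job["hash"])
    return True
```

Because each worker pops a single entry at a time, at most one temporary file per worker exists on disk, which matches the space-conscious behaviour described earlier in the thread.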

I was asking about storage because the latest round of flavors come with only 20 GB apiece. That suggests you may want to use Cinder storage (usually for important data), or something like scratch NFS or similar. If you aren't likely to need that at first, at least, then we can just make the project and figure that out later.

@Bstorm would it be something that could be easily attached, or would that require the instance to be recreated? If the latter, we should probably do it now. If it can be added seamlessly, then later is fine. Just wondering... if later, would I just file a ticket here requesting it?

It can be added easily later. We'll move this along.