Create maintenance script to queue images for checking
Closed, Resolved (Public)

Description

The maintenance script would take an optional reference to the image to begin with and a maximum count of how many images to check. If feasible, there should be a way to indicate if an image has already been checked so it can be skipped, and there should be a way to force checking of an image even if it was marked as already checked. The images to be checked would be queued for asynchronous checking.
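A minimal sketch of the selection logic described above, written in Python purely for illustration (the real script would be a MediaWiki maintenance script). The `Image` type, its `checked` flag, and the `start`, `limit` and `force` parameter names are all assumptions and are not taken from the extension.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Image:
    name: str
    checked: bool = False  # hypothetical "already checked" marker


def select_images(
    images: Iterable[Image],
    start: Optional[str] = None,
    limit: Optional[int] = None,
    force: bool = False,
) -> Iterator[Image]:
    """Yield the images that should be queued, in name order.

    start: name of the image to begin with (inclusive); None means from the beginning.
    limit: maximum number of images to queue; None means no limit.
    force: queue images even if they are marked as already checked.
    """
    selected = 0
    for image in sorted(images, key=lambda i: i.name):
        if start is not None and image.name < start:
            continue  # before the requested starting point
        if image.checked and not force:
            continue  # already checked, and no forced re-check requested
        if limit is not None and selected >= limit:
            break
        selected += 1
        yield image
```

In this sketch, skipped images do not count toward the limit, so `limit` bounds the number of jobs actually queued rather than the number of rows scanned.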

Event Timeline

@Pchelolo What is the best way to keep track of the fact that we have already processed a picture, so that we don't need to recheck it?

Change 583630 had a related patch set uploaded (by Peter.ovchyn; owner: Peter.ovchyn):
[mediawiki/extensions/MediaModeration@master] Create maintenance script to queue images for checking

https://gerrit.wikimedia.org/r/583630

So, the script should iterate over images and post jobs. Given that the extension is WMF-specific and is not intended for third-party use, we can rely on the WMF Kafka-based job queue properties.

Thus, the script should simply post a job per image; it shouldn't care about spreading the jobs over time or about job retries. That will be done by the job queue system. The jobs have to be posted not just for the latest version of each file, but for each specific version of the file, so that we check them all.
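A rough sketch of "one job per file version", in Python for illustration only. The `File` type, the job dictionary keys, and the `checkImage` job type name are hypothetical; `push` stands in for whatever call actually submits a job to the queue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class File:
    name: str
    current_timestamp: str
    # Timestamps of older versions still listed in the file history.
    old_timestamps: List[str] = field(default_factory=list)


def jobs_for_file(f: File) -> Iterator[Dict[str, object]]:
    """Yield one job for the current version and one per old version."""
    yield {"type": "checkImage", "name": f.name,
           "timestamp": f.current_timestamp, "current": True}
    for ts in f.old_timestamps:
        yield {"type": "checkImage", "name": f.name,
               "timestamp": ts, "current": False}


def queue_all(files: Iterable[File],
              push: Callable[[Dict[str, object]], None]) -> int:
    """Post every job and return how many were posted.

    No throttling or retry logic here: per the comment above, that is
    left to the job queue system.
    """
    count = 0
    for f in files:
        for job in jobs_for_file(f):
            push(job)
            count += 1
    return count
```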

Given that Kafka capacity for holding the jobs is practically infinite (if we adjust the Kafka topic GC time for the topic holding these jobs to, say, a year), we could potentially just loop over all images, post these several hundred million jobs into a separate topic, sit back, relax and wait for the queue to crawl over them slowly. All the failed matches will be saved in an error queue, which we would be able to re-process subsequently.

Protecting against the script failing and restarting could be done by simply outputting the start position of each batch and manually restarting the script from the last reported position. Some duplicates can be filtered out by the queue deduplication system (set removeDuplicates to true in the job specification).
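The restart approach can be sketched as follows, in Python for illustration only. `DedupQueue` is a toy in-memory stand-in for the queue's deduplication, not the real job queue API, and the `removeDuplicates` key on a plain dictionary only mimics the setting mentioned above; the job type name and log message format are assumptions.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Job = Dict[str, object]


class DedupQueue:
    """Toy queue that drops jobs it has already seen (illustrative only)."""

    def __init__(self) -> None:
        self.jobs: List[Job] = []
        self._seen: Set[Tuple[object, object]] = set()

    def push(self, job: Job) -> None:
        key = (job["type"], job["name"])
        if job.get("removeDuplicates") and key in self._seen:
            return  # duplicate filtered out
        self._seen.add(key)
        self.jobs.append(job)


def post_in_batches(
    names: Sequence[str],
    push: Callable[[Job], None],
    batch_size: int = 100,
    start: Optional[str] = None,
    log: Callable[[str], None] = print,
) -> int:
    """Post one job per name, logging where each batch starts.

    After a crash, pass the last logged position as `start` to resume.
    The resumed batch is posted again in full, which is why the jobs
    are marked for deduplication.
    """
    ordered = sorted(names)
    if start is not None:
        ordered = [n for n in ordered if n >= start]
    posted = 0
    for offset in range(0, len(ordered), batch_size):
        batch = ordered[offset:offset + batch_size]
        log(f"Starting batch at: {batch[0]}")
        for name in batch:
            push({"type": "checkImage", "name": name, "removeDuplicates": True})
            posted += 1
    return posted
```

Restarting from a logged position re-posts at most one batch of jobs, so a smaller batch size means fewer duplicates for the queue to filter.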

This is the bare minimum one-off script that we should do.

Adding on top of it, we would probably want some persistence:

  • Add a property to the image metadata recording whether the image was checked and what the result of the check was. This would allow us to rerun the script and re-check only the images that were missed by previous runs. For that we can write a boolean field into the image metadata in the core db, or use structured data on Commons.
  • Persist where the script has left off. This would allow us to call the script from a cronjob and post N jobs per M amount of time, spreading the load without creating several hundred million jobs at once. Where to put this counter, I do not really know. We could: a) create a database table? b) put the counter into the main object cache? c) put the counter somewhere else?
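To make the second option concrete, here is a sketch of a cron-style run that posts at most N jobs and remembers where it stopped, in Python for illustration only. The JSON state file is just a placeholder for the storage question raised above (database table, object cache, or something else); `run_once`, `jobs_per_run` and the `last_name` key are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Sequence


def run_once(
    names: Sequence[str],
    push: Callable[[str], None],
    state_file: Path,
    jobs_per_run: int,
) -> int:
    """Post up to `jobs_per_run` jobs, resuming after the last posted name.

    Returns the number of jobs posted in this run (0 once everything is done).
    """
    last: Optional[str] = None
    if state_file.exists():
        last = json.loads(state_file.read_text())["last_name"]

    # Only names strictly after the persisted position are still pending.
    pending = [n for n in sorted(names) if last is None or n > last]
    batch = pending[:jobs_per_run]
    for name in batch:
        push(name)

    # Persist the new position only after the whole batch has been posted.
    if batch:
        state_file.write_text(json.dumps({"last_name": batch[-1]}))
    return len(batch)
```

Because the position is written only after the batch is posted, a crash in the middle of a run causes that batch to be posted again on the next run, so this sketch still relies on queue deduplication or on jobs being safe to repeat.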

Pinging @Joe for his opinion on the latter question.

@Pchelolo Speaking about Old/New files processing.

It looks like the script could process either New or Old files.
But not both simultaneously.

Is that ok?

@Pchelolo @CCicalese_WMF

Just wonder if we really need to process Archived/Old revisions via the script?

They’re invisible via UI (and API, if I’m not mistaken) and the only way to get the content is to restore that file.

That way we could hook the restore and process the restored file like a newly uploaded one.

Even if we really need to process old revisions, what should be the result? What info should be sent to Safety Team?

> Just wonder if we really need to process Archived/Old revisions via the script?

Archived !== Old. Look: https://commons.wikimedia.org/wiki/File:Large_white_(Pieris_brassicae)_underside.jpg

In the file history you can see various versions of the file. In this case - 1 old version. You can actually link it: https://upload.wikimedia.org/wikipedia/commons/archive/6/69/20180323093335%21Large_white_%28Pieris_brassicae%29_underside.jpg

Change 583630 merged by jenkins-bot:
[mediawiki/extensions/MediaModeration@master] Create maintenance script to queue images for checking

https://gerrit.wikimedia.org/r/583630

We got the basic MVP version of the script in. Resolving this for now; once we get the extension deployed, we can start improving the script.