
Investigate automated execution for MediaModeration [16H]
Closed, Resolved (Public)

Description

As a system, I would like the MediaModeration script to run on an ongoing basis without user intervention. This could be set up as a recurring cron task (e.g., scheduled by date and time), kicked off by a listener for relevant events (such as image uploads), or triggered by some other condition for ongoing or recurring execution, so that all images eventually get checked. As part of the MediaModeration automation, we will need to investigate automating two scripts:

  1. MediaModeration TimeStamp Script
  2. MediaModeration Purging Script

Event Timeline

Thalia Chan on AHaT might eventually be a useful person to talk to about this ticket.

ARamirez_WMF added a subscriber: eigyan.
jsn.sherman renamed this task from Investigate cron job for MediaModeration to Investigate automated execution for MediaModeration. (Feb 22 2022, 7:08 PM)
jsn.sherman updated the task description.
ARamirez_WMF renamed this task from Investigate automated execution for MediaModeration to Investigate automated execution for MediaModeration [8H]. (Mar 7 2022, 4:13 PM)
eigyan renamed this task from Investigate automated execution for MediaModeration [8H] to Investigate automated execution for MediaModeration [16H]. (May 16 2022, 6:11 PM)

Script execution:

Script runs on mwmaint1002.eqiad.wmnet

MediaModeration Purging Script:

mwscript extensions/MediaModeration/maintenance/ModerateExistingFiles.php --wiki commonswiki --batch-size=1000 --batch-count=5000 --start=20220104105243

MediaModeration TimeStamp Script (returns a JSON response containing the timestamp):

kafkacat -b kafka-main1001.eqiad.wmnet -t "eqiad.mediawiki.job.processMediaModeration" -c 1 -o -1on

Users currently have to SSH into the machine and run the execution command by hand.

Our goal is to automate this process and remove the need for human interaction.

Script Options:

--wiki - commonswiki

--batch-size - 1000 - number of images to process per batch.

--batch-count - 5000 - number of batches to run.

--start (timestamp) - 20220104105243 - the last time the script was run; currently obtained by running a kafkacat command on a different machine from the one the purging script runs on.


Use --batch-count to control how many images are processed.

batch-size * batch-count = total number of images processed per run, e.g. 1000 * 5000 = 5,000,000 images.
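
As a quick check of the arithmetic above, here is a minimal shell sketch. The function name total_images is illustrative only and is not part of the MediaModeration extension.

```shell
#!/bin/sh
# total_images BATCH_SIZE BATCH_COUNT
# Prints the upper bound on the number of images one run will process.
# (Hypothetical helper, not part of the MediaModeration extension.)
total_images() {
    echo $(( $1 * $2 ))
}
```

Calling total_images 1000 5000 prints 5000000, matching the figure above.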

Script Automation Criteria:

1. Remove the dependency on a person logging in to obtain the last timestamp.
  1. Create a script/middleware that retrieves the last timestamp so it can be read by the execution script.
  2. Create a cron job that executes the aforementioned script.
  3. Store the timestamp in a file on disk, to be read later by the moderation script.
  4. Remove the need to run a separate script to obtain the last timestamp: write the last timestamp either at moderation script completion or as each file is being written (in case the script ends abruptly).
2. Remove the dependency on a person logging in to execute the maintenance script with the timestamp.
  1. Add a crontab job that executes the moderation script, but only if the timestamp was read/provided to the script.
  2. Create a .env file containing required options such as --wiki, --batch-count, etc.
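
The criteria above could be sketched roughly as the wrapper below. This is only an illustration under stated assumptions: the function names, the config variable names (WIKI, BATCH_SIZE, BATCH_COUNT), the DRY_RUN switch, and the file locations are all hypothetical. Only the mwscript command line itself is taken from this ticket. The sketch defaults to printing the command rather than running it.

```shell
#!/bin/sh
# Sketch only. Function names, variable names and paths are hypothetical.
# A crontab entry could then call a script built on run_moderation, e.g.
#   0 3 * * * /path/to/run-mediamoderation.sh   (path and schedule are made up)

# read_timestamp FILE
# Prints the stored timestamp if the first line of FILE is exactly
# 14 digits (YYYYMMDDHHMMSS); returns non-zero otherwise.
read_timestamp() {
    [ -r "$1" ] || return 1
    ts=$(head -n 1 "$1" | tr -d '[:space:]')
    case "$ts" in
        *[!0-9]*|'') return 1 ;;
    esac
    [ "${#ts}" -eq 14 ] || return 1
    printf '%s\n' "$ts"
}

# build_command WIKI BATCH_SIZE BATCH_COUNT START
# Prints the maintenance command line from this ticket without running it.
build_command() {
    printf 'mwscript extensions/MediaModeration/maintenance/ModerateExistingFiles.php --wiki %s --batch-size=%s --batch-count=%s --start=%s\n' "$1" "$2" "$3" "$4"
}

# run_moderation CONFIG_FILE TIMESTAMP_FILE
# Loads the options from CONFIG_FILE and only proceeds if a valid
# timestamp could be read from TIMESTAMP_FILE.
run_moderation() {
    . "$1"   # expected to define WIKI, BATCH_SIZE, BATCH_COUNT
    start=$(read_timestamp "$2") || {
        echo "no valid timestamp, skipping run" >&2
        return 1
    }
    cmd=$(build_command "$WIKI" "$BATCH_SIZE" "$BATCH_COUNT" "$start")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"          # default: show what would be executed
    else
        $cmd                 # relies on word splitting of the simple arguments
    fi
}
```

Writing the next timestamp back to the file (criterion 1.4) is left out on purpose, because the ticket does not say how the moderation script reports the last timestamp it processed.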

Script Automation Criteria:

  1. Remove the dependency on a person logging in to obtain the last timestamp by replacing it with a DB table (T308551).