Migrate MediaModeration jobs to mw-cron
Closed, ResolvedPublic

Description

Migrate the MediaModeration periodic MediaWiki jobs from mwmaint to mw-cron on Kubernetes.

  • mediamoderation-updateMetrics
  • mediamoderation-hourlyScan
  • mediamoderation-continuousScan-commonswiki

Doc on the new platform

ServiceOps will handle migrating the jobs, but would appreciate input from the Trust and Safety Product Team on:

  • jobs that should be watched more
  • jobs that are low criticality and could be migrated first
  • outdated jobs that can be removed
  • any potential gotchas in the way these jobs use MediaWiki

Event Timeline

Clement_Goubert triaged this task as Medium priority.

Thanks for checking for our input!

  • jobs that should be watched more

The mediamoderation-continuousScan-commonswiki job runs pretty much all of the time and is restarted every hour.

Downtime should be avoided if possible, but any downtime would resolve itself automatically: when the script is started again, it simply processes the backlog created while it was down.
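As an illustration only, that catch-up behaviour could look roughly like this; the checkpoint, the upload records, and the `scan` callback are hypothetical stand-ins, not the extension's actual code:

```python
def resume_scan(last_scanned, uploads, scan):
    """Scan everything uploaded after the checkpoint, including any
    backlog that accumulated while the script was down."""
    backlog = [u for u in uploads if u["timestamp"] > last_scanned]
    for upload in sorted(backlog, key=lambda u: u["timestamp"]):
        scan(upload)                        # hand the image off for scanning
        last_scanned = upload["timestamp"]  # advance the checkpoint
    return last_scanned
```

After downtime, the next invocation starts from the old checkpoint and simply works through the accumulated backlog.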

  • jobs that are low criticality and could be migrated first

The mediamoderation-updateMetrics job only collects metrics once a day for use on a Grafana dashboard, so it isn't critical. However, it probably makes sense to avoid running it more than once.

  • outdated jobs that can be removed

None of these jobs are outdated, so nothing to remove.

  • any potential gotchas in the way these jobs use MediaWiki

For mediamoderation-updateMetrics and mediamoderation-hourlyScan:

  • Try to avoid running more than one instance of the same job at a time, to avoid scanning the same images twice
  • The scripts read data from the extension1 database cluster (noting this in case the new method changes how database access works)
  • The scripts are essentially a wrapper for queuing jobs, with a maximum of 200 jobs queued by them every 10 seconds (noting this in case the new method changes how jobs are queued)
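A rough sketch of the queueing rate limit described in the last bullet, capping at 200 queued jobs per 10-second window; the function and callback names are illustrative, not the actual maintenance-script code:

```python
import time

BATCH_SIZE = 200      # maximum jobs queued per window
WINDOW_SECONDS = 10   # length of each rate-limit window

def queue_in_batches(items, queue_jobs, sleep=time.sleep):
    """Queue jobs for items in batches of at most BATCH_SIZE,
    waiting WINDOW_SECONDS between batches."""
    sizes = []
    for start in range(0, len(items), BATCH_SIZE):
        batch = items[start:start + BATCH_SIZE]
        queue_jobs(batch)              # hand one batch to the job queue
        sizes.append(len(batch))
        if start + BATCH_SIZE < len(items):
            sleep(WINDOW_SECONDS)      # wait out the rest of the window
    return sizes
```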

  • jobs that should be watched more

We can arrange the migration to happen, say, 15 minutes before the job would be restarted; this way downtime would be minimized. However, if the first run on Kubernetes doesn't work and we need to roll back to mwmaint, we would be looking at up to an hour of downtime before it is restarted automatically, unless we launch a manual run.

  • jobs that are low criticality and could be migrated first

Great, then it will be among the first to move. Could you link the dashboard so we can check that the metrics updated correctly?

  • outdated jobs that can be removed

I had to try :D

  • any potential gotchas in the way these jobs use MediaWiki

Thanks for the info. The way the migrations should go, each job would be stopped and removed from mwmaint before being recreated as a Kubernetes CronJob in the mw-cron namespace. The new method runs in the same containerized MediaWiki environment as current production, and as mw-script.

I will provide you with a link to a logstash dashboard where you'll be able to select your team and see the logs from the jobs that have been migrated.

An hour of downtime should be okay. It only really becomes an issue for a day or so of downtime.

The dashboard is https://grafana.wikimedia.org/d/ab43c410-afca-4777-8244-53cea7ce2a2c/mediamoderation-photodna-stats?orgId=1&refresh=5m. The metrics generated are those shown in the Scanned/Unscanned % and Not checkable % panels.

Thanks!

Change #1139080 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw:maintenance: migrate mediamoderation-updateMetrics to k8s

https://gerrit.wikimedia.org/r/1139080

Change #1139080 merged by Hnowlan:

[operations/puppet@production] mw:maintenance: migrate mediamoderation-updateMetrics to k8s

https://gerrit.wikimedia.org/r/1139080

Change #1139415 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate mediamoderation-hourlyScan to k8s

https://gerrit.wikimedia.org/r/1139415

mediamoderation-updateMetrics appears to have run successfully this morning.

I would agree with that. It appears to be working the same.

Change #1139415 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate mediamoderation-hourlyScan to k8s

https://gerrit.wikimedia.org/r/1139415

hourlyScan is running in Kubernetes at the moment - output makes it look like it's behaving as desired, but it would probably be best to verify. You can read output by running kube_env mw-cron eqiad; kubectl logs mediamoderation-hourlyscan-29100362-4c745 mediawiki-main-app -f on a deploy server.

Looks good to me.

Change #1140484 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate continuousScan-commonswiki

https://gerrit.wikimedia.org/r/1140484

I'm not sure whether it's related, but after the deployment of https://gerrit.wikimedia.org/r/1139415 the metric for unscannable images started to spike. I don't see how the deployment would cause this, but wanted to mention it.

Clement_Goubert changed the task status from Open to In Progress.May 20 2025, 3:35 PM

Noting on task that it was unrelated to the migration, as the unscannable spike came from the commonswiki job which isn't migrated yet.

On that note, do you think we can move ahead with the migration of the commonswiki job?

Clement_Goubert raised the priority of this task from Medium to High.May 20 2025, 3:39 PM

Yes, I think it can be migrated now. Thanks.

Change #1140484 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate continuousScan-commonswiki

https://gerrit.wikimedia.org/r/1140484

This job has been migrated, but unfortunately the timings didn't line up fully, so it will miss a single run. I'll keep an eye on subsequent runs and update here on how it goes.

Thanks!

Based on the dashboard and the pod output, it looks like the job is behaving at least somewhat as expected; I'll keep an eye on how it dies and is recreated. Once we're fully happy with how the migration of this job has gone, we can replace the use of timeout with Kubernetes' built-in concurrency rules to replace the job every hour.
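For illustration, the built-in concurrency rules referred to here are the CronJob concurrencyPolicy: a Replace policy makes each hourly schedule tick kill the still-running job and start a fresh one, removing the need for timeout. This manifest is a sketch with illustrative names, image, and args, not the actual production config:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mediamoderation-continuousscan-commonswiki
  namespace: mw-cron
spec:
  schedule: "0 * * * *"        # tick every hour
  concurrencyPolicy: Replace   # replace the running job with the new one
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mediawiki-main-app
              image: example.org/mediawiki:latest             # illustrative image
              args: ["continuousScan", "--wiki=commonswiki"]  # illustrative args
```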

It looks like it's working to me.

I did see an alert in our Slack saying that the job had failed. I checked the logs but did not see a specific reason. Perhaps the normal timeout caused the alert?

I'm going to start a different scan of all wikis except Wikimedia Commons, because we have recently improved the scanning success rate with some thumbnail generation changes. This will likely make the metrics look odd, so I'm flagging it here. It should be solvable by filtering for just commonswiki.

The failed job is caused by the timeout in the command: timeout makes the container exit with code 124, which Kubernetes treats as a failure. If we remove it, the controller will simply replace the job every hour without reporting a failure.
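Exit code 124 is GNU timeout's standard exit status when the wrapped command is killed for running past its limit, which is why Kubernetes marks the run as failed. A quick demonstration:

```shell
status=0
# timeout kills sleep after 1 second, then exits with status 124
timeout 1 sleep 5 || status=$?
echo "$status"   # prints 124
```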

Change #1149336 had a related patch set uploaded (by Clรฉment Goubert; author: Clรฉment Goubert):

[operations/puppet@production] mw::maintenance: Remove timeout from continuousScan-commonswiki

https://gerrit.wikimedia.org/r/1149336

Change #1149336 merged by Clรฉment Goubert:

[operations/puppet@production] mw::maintenance: Remove timeout from continuousScan-commonswiki

https://gerrit.wikimedia.org/r/1149336

Removed the timeout. The current run will still end in error, but the next one should be fine.

Clement_Goubert updated the task description. (Show Details)

Deleted the failed job to clear the alert, jobs are being correctly replaced on schedule.

Migration complete, ty!