Migrate trust_and_safety_product_team jobs to mw-cron
Closed, Resolved (Public)

Description

Migrate Trust and Safety Product Team periodic mediawiki jobs from mwmaint to mw-cron on kubernetes.

Job name | Criticality | Done?
mediawiki_job_globalblocking-fixGlobalBlockWhitelist.timer | H | Y
mediawiki_job_purge_abusefilteripdata.timer |  | Y
mediawiki_job_purge_checkuser.timer |  | Y
mediawiki_job_purge_expired_blocks.timer |  | Y
mediawiki_job_purge_securepollvotedata.timer |  | Y
mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-global.timer |  | Y
mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-per-wiki.timer |  | Y

Doc on the new platform

ServiceOps new will handle migrating the jobs, but would appreciate input from Trust and Safety Product Team on:

  • jobs that should be watched more
  • jobs that are low criticality and could be migrated first
  • outdated jobs that can be removed
  • any potential gotchas in the way these jobs use MediaWiki

Event Timeline

Trust and Safety Product Team don't currently own AbuseFilter, so cannot speak definitively about that extension and mediawiki_job_purge_abusefilteripdata.timer. It is volunteer maintained per Developers/Maintainers.

However, it probably has high criticality, as it purges IP addresses after 90 days to be consistent with the privacy policy.

Community-Tech owns the MediaWiki-extensions-PageAssessments extension, so best to ask about mediawiki_job_pageassessments_cleanup.timer with them.

As for the rest of the jobs:

  1. mediawiki_job_globalblocking-fixGlobalBlockWhitelist.timer
    1. Low criticality
    2. Not outdated
    3. Should be fine to migrate first
  2. mediawiki_job_purge_checkuser.timer
    1. High criticality (deletes private data after 90 days to be consistent with privacy policy, needs to be run on at least small wikis)
    2. Not outdated
    3. Gotchas: Runs other maintenance scripts (specifically purgeRecentChanges.php from MediaWiki core), deletes rows from local wiki and wikishared DB (on extension1)
  3. mediawiki_job_purge_expired_blocks.timer
    1. Low criticality, but maybe best to not migrate first
    2. Not outdated
  4. mediawiki_job_purge_securepollvotedata.timer
    1. High criticality (deletes private data after 90 days to be consistent with privacy policy)
    2. Not outdated (AFAICS)
  5. mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-global.timer and mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-per-wiki.timer
    1. Lower criticality, as this only updates metrics that are used in the Temporary Accounts Grafana dashboard
    2. Not outdated

@Clement_Goubert I put the wrong name for the one owned by Community-Tech. It's mediawiki_job_pageassessments_cleanup.timer which would be owned by them, not the AbuseFilter one.

Edited, thank you. I put mediawiki_job_purge_abusefilteripdata.timer back in so it's tracked somewhere, if you don't mind.

That should be fine.

I was looking at setting up T385709: Periodic job alerting, and realized you've already been onboarded to alertmanager and critical alerts for your team are sent to Slack. Do you want the same for periodic jobs, or would you rather have Phabricator tasks created with your team's PHID?

We disabled these alerts because they were too unreliable. AFAIK we are not using these alerts at the moment and don't have a plan to re-enable them.

As to whether we want Slack or Phab tasks, I would personally advocate for a Phab task. I'll link to this discussion on Slack to get comments from others to check this.

Would this be one Phab task per alert? Or a new comment added to a Phab task?

If it's one Phab task per alert, then I would say we should go with Slack, to avoid a bunch of cleanup in Phab when there are false positives, or repeated alerts for the same issue, etc.

Change #1135005 had a related patch set uploaded (by Clément Goubert; author: Kamila Součková):

[operations/puppet@production] alertmanager: route T&S tasks to their Slack

https://gerrit.wikimedia.org/r/1135005

By default there is one task per alert name, and the task title is the alert name. When multiple alerts fire with the same name, the task description is updated to mention all firing alerts. I realized this information was missing from the AM documentation, and have expanded the FAQ at https://wikitech.wikimedia.org/wiki/Alertmanager#How_do_I_get_an_alert_to_open_a_task?

To sum up, it would open a task with the title MediaWikiCronJobFailed in your tag, and update the description for each particular CronJob that is failing, as long as the task remains "Open".

@kamila's patch (not yet merged) will route the alert to your Slack for now, but it's an easy change to make should you decide you want tasks.

I've updated https://wikitech.wikimedia.org/wiki/Periodic_jobs#Monitoring with this information.

Starting with Slack seems fine, we may want to switch to Phab later. Thanks for your work on this.

Change #1135005 merged by Kamila Součková:

[operations/puppet@production] alertmanager: route T&S tasks to their Slack

https://gerrit.wikimedia.org/r/1135005

Change #1139078 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] GlobalBlocking: Migrate fixGlobalBlockWhitelist

https://gerrit.wikimedia.org/r/1139078

Change #1139078 merged by Kamila Součková:

[operations/puppet@production] GlobalBlocking: Migrate fixGlobalBlockWhitelist

https://gerrit.wikimedia.org/r/1139078

Change #1140482 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate fixGlobalBlockWhitelist to k8s

https://gerrit.wikimedia.org/r/1140482

Change #1140482 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate purgeExpiredBlocks to k8s

https://gerrit.wikimedia.org/r/1140482

The purgeExpiredBlocks.php job has been migrated to k8s and ran this morning - the script exited zero which is success as far as Kubernetes is concerned, but it generates no output so it's hard to verify. Are there metrics or other artifacts that can be used to verify that the script was successful post-migration?

Change #1143073 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw-cron/WikimediaEvents: Migrate UpdatePeriodicMetrics jobs

https://gerrit.wikimedia.org/r/1143073

Change #1143073 merged by Kamila Součková:

[operations/puppet@production] mw-cron/WikimediaEvents: Migrate UpdatePeriodicMetrics jobs

https://gerrit.wikimedia.org/r/1143073

I'm seeing output by the script (though it doesn't indicate the number of blocks that are purged). Is it literally zero output that you are seeing or just a lack of number of rows that are purged?

As to whether metrics exist, I don't see any. Perhaps a task can be filed to add this information to the output of the script?

I think I misspoke on this one, I see output for all of the jobs and it looks healthy enough for me. Apologies!

I attempted to migrate the two wikimediaevents jobs, and while the global one runs successfully, the per-wiki one logs

The "WikimediaEvents" extension must be installed for this script to run. Please enable it and then try again.

and exits 1.

The global and per-wiki job ran on the exact same image, so this does not seem like a mw-on-k8s problem. I wonder if this was already failing on bare metal. @Dreamy_Jazz could you please check on this?

(Note: As the per-wiki job runs on all.dblist, I'm wondering if this could be failing because the extension is not enabled on private wikis as per https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/89964f271ad05e13a2a44916685ac0b35c062348/wmf-config/InitialiseSettings.php#8808 .)

You should be able to change the invocation from

/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/all.dblist extensions/WikimediaEvents/maintenance/UpdatePeriodicMetrics.php --verbose

to

/usr/local/bin/foreachwikiindblist "all - private" extensions/WikimediaEvents/maintenance/UpdatePeriodicMetrics.php --verbose

in order to skip the private wikis. The former version of foreachwikiindblist didn't have set -e, so would happily continue the loop if one of the calls exited 1.
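
As a rough illustration of what a dblist expression such as "all - private" means, the sketch below computes the difference between two wiki lists. It is not how foreachwikiindblist expands expressions (that code is not shown in this task): the `dblist_subtract` helper, the temporary files and the wiki names are all hypothetical placeholders chosen for the example.

```shell
#!/bin/bash
# Hypothetical helper (not part of foreachwikiindblist):
# print every line of the first list that does not appear in the second.
dblist_subtract() {
    local all_list="$1" exclude_list="$2"
    # -v: invert match, -x: match whole lines, -F: fixed strings,
    # -f: read the patterns to exclude from a file
    grep -vxFf "$exclude_list" "$all_list"
}

# Placeholder dblists with made-up wiki names.
tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT
printf '%s\n' wiki_a wiki_b private_wiki wiki_c > "$tmpdir/all.dblist"
printf '%s\n' private_wiki > "$tmpdir/private.dblist"

# Prints the wikis in "all" that are not in "private", one per line.
dblist_subtract "$tmpdir/all.dblist" "$tmpdir/private.dblist"
```

Running the per-wiki script only over the resulting list means a wiki without the extension never gets a chance to make the script exit 1.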

Change #1144562 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw-cron/UpdatePeriodicMetrics-per-wiki: fix dblist

https://gerrit.wikimedia.org/r/1144562

Oh, right, I had completely forgotten about that. Thank you, much appreciated!

Change #1144562 merged by Kamila Součková:

[operations/puppet@production] mw-cron/UpdatePeriodicMetrics-per-wiki: fix dblist

https://gerrit.wikimedia.org/r/1144562

Change #1145197 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist

https://gerrit.wikimedia.org/r/1145197

Change #1145197 merged by Kamila Součková:

[operations/puppet@production] mw-cron/UpdatePeriodicMetrics-per-wiki: really fix dblist

https://gerrit.wikimedia.org/r/1145197

Change #1147761 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists

https://gerrit.wikimedia.org/r/1147761

Change #1147761 merged by Kamila Součková:

[operations/puppet@production] mw-cron/GlobalBlocking: fix fixGlobalBlockWhitelist dblists

https://gerrit.wikimedia.org/r/1147761

Clement_Goubert changed the task status from Open to In Progress. May 20 2025, 3:31 PM
Clement_Goubert triaged this task as High priority.

Change #1149367 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate checkuser and securepoll jobs to Kubernetes

https://gerrit.wikimedia.org/r/1149367

Change #1149367 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate checkuser and securepoll jobs to Kubernetes

https://gerrit.wikimedia.org/r/1149367

Change #1149422 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate abusefilteripdata job to Kubernetes

https://gerrit.wikimedia.org/r/1149422

Change #1149422 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate abusefilteripdata job to Kubernetes

https://gerrit.wikimedia.org/r/1149422

hnowlan claimed this task.
hnowlan updated the task description.

Change #1149629 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist

https://gerrit.wikimedia.org/r/1149629

@kostajh @Dreamy_Jazz The purge_securepoll job failed due to running on all wikis instead of only the ones that have the extension enabled. I'm changing the invocation to use foreachwikiindblist securepollglobal.dblist instead of foreachwiki.

Change #1149629 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance::purge_securepoll: Only run on securepollglobal.dblist

https://gerrit.wikimedia.org/r/1149629

Mentioned in SAL (#wikimedia-operations) [2025-05-23T10:45:37Z] <claime> Manual run of purge-securepollvotedata - T388542

Manual run seems to have worked correctly.

Thanks for the info and fixing it.

I've left a comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1149629 to indicate that it might be the wrong dblist. It seems that it excludes all non-global wikis, but SecurePoll is enabled on wikis which are not global.

Change #1150598 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance::purge_securepoll: Fix dblist

https://gerrit.wikimedia.org/r/1150598

Change #1150598 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance::purge_securepoll: Fix dblist

https://gerrit.wikimedia.org/r/1150598

Ok so the issue is a little worse than I thought. The old wrapper didn't set -e, which meant the loop kept going when purgePrivateVoteData.php fatal'd on the extension not being installed for a wiki.

Unfortunately, the function that expands dblist expressions cannot exclude a single wiki; it can only combine whole dblists.

As a stopgap measure, I'll add an env var flag to the wrapper so it goes back to set +e when required.

I've added the necessary flag to the job and ran it manually to verify it ran correctly. It does log an error for loginwiki but keeps going as expected.
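
For readers unfamiliar with the stopgap, the difference between the two modes can be sketched roughly as below. This is an illustration under stated assumptions, not the actual wrapper: the `CONTINUE_ON_ERROR` variable name, the `run_for_wiki` stand-in and the wiki names are hypothetical, and the sketch uses an explicit status check instead of toggling `set -e` / `set +e` so that it behaves the same regardless of how it is invoked.

```shell
#!/bin/bash
# Stand-in for a per-wiki maintenance script: it fails on the one wiki
# where the (hypothetical) extension is not installed.
run_for_wiki() {
    local wiki="$1"
    if [ "$wiki" = "wiki_without_extension" ]; then
        echo "$wiki: extension must be installed" >&2
        return 1
    fi
    echo "$wiki: ok"
}

# Hypothetical wrapper loop. CONTINUE_ON_ERROR=1 mimics the old behaviour
# (no "set -e": log the error and keep going); any other value stops at
# the first failure, like the new wrapper.
foreach_wiki() {
    local wiki
    for wiki in "$@"; do
        if ! run_for_wiki "$wiki"; then
            if [ "${CONTINUE_ON_ERROR:-0}" != "1" ]; then
                return 1
            fi
        fi
    done
    # Assumption of this sketch: continue mode reports success even if
    # some wikis failed. The task does not say what the real wrapper returns.
    return 0
}

# Default mode: stops before reaching wiki_b.
CONTINUE_ON_ERROR=0
foreach_wiki wiki_a wiki_without_extension wiki_b || echo "stopped at first failure"

# Flag set: logs the error for the middle wiki and still processes wiki_b.
CONTINUE_ON_ERROR=1
foreach_wiki wiki_a wiki_without_extension wiki_b
```

Making the flag opt-in keeps fail-fast as the default, so only jobs that are known to hit wikis without the extension need to set it.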