
Migrate discovery-search jobs to mw-cron
Closed, ResolvedPublic

Description

Migrate Discovery-Search periodic mediawiki jobs from mwmaint to mw-cron on kubernetes.

Job name                                                  | Criticality | Done?
mediawiki_job_cirrus_build_completion_indices_codfw.timer | M           | Y
mediawiki_job_cirrus_build_completion_indices_eqiad.timer | M           | Y
mediawiki_job_wikidata-updateQueryServiceLag.timer        |             | Y

Doc on the new platform

serviceops will handle migrating the jobs, but would appreciate input from Discovery-Search on:

  • jobs that should be watched more closely
  • jobs that are low criticality and could be migrated first
  • outdated jobs that can be removed
  • any potential gotchas in the way these jobs use MediaWiki

Event Timeline


@Clement_Goubert the move of CirrusSearch maint scripts to mwscript-k8s is blocked on T382398, and I suspect that mw-cron might have the same issue; is this something you could have a workaround for?

@Clement_Goubert the move of CirrusSearch maint scripts to mwscript-k8s is blocked on T382398, and I suspect that mw-cron might have the same issue; is this something you could have a workaround for?

The MwScript.php wrapper now waits for the envoy proxy to be up before proceeding, since T387208: Ensure tls-proxy container is started before launching main container. In our tests it's been working fine; we should test the CirrusSearch use case to validate that it's OK. Sorry I missed T382398 when resolving the task.

I ran a test invocation to see how it would work and it seems to have worked as expected:

mwscript-k8s --attach extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php -- --wiki=testwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize 3000 --cluster=eqiad

One bit I wonder about is the parallelization. We currently implement foreachwiki manually using xargs. Run this way with a parallelism of 4, it typically runs from ~02:30 to 09:00 (6.5 hours). The general goal here was to complete the rebuilds during the less busy part of the day. The runtime implies that if it were run without parallelism it might take more than 24 hours to complete the daily build. I suppose that wouldn't be the end of the world, since as long as wikis are run in order they would still see a rebuild once a day, but it seems a bit awkward.

/usr/local/bin/expanddblist all | xargs -I{} -P 4 sh -c "/usr/local/bin/mwscript extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php --wiki={} --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize 3000 --cluster=$1 --optimize 2>&1 | ts '{}'"

That xargs invocation should still work under mw-cron; we would need to add ts to the image, but that's about it. I tested it manually inside a mw-debug pod with Version.php and it seems to do the right thing:

/usr/local/bin/expanddblist all | xargs -I{} -P 4 sh -c "/usr/local/bin/mwscript Version.php --wiki={} 2>&1"

It may be a little tricky to escape the command correctly for the different templating layers, so if it gets too wild, we may want to write a small helper script for it. I'll do some testing with our test job to see how it goes.
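To sketch what such a helper could look like: wrapping the per-wiki command in a function (or standalone script) means xargs passes plain arguments and nothing needs quoting through the templating layers. This is a sketch, not the production setup; MWSCRIPT is parameterized for illustration, and sed stands in for ts's prefixing:

```shell
# Hypothetical helper: wraps the per-wiki invocation so the xargs line needs
# no nested quoting. MWSCRIPT is parameterized here for illustration only.
MWSCRIPT="${MWSCRIPT:-/usr/local/bin/mwscript}"

update_one_wiki() {
    wiki="$1"
    cluster="$2"
    "$MWSCRIPT" extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php \
        --wiki="$wiki" --masterTimeout=10m --replicationTimeout=5400 \
        --indexChunkSize 3000 --cluster="$cluster" --optimize 2>&1 \
        | sed "s/^/$wiki /"   # prefix each log line with the wiki, like ts '{}'
}
```

Installed as a standalone script (hypothetical path), the cron entry would reduce to something like `expanddblist all | xargs -I{} -P4 /usr/local/bin/update-one-wiki {} eqiad`, with no quotes to escape.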

Change #1133089 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_jobs: Test xargs parallelism

https://gerrit.wikimedia.org/r/1133089

Change #1133089 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_jobs: Test xargs parallelism

https://gerrit.wikimedia.org/r/1133089

Change #1133092 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_jobs: Fix parallel invocation

https://gerrit.wikimedia.org/r/1133092

Change #1133092 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_jobs: Fix parallel invocation

https://gerrit.wikimedia.org/r/1133092

OK, after a few test changes (escaping interleaved quotes through 3 template languages is no fun), I simplified the invocation for my test to

/usr/local/bin/expanddblist large | xargs -I{} -P4 /usr/local/bin/mwscript Version.php --wiki={}

and that works as intended.

I'm just now realizing that the sh -c invocation is there so that | ts '{}' works, but since logs will be sent to logstash and therefore timestamped, do we need this part of the call @EBernhardson?

The other important detail is that ts prefixes the wiki name to the logs; with 4 of them running in parallel, this is important for keeping the logs meaningful, since we can't just figure out which wiki a line belongs to by checking backwards in the log for the last wiki to start.
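For what it's worth, ts with a format string containing no % directives just prepends the literal text, so the prefixing (though not the timestamping) can be reproduced with sed if adding ts to the image ever becomes a problem. A toy demonstration, not the production command:

```shell
# ts '{}' (from moreutils) prepends its argument to every line; sed achieves
# the same per-line prefixing without the moreutils dependency:
printf 'Start run\nFinished run\n' | sed 's/^/enwiki /'
# enwiki Start run
# enwiki Finished run
```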

Clement_Goubert changed the task status from Open to In Progress.May 20 2025, 3:18 PM

Change #1148374 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch

https://gerrit.wikimedia.org/r/1148374

I ran a test invocation to see how it would work and it seems to have worked as expected:

mwscript-k8s --attach extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php -- --wiki=testwiki --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize 3000 --cluster=eqiad

One bit I wonder about is the parallelization. We currently implement foreachwiki manually using xargs. Run this way with a parallelism of 4, it typically runs from ~02:30 to 09:00 (6.5 hours). The general goal here was to complete the rebuilds during the less busy part of the day. The runtime implies that if it were run without parallelism it might take more than 24 hours to complete the daily build. I suppose that wouldn't be the end of the world, since as long as wikis are run in order they would still see a rebuild once a day, but it seems a bit awkward.

/usr/local/bin/expanddblist all | xargs -I{} -P 4 sh -c "/usr/local/bin/mwscript extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php --wiki={} --masterTimeout=10m --replicationTimeout=5400 --indexChunkSize 3000 --cluster=$1 --optimize 2>&1 | ts '{}'"

Coming back to this, we could also shard that script by database section and parallelize on that, what do you think?

Change #1148374 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch

https://gerrit.wikimedia.org/r/1148374

Coming back to this, we could also shard that script by database section and parallelize on that, what do you think?

If this is generic functionality that gets exposed, I think it would be perfectly reasonable. The 4-way parallelism was an arbitrary choice.

Coming back to this, we could also shard that script by database section and parallelize on that, what do you think?

If this is generic functionality that gets exposed, I think it would be perfectly reasonable. The 4-way parallelism was an arbitrary choice.

Yes, we have a puppet resource called profile::mediawiki::sharded_periodic_job that takes an array of database sections s{1..8} and creates a periodic job for each of them. That means we would have 8 instances of the job running in parallel, each on one section. The split is not even, as some sections have many smaller wikis while others have fewer but bigger ones.
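For illustration, the fan-out this produces is roughly the following shape. This is a sketch, not the puppet resource's actual implementation; the expanddblist/mwscript paths are parameterized here and the function names are invented:

```shell
# Sketch of per-section sharding: one parallel job per database section,
# wikis within a section processed serially. Not the actual puppet output.
EXPANDDBLIST="${EXPANDDBLIST:-/usr/local/bin/expanddblist}"
MWSCRIPT="${MWSCRIPT:-/usr/local/bin/mwscript}"

run_section() {
    section="$1"
    "$EXPANDDBLIST" "$section" | while read -r wiki; do
        "$MWSCRIPT" extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php \
            --wiki="$wiki"
    done
}

fan_out() {
    for section in "$@"; do
        run_section "$section" &   # each section runs as its own parallel job
    done
    wait
}
# e.g. fan_out s1 s2 s3 s4 s5 s6 s7   # s8 (wikidata) excluded
```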

Change #1149366 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron

https://gerrit.wikimedia.org/r/1149366

Change #1149368 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance: Migrate cirrus_build_completion_indices to mw-cron

https://gerrit.wikimedia.org/r/1149368

Change #1149368 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: Migrate cirrus_build_completion_indices to mw-cron

https://gerrit.wikimedia.org/r/1149368

Change #1149366 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron

https://gerrit.wikimedia.org/r/1149366

wikidata-updateQueryServiceLag needs an egress networkpolicy to reach prometheus, reverted to mwmaint for now.

cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s8-29132790-wc6dx mediawiki-main-app 
extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: Start run
extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: Running on s8.dblist
Completion suggester disabled, quitting...
cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s5-29132790-dtbtg mediawiki-main-app
[...]
<Error, collected 1 message(s) on the way, no value set>
+----------+---------------------------+--------------------------------------+
| error    | Looks like the index has  |                                      |
|          | more than one identifier. |                                      |
|          |  You should delete all bu |                                      |
|          | t the one of them current |                                      |
|          | ly active. Here is the li |                                      |
|          | st: nupwiki_titlesuggest_ |                                      |
|          | 1746597241,nupwiki_titles |                                      |
|          | uggest_1746683639         |                                      |
+----------+---------------------------+--------------------------------------+

nupwiki Inferring index identifier...error

@Clement_Goubert thanks!

cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s8-29132790-wc6dx mediawiki-main-app 
extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: Start run
extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: Running on s8.dblist
Completion suggester disabled, quitting...

If it's wikidata or testwikidata, this is expected. Is this going to cause noise on your side? If so, we could consider exiting with 0 instead of an error.

cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s5-29132790-dtbtg mediawiki-main-app
[...]
<Error, collected 1 message(s) on the way, no value set>
+----------+---------------------------+--------------------------------------+
| error    | Looks like the index has  |                                      |
|          | more than one identifier. |                                      |
|          |  You should delete all bu |                                      |
|          | t the one of them current |                                      |
|          | ly active. Here is the li |                                      |
|          | st: nupwiki_titlesuggest_ |                                      |
|          | 1746597241,nupwiki_titles |                                      |
|          | uggest_1746683639         |                                      |
+----------+---------------------------+--------------------------------------+

nupwiki Inferring index identifier...error

It happens sometimes... the system is usually able to self-heal, but here it failed to do so because: Broken index nupwiki_titlesuggest_1746683639 appears to be in use, please check and delete.

Is mw-cron aware of the wiki it's running on? If so, could it add a new label to log entries to make filtering by wiki slightly easier in logstash?

@Clement_Goubert thanks!

cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s8-29132790-wc6dx mediawiki-main-app 
extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: Start run
extensions/CirrusSearch/maintenance/UpdateSuggesterIndex.php: Running on s8.dblist
Completion suggester disabled, quitting...

If it's wikidata or testwikidata, this is expected. Is this going to cause noise on your side? If so, we could consider exiting with 0 instead of an error.

It's when trying to run on s8, so wikidata, yes. I could also just remove s8 from the shards the script is running on?

cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s5-29132790-dtbtg mediawiki-main-app
[...]
<Error, collected 1 message(s) on the way, no value set>
+----------+---------------------------+--------------------------------------+
| error    | Looks like the index has  |                                      |
|          | more than one identifier. |                                      |
|          |  You should delete all bu |                                      |
|          | t the one of them current |                                      |
|          | ly active. Here is the li |                                      |
|          | st: nupwiki_titlesuggest_ |                                      |
|          | 1746597241,nupwiki_titles |                                      |
|          | uggest_1746683639         |                                      |
+----------+---------------------------+--------------------------------------+

nupwiki Inferring index identifier...error

It happens sometimes... the system is usually able to self-heal, but here it failed to do so because: Broken index nupwiki_titlesuggest_1746683639 appears to be in use, please check and delete.

Is mw-cron aware of the wiki it's running on? If so, could it add a new label to log entries to make filtering by wiki slightly easier in logstash?

Not in a structured way yet; it's something we need to think about. However, these jobs run through mwscriptwikiset, which prepends the wiki name to every log line, meaning a query like this would get you a specific wiki's run.

By the way, here are the completion times for the different shards:

cirrus-build-completion-indices-codfw-s1-29132790               1/1           72m
cirrus-build-completion-indices-codfw-s2-29132790               1/1           3h58m
cirrus-build-completion-indices-codfw-s3-29132790               0/1           6h6m # still running right now
cirrus-build-completion-indices-codfw-s4-29132790               1/1           2m37s
cirrus-build-completion-indices-codfw-s5-29132790               0/1           6h6m # failed
cirrus-build-completion-indices-codfw-s6-29132790               1/1           67m
cirrus-build-completion-indices-codfw-s7-29132790               1/1           3h13m
cirrus-build-completion-indices-codfw-s8-29132790               0/1           6h6m # failed

It's when trying to run on s8, so wikidata, yes. I could also just remove s8 from the shards the script is running on?

Sure, no need to run on s8 indeed.

Not in a structured way yet; it's something we need to think about. However, these jobs run through mwscriptwikiset, which prepends the wiki name to every log line, meaning a query like this would get you a specific wiki's run.

Thanks!

Trying to fix nupwiki, but something's very wrong there; I suspect it's because this wiki is fairly new.

Regarding the timings, this sounds correct to me; the build used to end around 9am UTC, so if it's taking slightly longer it's not an issue.

Change #1149613 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add netpol for prometheus HTTP

https://gerrit.wikimedia.org/r/1149613

Change #1149614 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance::cirrussearch: Skip s8

https://gerrit.wikimedia.org/r/1149614

dcausse added a subscriber: hoo.

pinging @hoo & Wikidata for visibility on the work on mediawiki_job_wikidata-updateQueryServiceLag.timer

pinging @hoo & Wikidata for visibility on the work on mediawiki_job_wikidata-updateQueryServiceLag.timer

Should that job alert to Wikidata rather than Discovery-Search?

pinging @hoo & Wikidata for visibility on the work on mediawiki_job_wikidata-updateQueryServiceLag.timer

Should that job alert to Wikidata rather than Discovery-Search?

I believe so? but @hoo please let us know if you think otherwise.

Change #1149613 merged by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Add netpol for prometheus HTTP

https://gerrit.wikimedia.org/r/1149613

Mentioned in SAL (#wikimedia-operations) [2025-05-23T09:50:12Z] <cgoubert@deploy1003> Started scap sync-world: 1149613: mediawiki: Add netpol for prometheus HTTP - T388538

Mentioned in SAL (#wikimedia-operations) [2025-05-23T09:52:16Z] <cgoubert@deploy1003> Finished scap sync-world: 1149613: mediawiki: Add netpol for prometheus HTTP - T388538 (duration: 03m 11s)

Change #1149623 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Revert^2 "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron"

https://gerrit.wikimedia.org/r/1149623

Change #1149623 merged by Clément Goubert:

[operations/puppet@production] Revert^2 "mw::maintenance: Migrate wikidata-updateQueryServiceLag to mw-cron"

https://gerrit.wikimedia.org/r/1149623

wikidata-updatequeryservicelag looks to be working fine:

cgoubert@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-cron$ kubectl logs wikidata-updatequeryservicelag-29133249-swm22 mediawiki-main-app 
extensions/Wikidata.org/maintenance/updateQueryServiceLag.php: Start run
Got lag of: 115 for host: wdqs1012.
Stored in cache with TTL of: 70.
extensions/Wikidata.org/maintenance/updateQueryServiceLag.php: Finished run

Change #1149614 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance::cirrussearch: Skip s8

https://gerrit.wikimedia.org/r/1149614

Clement_Goubert claimed this task.

pinging @hoo & Wikidata for visibility on the work on mediawiki_job_wikidata-updateQueryServiceLag.timer

Should that job alert to Wikidata rather than Discovery-Search?

{{done}}

cgoubert@deploy1003:~$ kubectl logs cirrus-build-completion-indices-codfw-s5-29132790-dtbtg mediawiki-main-app
[...]
<Error, collected 1 message(s) on the way, no value set>
+----------+---------------------------+--------------------------------------+
| error    | Looks like the index has  |                                      |
|          | more than one identifier. |                                      |
|          |  You should delete all bu |                                      |
|          | t the one of them current |                                      |
|          | ly active. Here is the li |                                      |
|          | st: nupwiki_titlesuggest_ |                                      |
|          | 1746597241,nupwiki_titles |                                      |
|          | uggest_1746683639         |                                      |
+----------+---------------------------+--------------------------------------+

nupwiki Inferring index identifier...error

A similar issue happened again on s3, for eowikisource:

eowikisource 2025-05-28 05:54:53 <OK, collected 1 message(s) on the way, bool value set>
eowikisource +----------+---------------------------+--------------------------------------+
eowikisource | warning  | Broken index eowikisource |                                      |
eowikisource |          | _titlesuggest_1746628286  |                                      |
eowikisource |          | appears to be in use, ple |                                      |
eowikisource |          | ase check and delete.     |                                      |
eowikisource +----------+---------------------------+--------------------------------------+
eowikisource Inferring index identifier...error
eowikisource <Error, collected 1 message(s) on the way, no value set>
eowikisource +----------+---------------------------+--------------------------------------+
eowikisource | error    | Looks like the index has  |                                      |
eowikisource |          | more than one identifier. |                                      |
eowikisource |          |  You should delete all bu |                                      |
eowikisource |          | t the one of them current |                                      |
eowikisource |          | ly active. Here is the li |                                      |
eowikisource |          | st: eowikisource_titlesug |                                      |
eowikisource |          | gest_1709179749,eowikisou |                                      |
eowikisource |          | rce_titlesuggest_17466282 |                                      |
eowikisource |          | 86                        |                                      |
eowikisource +----------+---------------------------+--------------------------------------+
eowikisource

Should we configure the jobs to keep going with the next wiki on error? Right now we're exiting the loop on non-zero exit codes.
cc @hnowlan @Scott_French @kamila to merge if needed, as I'm OOO for the next two days.

Change #1151749 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance::cirrussearch: foreachwiki ignore errors

https://gerrit.wikimedia.org/r/1151749

Should we configure the jobs to keep going with the next wiki on error? Right now we're exiting the loop on non-zero exit codes.
cc @hnowlan @Scott_French @kamila to merge if needed, as I'm OOO for the next two days.

Yes, we should continue on with the next wiki on error.
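The ignore-errors behaviour amounts to recording a failure and moving on rather than breaking the loop. A minimal sketch of the semantics (not mwscriptwikiset's actual code; MWSCRIPT and the function name are placeholders):

```shell
# Sketch of foreachwiki with ignore-errors semantics: a failing wiki is
# recorded and the loop continues; failures are summarized at the end.
MWSCRIPT="${MWSCRIPT:-/usr/local/bin/mwscript}"

run_all_wikis() {
    failed=""
    while read -r wiki; do
        "$MWSCRIPT" "$@" --wiki="$wiki" || failed="$failed $wiki"  # keep going
    done
    [ -z "$failed" ] || echo "failed wikis:$failed" >&2
}
# Usage: expanddblist s3 | run_all_wikis path/to/SomeMaintScript.php
```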

Change #1151749 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance::cirrussearch: foreachwiki ignore errors

https://gerrit.wikimedia.org/r/1151749

Change #1151755 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] sharded_periodic_job: Allow setting foreachwiki_ignore_errors

https://gerrit.wikimedia.org/r/1151755

Change #1151755 merged by Clément Goubert:

[operations/puppet@production] sharded_periodic_job: Allow setting foreachwiki_ignore_errors

https://gerrit.wikimedia.org/r/1151755

Change #1151760 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] sharded_periodic_job: Fix false case

https://gerrit.wikimedia.org/r/1151760

Change #1151760 merged by Clément Goubert:

[operations/puppet@production] sharded_periodic_job: Fix false case

https://gerrit.wikimedia.org/r/1151760

Should we configure the jobs to keep going with the next wiki on error? Right now we're exiting the loop on non-zero exit codes.
cc @hnowlan @Scott_French @kamila to merge if needed, as I'm OOO for the next two days.

Yes, we should continue on with the next wiki on error.

{{done}}

It looks like it's still having issues; in particular, the s3 job has been OOMKilled a few times recently and isn't completing a full build.

I took a brief look through the charts; possibly these jobs are using the main_app.requests.auto_compute=true option in the mediawiki chart, but it wasn't clear to me whether there's a way to set per-job requests.

Moving back to reported, as resolving the memory issue is tracked in T395465.

Since the issue is tracked elsewhere, and the jobs are effectively migrated, I'm resolving this task.