
Migrate "startupregistrystats" maintenance script to k8s-mw-cron (mediawiki-platform-team)
Closed, Resolved (Public)

Description

Migrate the MediaWiki-Platform-Team periodic MediaWiki maintenance script from mwmaint to mw-cron on Kubernetes.

Job name                                                 Criticality   Done?
mediawiki_job_startupregistrystats-mediawikiwiki.timer                 y
mediawiki_job_startupregistrystats-testwiki.timer                      y
mediawiki_job_startupregistrystats.timer                               y

Doc on the new platform

ServiceOps will handle migrating the jobs, but would appreciate input from MediaWiki-Platform-Team on:

  • jobs that should be watched more
  • jobs that are low criticality and could be migrated first
  • outdated jobs that can be removed
  • any potential gotchas in the way these jobs use MediaWiki

Event Timeline

Krinkle renamed this task from Migrate mediawiki-platform-team jobs to mw-cron to Migrate "startupregistrystats" maintenance script to k8s-mw-cron (mediawiki-platform-team).Mar 24 2025, 8:32 AM
Krinkle updated the task description. (Show Details)
  • jobs that should be watched more:

Yes, the main one of the three. Details below.

  • jobs that are low criticality and could be migrated first

Yes, the "-testwiki" and "-mediawikiwiki" jobs are great ones to try first.

  • jobs that can be removed

None.

  • any potential gotchas in the way these jobs use MediaWiki

This maintenance script is responsible for performance metrics, specifically for measuring the size and quantity of CSS/JS modules as loaded by web browsers during page views. The metrics are written to Graphite/Prometheus.

The bottom-most row (collapsed by default) serves as an indicator that the script is running and submitting metrics.

All other plots on that dashboard should stay constant and simply produce the same result the next time the script is run, except for shortly after a train progression (Tue-Wed-Thu) on the selected wiki, or after major changes (e.g. backports enabling new features, or major on-wiki gadget changes).

I was looking at setting up T385709: Periodic job alerting, and realized you've already been onboarded to alertmanager and alerts for your team are sent via email. Do you want the same for periodic jobs, or would you rather have phabricator tasks created with your team's PHID?

Change #1131025 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Add mediawiki-platform-task

https://gerrit.wikimedia.org/r/1131025

Change #1131037 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_job: Migrate blameStartupRegistry.php

https://gerrit.wikimedia.org/r/1131037

Change #1131025 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Add mediawiki-platform-task

https://gerrit.wikimedia.org/r/1131025

Mentioned in SAL (#wikimedia-operations) [2025-03-31T12:08:15Z] <claime> Deploying 1131037 mw::periodic_job: Migrate blameStartupRegistry.php - T388540

Change #1131037 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_job: Migrate blameStartupRegistry.php

https://gerrit.wikimedia.org/r/1131037

Change #1132614 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_job: Fix blameStartupRegistry.php timing

https://gerrit.wikimedia.org/r/1132614

Change #1132614 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_job: Fix blameStartupRegistry.php timing

https://gerrit.wikimedia.org/r/1132614

testwiki periodic job migrated to kubernetes:

kubectl get cronjobs.batch mediawiki-main-startupregistrystats-testwiki
NAME                                           SCHEDULE     SUSPEND   ACTIVE   LAST SCHEDULE   AGE
mediawiki-main-startupregistrystats-testwiki   10 * * * *   False     0        <none>          28m
cgoubert@mwmaint1002:~$ sudo systemctl status mediawiki_job_startupregistrystats-testwiki.timer                                                                 
Unit mediawiki_job_startupregistrystats-testwiki.timer could not be found.

Now we wait until 13:10 UTC for execution

Because of a misconfiguration in the mediawiki chart, the 13:10 UTC run for testwiki was not successful. Fixes are pending and should be up for the next run.

The fixed chart was deployed in time for the 14:10 UTC run, which seems to have been successful. Logstash has the logs if you want to check. Note that the selectors don't seem to work correctly at the moment, but you can use filters for kubernetes.labels.team: mediawiki-platform and kubernetes.labels.script: blameStartupRegistry.php to scope down.
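For reference, the two label filters mentioned above can be expressed as an OpenSearch bool query. This is a hedged sketch: only the two term filters come from the comment above; the index pattern and any client usage are hypothetical illustrations.

```python
# Sketch of the Logstash/OpenSearch filters described above.
# The two term filters are taken from the comment; the index pattern
# ("logstash-*") and client usage are assumptions for illustration.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"kubernetes.labels.team": "mediawiki-platform"}},
                {"term": {"kubernetes.labels.script": "blameStartupRegistry.php"}},
            ]
        }
    }
}

# With an OpenSearch client this could be passed as, e.g.:
# client.search(index="logstash-*", body=query)
```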

Clement_Goubert changed the task status from Open to In Progress.Apr 11 2025, 11:17 AM

[…] Logstash has the logs if you want to check. […]

From the task description at T391574:

I notice that both the link in your comment, and the link in the Phaultfinder reporter, are unsharable links. When I or someone else open them, they yield "Unable to completely restore the URL, be sure to use the share functionality."

See also https://wikitech.wikimedia.org/wiki/OpenSearch_Dashboards#Link_sharing

The fixed chart was deployed in time for the 14:10 UTC run, which seems to have been successful.

LGTM. Feel free to go ahead with the others.

[…] I notice that both the link in your comment, and the link in the Phaultfinder reporter, are unsharable links. […]

Updated link https://w.wiki/DocP

I'll update the alert template and wikitech doc.

Change #1138689 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mediawiki: migrate startupregistrystats-mediawikiwiki to k8s

https://gerrit.wikimedia.org/r/1138689

Change #1138689 merged by Hnowlan:

[operations/puppet@production] mediawiki: migrate startupregistrystats-mediawikiwiki to k8s

https://gerrit.wikimedia.org/r/1138689

mediawikiwiki job migrated, appears to have run as expected also. Given how things have gone so far, would it be safe to migrate the remaining job next week?

LGTM. Feel free to go ahead with the others.

I see no need for further testing. Anytime earlier is fine too.

Change #1139020 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mediawiki::maintenance: migrate main startupregistrystats job to k8s

https://gerrit.wikimedia.org/r/1139020

Change #1139020 merged by Hnowlan:

[operations/puppet@production] mediawiki::maintenance: migrate main startupregistrystats job to k8s

https://gerrit.wikimedia.org/r/1139020

The main blameStartupRegistry.php is currently doing its first run on k8s - logs make it look okay so far (it can be followed by running kube_env mw-cron eqiad and then kubectl logs startupregistrystats-29097395-znp8v mediawiki-main-app -f on a deploy server). I'll resolve once the run finishes if there are no errors.

hnowlan claimed this task.
hnowlan updated the task description. (Show Details)

Last run appears to have been successful - closing this ticket for now but please get in touch if anything looks awry

While Prometheus counters are pretty straightforward to aggregate, I'm not sure what to do with gauges.

https://grafana.wikimedia.org/d/BvWJlaDWk/startup-manifest-size

Screenshot 2025-04-28 at 21.35.47.png (770×2 px, 112 KB)

Given that the mwmaint server and the mw-cron (k8s) deployment each have their own statsd-exporter instance, the old data continues to be re-reported and re-scraped with no obvious way to ignore the stale data. Even introspection with timestamp() doesn't help here, since the "old" data is still newly scraped every 30 seconds.

I suppose a one-off fix here could be to restart statsd-exporter on mwmaint, but that doesn't address the general issue that in Prometheus, if a metric has any infrastructure-level labels unrelated to the MediaWiki application that may alternate or otherwise change over time (e.g. data center, k8s pod template), then we're going to see echoes of stale data for a while.

Is there a best practice for how to query these correctly such that when multiple are found, the correct/most recent is returned for any given interval point?

The above dashboard was migrated from Graphite to Prometheus by @andrea.denisse and applies max() as a tie-breaker. This is fine when aggregating/zooming out across multiple valid data points (e.g. zoom out from 5m to 1h and pick the max from that period); for the above problem, however, it just means data from days or weeks ago effectively overwrites recent data if it happens to be higher.
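To make the failure mode concrete, here is a toy sketch (all values and timestamps invented) of why max() keeps the stale series, and why selecting by recency would fix it — if Prometheus tracked the time of the last real update, which is exactly what it cannot do here because the stale exporter keeps re-reporting the old value on every scrape:

```python
# Toy illustration of the problem described above: two statsd-exporter
# instances report the same gauge; the stale one still "wins" under max().
samples = [
    # (source, unix_timestamp_of_last_real_update, value) -- invented numbers
    ("mwmaint", 1_700_000_000, 5200),   # stale: the job no longer runs here
    ("mw-cron", 1_700_600_000, 4800),   # fresh: the current module size
]

# max() tie-breaking: the stale 5200 overwrites the recent 4800.
by_max = max(v for _, _, v in samples)

# Picking the most recently *updated* series would return the fresh value,
# but Prometheus only sees scrape time, which is recent for both series.
by_recency = max(samples, key=lambda s: s[1])[2]
```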

[…] While Prometheus counters are pretty straightforward to aggregate, I'm not sure what to do with gauges. […]

I think this issue is fundamentally related to Prometheus not having null like Graphite does.
In this case the gauge can emit the same value forever unless it's cleared (like in the statsd-exporter restart you mention).
Implementing max() served as a workaround to imply recentness in the gauge data. I wonder if using filters and/or transformations from inside Grafana could help us overcome the null limitation.
I'm also tagging @colewhite and @fgiunchedi for their insights.

Thank you for the heads up @andrea.denisse. Please see https://phabricator.wikimedia.org/T228380#10774463 for the cross-post of this issue; we can follow up there.