Page MenuHomePhabricator

Metric anomaly in resourceloader_build_seconds after 26 June 2025
Closed, ResolvedPublic

Description

I'm monitoring server load when enabling unified mobile routing on the largest wikis this and next week (T403510). I was expecting to see a temporary rise in ResourceLoader builds while we re-populate the CDN for mobile versions on the standard domain.

The Grafana dashboard claims that the "startup" module is built only 2-3 times per minute, whereas this is usually ~1000 per minute.

Screenshot 2025-10-01 at 11.22.16.png (1×2 px, 229 KB)

Zooming out shows that something broke around 26 June 2025. This suggests to me that this isn't a "real" change, but rather an issue with the instrumentation having (almost) stopped working.

https://grafana.wikimedia.org/d/000000067/resourceloader-module-builds?orgId=1&from=now-6M&to=now&timezone=utc&var-module=startup&viewPanel=panel-17

Screenshot 2025-10-01 at 11.20.46.png (1×2 px, 325 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Tracking down the train deployment for that week (wmf.7) led me to https://www.mediawiki.org/wiki/MediaWiki_1.45/wmf.7, and it appears that I merged a related patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1159547 that touches this metric.

I do not see how the patch could have broken that metric yet (I am still investigating), but that is the only match I can find for now. Locally (before and after), I can confirm that resourcealoader_build_seconds are still emitted, and I can see metrics related to the startup module too.

@DAlangi_WMF I think that's the culprit indeed. That commit uses TimerMetric::start followed by TimerMetric::stop. This seems likely a case of T368073: TimingMetric and recursion edge case.

Does Module::buildContent recurse? I happen to know this, but this can be found emperically by debugging and detecting recursion or (if it doesn't happen locally, or you want a quicker answer) by looking at flame graphs from production.

https://performance.wikimedia.org/arclamp/svgs/daily/2025-10-20.excimer-wall.load.svgz?s=buildContent

StartupModule::buildContent calls Module::buildContent by way of getVersionHash. StartupModule doesn't recurse into itself, but it does (in one flat loop) call all other modules to compute their version hash. For modules that use Module::enableModuleContentVersion that means calling Module::buildContent. According to the flame graph, one example of that is VisualEditor.

LocalSetting
# $ socat -u UDP-RECVFROM:8125,fork,SO-RCVBUF=65536 -
$wgStatsTarget = 'udp://127.0.0.1:8125';
$wgStatsFormat = 'dogstatsd';

wfLoadExtension('VisualEditor');

http://localhost:4000/load.php?modules=startup&only=scripts&raw=1

…
mediawiki.resourceloader_build_seconds:2.731458|ms|#name:ext_visualEditor_data
mediawiki.resourceloader_build_seconds:58.686666|ms|#name:ext_visualEditor_data
…

Instead of reporting a timer for startup and one for ext_visualEditor_data, both are attributed to ext_visualEditor_data.

I suspect this is what happened in production as well.

Change #1198571 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] resourceloader: Use `observe()` on with recursive timing metric

https://gerrit.wikimedia.org/r/1198571

DAlangi_WMF changed the task status from Open to In Progress.Oct 24 2025, 5:12 PM
DAlangi_WMF claimed this task.

Change #1199036 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] stats: restore stats recursion tests

https://gerrit.wikimedia.org/r/1199036

Change #1199037 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] stats: add getLabels() function

https://gerrit.wikimedia.org/r/1199037

Change #1199038 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] stats: have RunningTimer set the labels

https://gerrit.wikimedia.org/r/1199038

Change #1199036 merged by jenkins-bot:

[mediawiki/core@master] stats: restore stats recursion tests

https://gerrit.wikimedia.org/r/1199036

Change #1199037 merged by jenkins-bot:

[mediawiki/core@master] Stats: add getLabels() function

https://gerrit.wikimedia.org/r/1199037

Change #1198571 abandoned by D3r1ck01:

[mediawiki/core@master] resourceloader: Use `setLabel()` after `start()` with recursive metric

Reason:

Fixed properly in the stats framework in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1199038.

https://gerrit.wikimedia.org/r/1198571

Change #1199038 merged by jenkins-bot:

[mediawiki/core@master] Stats: have RunningTimer manage the initial label set

https://gerrit.wikimedia.org/r/1199038

Change #1199854 had a related patch set uploaded (by D3r1ck01; author: Cwhite):

[mediawiki/core@wmf/1.45.0-wmf.25] Stats: add getLabels() function

https://gerrit.wikimedia.org/r/1199854

Change #1199856 had a related patch set uploaded (by D3r1ck01; author: Cwhite):

[mediawiki/core@wmf/1.45.0-wmf.25] Stats: have RunningTimer manage the initial label set

https://gerrit.wikimedia.org/r/1199856

Change #1200009 had a related patch set uploaded (by D3r1ck01; author: Cwhite):

[mediawiki/core@wmf/1.45.0-wmf.24] Stats: add getLabels() function

https://gerrit.wikimedia.org/r/1200009

Change #1200009 abandoned by D3r1ck01:

[mediawiki/core@wmf/1.45.0-wmf.24] Stats: add getLabels() function

https://gerrit.wikimedia.org/r/1200009

Change #1199854 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.25] Stats: add getLabels() function

https://gerrit.wikimedia.org/r/1199854

Change #1199856 merged by jenkins-bot:

[mediawiki/core@wmf/1.45.0-wmf.25] Stats: have RunningTimer manage the initial label set

https://gerrit.wikimedia.org/r/1199856

DAlangi_WMF reassigned this task from DAlangi_WMF to colewhite.
DAlangi_WMF added a subscriber: colewhite.

Backports were deployed in https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T2000, but something interesting happened, and I filed T408868: Log deployment activities on auto-detected changes during deployment.

Also, I can confirm that the metrics are back up again, thank you @colewhite, I'll really assign this to you instead, as you really fixed the issue, and I only helped verify and code review. Extending thanks to @Krinkle for all the help and support here as well.