Migrate MediaWiki.resourceloader* metrics to statslib
Closed, ResolvedPublic

Assigned To

Authored By

	herron
	Jan 26 2024, 5:56 PM

Description

Focusing on metrics:

MediaWiki.resourceloader_build
MediaWiki.resourceloader_build.all
MediaWiki.resourceloader.responseTime
MediaWiki.resourceloader_build.$module.sample_rate
MediaWiki.resourceloader_cache.minify_js.hit
MediaWiki.resourceloader_cache.minify_css.hit
MediaWiki.resourceloader_cache.map_js.hit
MediaWiki.resourceloader.responseTime.sample_rate
MediaWiki.resourceloader_module_transfersize_bytes.{enwiki
MediaWiki.resourceloader_cache.minify_js.miss
MediaWiki.resourceloader_cache.minify*js.{hit
MediaWiki.resourceloader_cache.minify_js.*
MediaWiki.resourceloader_cache.minify_css.miss
MediaWiki.resourceloader_cache.minify*css.{hit
MediaWiki.resourceloader_cache.minify_css.*
MediaWiki.resourceloader_cache.map_js.miss
MediaWiki.resourceloader_cache.map_js.*

Follow the migration process as outlined below.

Secure/Conduct a code review.
Deploy the changes to production via the train (https://wikitech.wikimedia.org/wiki/Deployments/Train).
Verify that the changes have been successfully implemented.
Update the dashboard by replacing the old Graphite metric with the new Prometheus metric.
Please follow the guidelines and standards outlined in the provided documentation:

https://www.mediawiki.org/wiki/Manual:Stats for detailed guidance on the conversion process.
https://drive.google.com/file/d/12yQEuOapkML1vb9MgCaX1QzbLBdXE6X2/view for a video tutorial on the conversion process.
https://docs.google.com/presentation/d/1SZWf_D3mWNX-XHN8PHYI84LDZr6GUQC2AMhZ9mQXCI0/edit#slide=id.g2795460c956_0_23 for slides on the best practices for converting metrics to statslib.
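
As a rough illustration of what the conversion involves, the dotted Graphite keys listed above can be thought of as a base name plus trailing segments that become label values. The sketch below is only a model of that idea: the helper `graphite_to_prometheus`, the label names `type` and `status`, and the `_total` suffix are assumptions made for illustration, not the names used in the merged patches.

```python
def graphite_to_prometheus(key, label_names, suffix="_total"):
    """Split a dotted Graphite key into a Prometheus-style name and labels.

    label_names is a hypothetical schema: the trailing dot-separated
    segments of the key are used as values for these labels, in order.
    """
    segments = key.split(".")
    if len(segments) <= len(label_names):
        raise ValueError(f"key {key!r} has too few segments for {label_names}")
    split_at = len(segments) - len(label_names)
    # Remaining leading segments form the metric name (lower-cased, underscores).
    name = "_".join(segments[:split_at]).lower() + suffix
    labels = dict(zip(label_names, segments[split_at:]))
    return name, labels


# Label names "type" and "status" are assumed for this example only.
name, labels = graphite_to_prometheus(
    "MediaWiki.resourceloader_cache.minify_js.hit", ["type", "status"]
)
print(name, labels)
```

The point of the split is that one Prometheus metric with labels replaces a whole family of dotted keys, so wildcards such as `minify_js.*` become label matchers.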

Details

Subject	Repo	Branch	Lines +/-
blameStartupRegistry: Migrate metrics to Prometheus	mediawiki/extensions/WikimediaMaintenance	master	+37 -33
ResourceLoader: Migrate `resourceloader_cache.*.*` metric to statslib	mediawiki/core	master	+21 -14
ResourceLoader: migrate resourceloader.responseTime to statslib	mediawiki/core	master	+8 -1
ResourceLoader: Convert resourceloader_build metric to statslib	mediawiki/core	master	+8 -5


Related Objects

Status	Assigned	Task
Open	None	T343020 Converting MediaWiki Metrics to StatsLib
Resolved	herron	T350591 Audit legacy mediawiki stats used in production dashboards
Open	None	T350592 EPIC: migrate in use metrics and dashboards to statslib
Resolved	DAlangi_WMF	T355960 Migrate MediaWiki.resourceloader* metrics to statslib
Open	None	T359640 mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops
In Progress	None	T365265 Create a per-release deployment of statsd-exporter for mw-on-k8s

Event Timeline

herron created this task.Jan 26 2024, 5:56 PM

Change 993171 had a related patch set uploaded (by Herron; author: Herron):

[mediawiki/core@master] convert resourceloader_build metrics to statslib

https://gerrit.wikimedia.org/r/993171

gerritbot added a project: Patch-For-Review.Jan 26 2024, 6:10 PM

Change 993171 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Convert resourceloader_build metric to statslib

https://gerrit.wikimedia.org/r/993171

ReleaseTaggerBot added a project: MW-1.42-notes (1.42.0-wmf.17; 2024-02-06).Jan 31 2024, 5:00 PM

herron mentioned this in T350592: EPIC: migrate in use metrics and dashboards to statslib.Feb 1 2024, 10:17 PM

herron renamed this task from Migrate MediaWiki.resourceloader_build metrics to statslib to Migrate MediaWiki.resourceloader* metrics to statslib.Feb 2 2024, 4:28 PM

herron updated the task description.

Change 995254 had a related patch set uploaded (by Herron; author: Herron):

[mediawiki/core@master] ResourceLoader: migrate resourceloader.responseTime to statslib

https://gerrit.wikimedia.org/r/995254

Change 997341 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] ResourceLoader: Migrate `resourceloader_cache.*.*` metric to Prom

https://gerrit.wikimedia.org/r/997341

Change 995254 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: migrate resourceloader.responseTime to statslib

https://gerrit.wikimedia.org/r/995254

Change 997341 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Migrate `resourceloader_cache.*.*` metric to statslib

https://gerrit.wikimedia.org/r/997341

Maintenance_bot removed a project: Patch-For-Review.Feb 9 2024, 7:30 PM

ReleaseTaggerBot edited projects, added MW-1.42-notes (1.42.0-wmf.18; 2024-02-13); removed MW-1.42-notes (1.42.0-wmf.17; 2024-02-06).Feb 9 2024, 8:01 PM

DAlangi_WMF assigned this task to herron.Feb 27 2024, 10:27 AM

DAlangi_WMF edited projects, added MediaWiki-Platform-Team; removed MediaWiki-Platform-Team (Radar).

DAlangi_WMF moved this task from Inbox, needs triage to Current Sprint on the MediaWiki-Platform-Team board.

@herron and I made patches to tackle the RL metrics migration. I've verified that the various key groups (resourceloader_build, resourceloader_cache, resourceloader_response_time, ...) have all been migrated.

I'd like to run this by @Krinkle for further verification before we close this task. Do you see any traces of RL metrics still using StatsdDataFactory? If you spot one, let me know and I'll cover it. From my side I don't see any, and if I'm right, this task can be resolved.

I believe the ones in WikimediaMaintenance/blameStartupRegistry are still TODO, including:

resourceloader_startup_modules
resourceloader_startup_bytes
resourceloader_module_transfersize_bytes
resourceloader_module_decodedsize_bytes

Change 1008839 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/WikimediaMaintenance@master] blameStartupRegistry: Migrate metrics to Prometheus

https://gerrit.wikimedia.org/r/1008839

gerritbot added a project: Patch-For-Review.Mar 5 2024, 11:44 AM

DAlangi_WMF claimed this task.Mar 5 2024, 3:00 PM

Follow up from @Krinkle 's comment on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/1008839

I experimented with replacing these with counters which would allow us to detect when a metric stops being emitted, but it's somewhat non-intuitive. There's a relevant (rejected) bug with the foundation for a workaround in it: https://github.com/prometheus/prometheus/issues/3746

This counter:

$statsFactory->getCounter( 'resourceloader_startup_modules_total' )
  ->setLabel( 'wiki', $wikiFmt )
  ->setLabel( 'component', $componentFmt )
  ->incrementBy( $info['modules'] );

would stop incrementing a component if one went missing.

This query would likely need to be tuned, but this should give us something close to the graph the overall total gives us now: sum(max_over_time(increase(MediaWiki_resourceloader_startup_modules_total[2m])[1h:1m]) / 2)
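
To make the counter behaviour concrete, here is a minimal Python model of it. It is a simplification under stated assumptions: `increase` below only adds up deltas between consecutive samples and treats a drop as a counter reset, without the extrapolation that the real PromQL `increase()` performs, and the component names and sample values are made up.

```python
def increase(samples):
    """Simplified counter increase: sum of deltas, treating a drop as a reset."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter restarted from zero; count everything since the reset.
            total += cur
    return total


# Hypothetical cumulative counter samples, one per script run, per component.
series = {
    "core": [100, 110, 120],  # keeps adding 10 modules per run
    "ExtA": [50, 55, 60],     # keeps adding 5 modules per run
    "ExtB": [30, 33, 33],     # went missing before the last run
}

# Sum of per-component increases over the earlier and the latest interval.
earlier_total = sum(increase(s[:2]) for s in series.values())
latest_total = sum(increase(s[-2:]) for s in series.values())
print(earlier_total, latest_total)
```

In this model the missing component contributes nothing to the latest interval, so the summed increase drops as soon as it stops being emitted.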

Screenshot from 2024-03-08 00-31-42.png (715×972 px, 88 KB)

Screenshot from 2024-03-08 00-31-16.png (715×972 px, 45 KB)

Screenshot from 2024-03-08 00-30-53.png (715×972 px, 34 KB)

If we transition to counters, this means we cannot simply copyToStatsdAt() for backwards compatibility. The old calls should remain until the dashboard has been transitioned.

If we choose to stick with gauges, we will need to keep the per-component total separate from the overall total. Without it, we would not catch components going missing in a reasonable amount of time unless we somehow synchronized the lifecycles.

@colewhite Yeah, that seems a bit confusing long-term. I wouldn't want to rely so closely on the runtime and collection interval of mwmaint cronjob.

When working with incrementing counters, you can generally sum() across a range of label values (or a wildcard) to get the total increment over a given time range. There is still room for deception if the metrics aren't close to real time, for example if the increments are heavily buffered, or if they are simulated retroactively from something we compute. The underlying data change may have happened at a single moment, while we perceive each label incrementing slowly as we collect it bit by bit. Even so, it will be as accurate as our collection logic, and it correctly reflects when we first observed the change.

With gauges, this is much harder to reason about. It's perhaps somewhat analogous to host-level metrics like node_memory_Cached_bytes and node_memory_Buffers_bytes, where node_memory_MemTotal_bytes also exists as its own metric, instead of encouraging users to sum up the components to find the total.

With that, I've suggested that Derick port the Graphite metric over to Prometheus as-is, although renamed of course to fit the Prometheus conventions. We would have both resourceloader_startup_bytes{wiki, component} and resourceloader_startup_total_bytes{wiki}.
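
The reason for keeping a separate total gauge can be sketched with a small model. It assumes the metrics pipeline retains the last value written for each label set until it expires; the class `RetainingGaugeStore`, the wiki and component names, and the byte values are hypothetical and exist only for this illustration.

```python
class RetainingGaugeStore:
    """Keeps the last value set for each (name, labels) pair, with no expiry."""

    def __init__(self):
        self.values = {}

    def set(self, name, labels, value):
        self.values[(name, tuple(sorted(labels.items())))] = value

    def sum(self, name):
        return sum(v for (n, _), v in self.values.items() if n == name)


def record_run(store, wiki, bytes_by_component):
    """Emit one per-component gauge each, plus a separate overall total gauge."""
    for component, size in bytes_by_component.items():
        store.set("resourceloader_startup_bytes",
                  {"wiki": wiki, "component": component}, size)
    store.set("resourceloader_startup_total_bytes",
              {"wiki": wiki}, sum(bytes_by_component.values()))


store = RetainingGaugeStore()
record_run(store, "examplewiki", {"core": 1000, "ExtA": 500, "ExtB": 300})
# Second run: ExtB no longer exists, so nothing overwrites its old value.
record_run(store, "examplewiki", {"core": 1000, "ExtA": 600})

summed_components = store.sum("resourceloader_startup_bytes")
overall_total = store.sum("resourceloader_startup_total_bytes")
print(summed_components, overall_total)
```

Under that retention assumption, summing the per-component gauges still counts the stale ExtB value, while the dedicated total gauge reflects only what the latest run observed.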

Change 1008839 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] blameStartupRegistry: Migrate metrics to Prometheus

https://gerrit.wikimedia.org/r/1008839

Maintenance_bot removed a project: Patch-For-Review.Mar 11 2024, 9:30 PM

ReleaseTaggerBot edited projects, added MW-1.42-notes (1.42.0-wmf.22; 2024-03-12); removed MW-1.42-notes (1.42.0-wmf.18; 2024-02-13).Mar 11 2024, 10:00 PM

It was interesting to learn :) about the weirdness of gauges (especially in the scenario of no data) and how Graphite would smartly fill in the gaps if we wanted it to.

This patch was merged yesterday, meaning it will ride this week's train. I'll leave the task open until Friday before resolving it.

In T355960#9622935, @DAlangi_WMF wrote:

It was interesting to learn :) about the weirdness of gauges (especially in the scenario of no data) and how Graphite would smartly fill in the gaps if we wanted it to.

This patch was merged yesterday, meaning it will ride this week's train. I'll leave the task open until Friday before resolving it.

Thanks @DAlangi_WMF! 🚀 🌝

In T355960#9625163, @lmata wrote:

Thanks @DAlangi_WMF! 🚀 🌝

You're welcome. 😇 Extending your thanks to @Krinkle and @colewhite as well :), it was team work!

DAlangi_WMF closed this task as Resolved.Mar 15 2024, 10:30 AM

lmata awarded a token.Mar 16 2024, 12:02 AM

DAlangi_WMF merged a task: T359396: Migrate MediaWiki.resourceloader_build to statslib.Mar 21 2024, 11:05 AM

DAlangi_WMF merged a task: T359398: Migrate MediaWiki.resourceloader_startup_bytes to statslib.

lmata moved this task from Inbox to Done on the SRE Observability (FY2023/2024-Q3) board.Apr 4 2024, 6:11 PM

	F42451539: Screenshot from 2024-03-08 00-30-53.png
	Mar 8 2024, 12:48 AM

	F42451540: Screenshot from 2024-03-08 00-31-16.png
	Mar 8 2024, 12:48 AM

	F42451541: Screenshot from 2024-03-08 00-31-42.png
	Mar 8 2024, 12:48 AM
