Migrate MediaWiki.resourceloader* metrics to statslib
Closed, Resolved, Public

Description

Focusing on metrics:

  • MediaWiki.resourceloader_build
  • MediaWiki.resourceloader_build.all
  • MediaWiki.resourceloader.responseTime
  • MediaWiki.resourceloader_build.$module.sample_rate
  • MediaWiki.resourceloader_cache.minify_js.hit
  • MediaWiki.resourceloader_cache.minify_css.hit
  • MediaWiki.resourceloader_cache.map_js.hit
  • MediaWiki.resourceloader.responseTime.sample_rate
  • MediaWiki.resourceloader_module_transfersize_bytes.{enwiki
  • MediaWiki.resourceloader_cache.minify_js.miss
  • MediaWiki.resourceloader_cache.minify*js.{hit
  • MediaWiki.resourceloader_cache.minify_js.*
  • MediaWiki.resourceloader_cache.minify_css.miss
  • MediaWiki.resourceloader_cache.minify*css.{hit
  • MediaWiki.resourceloader_cache.minify_css.*
  • MediaWiki.resourceloader_cache.map_js.miss
  • MediaWiki.resourceloader_cache.map_js.*

Follow the migration process as outlined below.

  • Secure/conduct a code review.
  • Deploy the changes to production via the train (https://wikitech.wikimedia.org/wiki/Deployments/Train).
  • Verify that the changes have been successfully implemented.
  • Update the dashboard by replacing the old Graphite metric with the new Prometheus metric.

Please follow the guidelines and standards outlined in the provided documentation:

  • https://www.mediawiki.org/wiki/Manual:Stats for detailed guidance on the conversion process.
  • https://drive.google.com/file/d/12yQEuOapkML1vb9MgCaX1QzbLBdXE6X2/view for a video tutorial on the conversion process.
  • https://docs.google.com/presentation/d/1SZWf_D3mWNX-XHN8PHYI84LDZr6GUQC2AMhZ9mQXCI0/edit#slide=id.g2795460c956_0_23 for slides on the best practices for converting metrics to statslib.

Event Timeline

Change 993171 had a related patch set uploaded (by Herron; author: Herron):

[mediawiki/core@master] convert resourceloader_build metrics to statslib

https://gerrit.wikimedia.org/r/993171

Change 993171 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Convert resourceloader_build metric to statslib

https://gerrit.wikimedia.org/r/993171

herron renamed this task from Migrate MediaWiki.resourceloader_build metrics to statslib to Migrate MediaWiki.resourceloader* metrics to statslib. Feb 2 2024, 4:28 PM
herron updated the task description.

Change 995254 had a related patch set uploaded (by Herron; author: Herron):

[mediawiki/core@master] ResourceLoader: migrate resourceloader.responseTime to statslib

https://gerrit.wikimedia.org/r/995254

Change 997341 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] ResourceLoader: Migrate `resourceloader_cache.*.*` metric to Prom

https://gerrit.wikimedia.org/r/997341

Change 995254 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: migrate resourceloader.responseTime to statslib

https://gerrit.wikimedia.org/r/995254

Change 997341 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Migrate `resourceloader_cache.*.*` metric to statslib

https://gerrit.wikimedia.org/r/997341

@herron and I made patches to tackle the RL metrics migration. I've verified that the various key groups (resourceloader_build, resourceloader_cache, and resourceloader_response_time...) have all been migrated.

I'd like to run this by @Krinkle for further verification before we close this task. Do you see any traces of RL metrics still using StatsdDataFactory? If you spot one, let me know and I'll cover it. From my side I don't see any, and if I'm right, this task can be resolved.

I believe the ones in WikimediaMaintenance/blameStartupRegistry are still TODO, including:

  • resourceloader_startup_modules
  • resourceloader_startup_bytes
  • resourceloader_module_transfersize_bytes
  • resourceloader_module_decodedsize_bytes

Change 1008839 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/extensions/WikimediaMaintenance@master] blameStartupRegistry: Migrate metrics to Prometheus

https://gerrit.wikimedia.org/r/1008839

Follow-up from @Krinkle's comment on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/1008839

I experimented with replacing these with counters which would allow us to detect when a metric stops being emitted, but it's somewhat non-intuitive. There's a relevant (rejected) bug with the foundation for a workaround in it: https://github.com/prometheus/prometheus/issues/3746

This counter:

$statsFactory->getCounter( 'resourceloader_startup_modules_total' )
  ->setLabel( 'wiki', $wikiFmt )
  ->setLabel( 'component', $componentFmt )
  ->incrementBy( $info['modules'] );

would stop incrementing a component if one went missing.

This query would likely need to be tuned, but it should give us something close to the graph that the overall total gives us now:

sum(max_over_time(increase(MediaWiki_resourceloader_startup_modules_total[2m])[1h:1m]) / 2)
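To make the counter idea concrete, here is a minimal Python sketch rather than MediaWiki code. The `Counter` class, the component names, and the module counts are hypothetical stand-ins, and `increase()` here is a plain difference between two scrapes, not Prometheus's extrapolating function. It only illustrates that the summed increase drops once a component stops being reported.

```python
from collections import defaultdict


class Counter:
    """Toy monotonic counter keyed by a label tuple (hypothetical stand-in)."""

    def __init__(self):
        self.values = defaultdict(int)

    def increment_by(self, labels, amount):
        self.values[labels] += amount

    def snapshot(self):
        # Copy the current values, as a scrape would see them.
        return dict(self.values)


def increase(before, after):
    """Per-series growth between two scrapes (simplified, no extrapolation)."""
    return {key: after[key] - before.get(key, 0) for key in after}


counter = Counter()

# Each dict is one run of the maintenance script: component -> module count.
# The component names and numbers are made up for illustration.
runs = [
    {"mediawiki": 40, "wikibase": 12},
    {"mediawiki": 40, "wikibase": 12},
    {"mediawiki": 40},  # "wikibase" went missing in this run
]

scrapes = [counter.snapshot()]
for run in runs:
    for component, modules in run.items():
        counter.increment_by(("enwiki", component), modules)
    scrapes.append(counter.snapshot())

# Sum the per-series increase between consecutive scrapes.
per_run_totals = [
    sum(increase(before, after).values())
    for before, after in zip(scrapes, scrapes[1:])
]
print(per_run_totals)  # [52, 52, 40]
```

The missing component contributes an increase of zero in the last run, so the total falls from 52 to 40 without any series having to expire first.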

Screenshot from 2024-03-08 00-31-42.png (715×972 px, 88 KB)

Screenshot from 2024-03-08 00-31-16.png (715×972 px, 45 KB)

Screenshot from 2024-03-08 00-30-53.png (715×972 px, 34 KB)

If we transition to counters, we cannot simply use copyToStatsdAt() for backwards compatibility. The old calls should remain until the dashboard has been transitioned.

If we choose to stick with gauges, we will need to keep the per-component total separate from the overall total. Without a separate overall total, we would not catch components going missing in a reasonable amount of time unless we somehow synchronized the lifecycles.

@colewhite Yeah, that seems a bit confusing long-term. I wouldn't want to rely so closely on the runtime and collection interval of the mwmaint cron job.

When working with incrementing counters, you can generally sum() across a range of label values (or a wildcard) to get the total increment over a particular range of time. There is still room for deception if the metrics aren't close to real time, e.g. if the increments are very heavily buffered, or if they are simulated retroactively based on something we compute. The underlying data change might have happened at a single moment in time, yet we would perceive each label incrementing slowly as we collect it bit by bit. Even so, it will be as accurate as our collection logic, and it correctly reflects when we first observed the change.

With gauges, this is much harder to reason about. It's perhaps somewhat analogous to host-level metrics like node_memory_Cached_bytes and node_memory_Buffers_bytes, where node_memory_MemTotal_bytes also exists as its own metric, instead of encouraging users to sum up the components to find the total.

With that, I've suggested that Derick port the Graphite metric over to Prometheus as-is, although renamed of course to fit the Prometheus conventions. We would have both resourceloader_startup_bytes{wiki, component} and resourceloader_startup_total_bytes{wiki}.
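A minimal Python sketch of why the separate total gauge helps. It is not the actual maintenance script: `GaugeStore`, `report()`, the component names, and the byte sizes are hypothetical, and it assumes the metrics backend keeps serving a gauge's last value after the script stops setting it. Only the two metric names come from the discussion above.

```python
class GaugeStore:
    """Toy gauge backend that keeps each series' last value (an assumption)."""

    def __init__(self):
        self.series = {}

    def set(self, name, labels, value):
        key = (name, tuple(sorted(labels.items())))
        self.series[key] = value

    def sum_of(self, name):
        # Sum every stored series of this metric, stale or not.
        return sum(v for (n, _), v in self.series.items() if n == name)


def report(store, wiki, sizes):
    """One run of the (hypothetical) script: per-component gauges plus a total."""
    for component, size in sizes.items():
        store.set(
            "resourceloader_startup_bytes",
            {"wiki": wiki, "component": component},
            size,
        )
    store.set(
        "resourceloader_startup_total_bytes",
        {"wiki": wiki},
        sum(sizes.values()),
    )


store = GaugeStore()
report(store, "enwiki", {"core": 100, "ext_a": 50})
report(store, "enwiki", {"core": 100})  # "ext_a" is no longer reported

# Summing the component gauges still includes the stale "ext_a" series.
summed_components = store.sum_of("resourceloader_startup_bytes")
# The dedicated total is overwritten on every run, so it is current.
explicit_total = store.sum_of("resourceloader_startup_total_bytes")

print(summed_components, explicit_total)  # 150 100
```

Under that retention assumption, the sum of the component gauges stays at 150 while the dedicated total drops to 100, so the explicit total reveals the missing component right away.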

Change 1008839 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] blameStartupRegistry: Migrate metrics to Prometheus

https://gerrit.wikimedia.org/r/1008839

DAlangi_WMF changed the task status from Open to In Progress. Mar 12 2024, 10:43 AM

It was interesting to learn :) about the weirdness of gauges (especially in the scenario of no data) and how Graphite would smartly fill in the gaps if we wanted it to.

This patch was merged yesterday, meaning it will ride this week's train. I'll leave the task open until Friday before resolving it.

Thanks @DAlangi_WMF! 🚀 🌝

You're welcome. 😇 Extending your thanks to @Krinkle and @colewhite as well :), it was team work!