
StatsLib timings MUST be recorded as milliseconds
Closed, ResolvedPublic

Description

From the conversation in T355837#10438043, we learned that the unit to use with TimingMetric::observe is absolutely required to be milliseconds, despite the metric being keyed with the _seconds suffix, and despite us previously tracking our timings in seconds just fine.

That means we have to redo all our migrations for timing metrics, and the data collected so far with StatsLib/Prometheus is useless.
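To make the unit requirement concrete, here is a minimal Python sketch of the conversion a caller needs when its own measurements are in seconds. `FakeTimingMetric` and `elapsed_ms` are hypothetical names used only for illustration; they are not part of StatsLib, whose actual API is the PHP `TimingMetric::observe` mentioned above.

```python
import time

MS_PER_SECOND = 1000.0


def elapsed_ms(start_seconds: float, end_seconds: float) -> float:
    """Convert a start/end pair measured in seconds into elapsed milliseconds."""
    return (end_seconds - start_seconds) * MS_PER_SECOND


class FakeTimingMetric:
    """Hypothetical stand-in for a timing metric; NOT the StatsLib API."""

    def __init__(self) -> None:
        self.samples_ms: list[float] = []

    def observe(self, value_ms: float) -> None:
        # The value is expected in milliseconds, even if the metric
        # name carries a _seconds suffix.
        self.samples_ms.append(value_ms)


metric = FakeTimingMetric()

start = time.perf_counter()   # perf_counter() returns seconds
sum(range(100_000))           # placeholder for the work being timed
end = time.perf_counter()

# Wrong: metric.observe(end - start) would record seconds.
# Right: convert to milliseconds before observing.
metric.observe(elapsed_ms(start, end))
```

The point of the sketch is only the multiplication by 1000 at the call site; everything else is scaffolding.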

Event Timeline

Change #1109051 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] fix(tracking): TimingMetric:observe records milliseconds

https://gerrit.wikimedia.org/r/1109051

Change #1109051 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] fix(tracking): TimingMetric:observe records milliseconds

https://gerrit.wikimedia.org/r/1109051

Change #1111196 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.12] fix(tracking): TimingMetric:observe records milliseconds

https://gerrit.wikimedia.org/r/1111196

Change #1111196 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.12] fix(tracking): TimingMetric:observe records milliseconds

https://gerrit.wikimedia.org/r/1111196

Mentioned in SAL (#wikimedia-operations) [2025-01-14T15:33:41Z] <Lucas_WMDE> previous deployment also included [[gerrit:rGERRIT111119698026|fix(tracking): TimingMetric:observe records milliseconds]] (T383208)

@Michael Is there anything else we need to do in this task? Should it still be in Incoming?

Mh, for practical purposes I would want to wait until tomorrow, when we will hopefully have some sensible data. Then we should be able to see whether the changes here had the desired effect and decide on next steps. But I can just as well move it to Doing.

I started looking into how to migrate one of our top performance metrics for Special:Homepage, "$platform rendering speed Special:Homepage (p99)". In this case I started with desktop as the platform.

But so far the data does not match at all:

image.png (313×773 px, 43 KB)

The query that is being used for Graphite:

aliasByNode(MediaWiki.timing.growthExperiments.specialHomepage.serverSideRender.desktop.p99, 4, 5, 6)

The query that is being used for Prometheus:

histogram_quantile(0.99, sum by(le) (
    rate(mediawiki_GrowthExperiments_special_homepage_server_side_render_seconds_bucket{platform="desktop"}[5m])
))

Not sure what I'm doing wrong.
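For readers comparing the two queries, the following Python sketch shows roughly how a quantile is estimated from cumulative histogram buckets by linear interpolation, which is the approach Prometheus documents for `histogram_quantile` on classic histograms. It is a simplified illustration and not the cause identified in this task: the bucket boundaries and counts are made up, and the final conversion assumes that the Graphite series is in milliseconds while the Prometheus series is in seconds.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound,
    with at least two entries, the last one being the +Inf bucket.
    """
    assert 0.0 < q <= 1.0
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls into the +Inf bucket: report the highest
                # finite upper bound instead of interpolating.
                return buckets[-2][0]
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed input


# Made-up example data: bucket upper bounds in seconds.
example_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]

p95_seconds = histogram_quantile(0.95, example_buckets)
p95_ms = p95_seconds * 1000.0  # assumed unit of the Graphite series
```

Because the result is interpolated within a bucket, it is an estimate whose precision depends on the bucket layout, so small differences from percentiles computed elsewhere are to be expected even when the units line up.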

Change #1113796 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] fix: track correct data in legacy Graphite query

https://gerrit.wikimedia.org/r/1113796

Change #1113796 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] fix: track correct data in legacy Graphite query

https://gerrit.wikimedia.org/r/1113796

Status update: currently working on creating a visualization and showing it on a dashboard.

This can now be considered done. I've migrated two panels on the Special:Homepage / Suggested Edits dashboard (the first two under "image recommendation service"), and the data is compatible with the data we used to get from Graphite.

Also, I wrote down some basic notes for this kind of dashboard on https://wikitech.wikimedia.org/wiki/Grafana/Best_practices/Getting_Started_with_Thanos_panels#Timings