
Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors
Closed, Resolved, Public

Description

After T361399, we started creating per-page metrics under MediaWiki.rest_api_latency and MediaWiki.rest_api_errors, e.g. MediaWiki.rest_api_latency._en.wikipedia.org_v3_page_pagebundle_Athletics_at_the_2024_Summer_Olympics_E2_80_93_Women_27s_5000_metres.GET.200.sample_rate. This will not scale, because the metric name embeds the page title; we should group these metrics into lower-cardinality ones.

I'm not 100% sure, but the culprit might be https://gerrit.wikimedia.org/r/c/mediawiki/core/+/969811
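
To illustrate the cardinality problem described above, here is a minimal Python sketch. It is not MediaWiki's implementation: the `sanitize` rule, the function names, the route template `/v1/revision/{from}/compare/{to}`, and the sample request paths are all assumptions, chosen only so that the resulting keys resemble the key shapes quoted in this task. It contrasts a key built from the concrete request path (one key per distinct request) with a key built from the route template (one key per route).

```python
import re


def sanitize(component: str) -> str:
    """Collapse each run of characters outside [A-Za-z0-9.] into one underscore.

    Assumed rule for illustration only, not taken from MediaWiki.
    """
    return re.sub(r"[^A-Za-z0-9.]+", "_", component)


def metric_key(prefix: str, path: str, method: str, status: int) -> str:
    """Build a dotted Graphite-style key from a path (or route template)."""
    return f"{prefix}.{sanitize(path)}.{method}.{status}"


PREFIX = "MediaWiki.rest_api_latency"

# Hypothetical route template and three concrete requests that match it.
template = "/v1/revision/{from}/compare/{to}"
request_paths = [
    "/v1/revision/100/compare/200",
    "/v1/revision/101/compare/205",
    "/v1/revision/300/compare/301",
]

# Keyed on the concrete path: the number of keys grows with the traffic.
per_request_keys = {metric_key(PREFIX, p, "GET", 200) for p in request_paths}

# Keyed on the route template: a single key no matter how many requests.
per_route_keys = {metric_key(PREFIX, template, "GET", 200) for _ in request_paths}

print(len(per_request_keys), sorted(per_request_keys))
print(len(per_route_keys), sorted(per_route_keys))
```

Under these assumptions, the number of Graphite series is bounded by the number of routes, methods, and status codes rather than by the number of pages or revisions requested.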

Event Timeline

Change #1032401 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: blackhole MediaWiki.rest_api_latency

https://gerrit.wikimedia.org/r/1032401

fgiunchedi renamed this task from Per-page graphite metrics created for MediaWiki.rest_api_latency to Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors. Thu, May 16, 9:24 AM

Change #1032401 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: blackhole MediaWiki.rest_api metrics

https://gerrit.wikimedia.org/r/1032401

I've temporarily blackholed the metrics from graphite, though of course mw should stop sending them. cc @hashar @daniel @BPirkle

Mentioned in SAL (#wikimedia-operations) [2024-05-16T09:44:01Z] <godog> clean up MediaWiki.rest_api_latency and MediaWiki.rest_api_errors - T365111

daniel changed the task status from Open to In Progress. Tue, May 21, 1:55 PM
daniel triaged this task as Unbreak Now! priority.
daniel lowered the priority of this task from Unbreak Now! to High.
daniel moved this task from Incoming (Needs Triage) to In Progress on the MW-Interfaces-Team board.

Change #1034508 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] REST: fix metrics keys

https://gerrit.wikimedia.org/r/1034508

Change #1034508 merged by jenkins-bot:

[mediawiki/core@master] REST: fix metrics keys

https://gerrit.wikimedia.org/r/1034508

Change #1034868 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@wmf/1.43.0-wmf.6] REST: fix metrics keys

https://gerrit.wikimedia.org/r/1034868

Change #1034873 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@wmf/1.43.0-wmf.5] REST: fix metrics keys

https://gerrit.wikimedia.org/r/1034873

Change #1034868 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.6] REST: fix metrics keys

https://gerrit.wikimedia.org/r/1034868

Mentioned in SAL (#wikimedia-operations) [2024-05-22T11:42:58Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:1034868|REST: fix metrics keys (T365111)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-22T11:45:40Z] <daniel@deploy1002> daniel: Backport for [[gerrit:1034868|REST: fix metrics keys (T365111)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Change #1034873 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.5] REST: fix metrics keys

https://gerrit.wikimedia.org/r/1034873

Mentioned in SAL (#wikimedia-operations) [2024-05-22T12:00:23Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:1034868|REST: fix metrics keys (T365111)]] (duration: 17m 25s)

Mentioned in SAL (#wikimedia-operations) [2024-05-22T12:01:39Z] <daniel@deploy1002> Started scap: Backport for [[gerrit:1034873|REST: fix metrics keys (T365111)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-22T12:04:15Z] <daniel@deploy1002> daniel: Backport for [[gerrit:1034873|REST: fix metrics keys (T365111)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-22T12:18:32Z] <daniel@deploy1002> Finished scap: Backport for [[gerrit:1034873|REST: fix metrics keys (T365111)]] (duration: 16m 53s)

The fix has been backported and deployed. It should now be safe to re-enable the metrics.

fgiunchedi claimed this task.

> The fix has been backported and deployed. It should now be safe to re-enable the metrics.

It is indeed, and we're back to low-cardinality metrics, e.g. MediaWiki.rest_api_latency._v1_revision_from_compare_to_.GET.200.median

thank you @daniel for the quick action and fix on this! Resolving

Hello, a silly question from the uninitiated: will the historic "wrongly labelled" metric data be re-bucketed, or should we assume that there will be no metric data for the period in question?
To be clear: WMDE does not badly need that data. It would be nice to have, but we do not need people to spend weeks getting it back.

> Hello, a silly question from the uninitiated: will the historic "wrongly labelled" metric data be re-bucketed, or should we assume that there will be no metric data for the period in question?

The latter, i.e. no re-bucketing for the time period. Also, not a silly question but a reasonable one!

I went back and checked the current metrics, and realized I was too hasty in cleaning up the rest_api metrics: even the earlier, correctly bucketed metrics got cleaned up. I apologize for the disruption; metrics will be recorded correctly starting today.