Audit dashboards using histogram_quantile on mediawiki_WikimediaEvents_editResponseTime
Closed, Resolved, Public

Description

Similar to T389357: Audit dashboards using histogram_quantile on big envoy metrics and move to recording rules. This is a big metric (~250k cardinality in eqiad alone), and running histogram_quantile on it without recording rules will overload Thanos.

I did a temporary edit to the dashboard to at least get it to load (i.e. disable queries in panels not using the recording rules) so at least we can amend it. cc @Krinkle

Summary

Searched for mediawiki_WikimediaEvents_editResponseTime_seconds_bucket and found 2 matching dashboards and 0 matching alerts.

Event Timeline

I have restored "history of backend save timing" using the rate5m recording rule now (i.e. no daily or weekly rollup, though I _think_ we can build those out of the rate5m recording rule?)

Mentioned in SAL (#wikimedia-operations) [2025-04-14T14:01:41Z] <godog> temp disable "backend time" panel using unaggregated big mediawiki metric on "reading web performance" dashboard - T391677

The mediawiki_WikimediaEvents_editResponseTime_seconds metric is first and foremost a performance metric (e.g. for an SLO) about the MediaWiki backend. It stores a timer with histogram buckets and a handful of labels that focus on areas that differ in terms of backend invocation (i.e. wikitext articles vs discussion pages vs wikidata) and client expectation (entrypoint, bot vs user).

Grafana dashboard: Backend Save Timing breakdown.

This dashboard still uses Graphite at the moment due to T371102, but in attempting to convert it this week, @fgiunchedi noticed a new issue: it times out even for recent data.

It seems that in September and October 2024 this Prometheus metric exploded from 85 timeseries to over 340,000. As a result, relatively simple queries like these (edit count and p75) often cannot be served within the query timeout, because of the vast number of timeseries they have to crawl and aggregate in real time.

sum(irate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[5m]))

histogram_quantile(0.75, sum by (le) (rate(mediawiki_WikimediaEvents_editResponseTime_seconds_bucket[5m])))
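
To make the cost of the p75 query concrete, here is a minimal Python sketch of what `sum by (le)` followed by `histogram_quantile` computes. It is a simplified illustration, not Prometheus's actual implementation; the two sample series and their bucket bounds are made up. The point is that every matching series has to be read and summed per bucket before the quantile can be estimated, so the work grows with the number of series.

```python
import math


def sum_by_le(series_list):
    """Sum cumulative bucket counts across series, keyed by upper bound (le)."""
    totals = {}
    for series in series_list:
        for le, count in series.items():
            totals[le] = totals.get(le, 0.0) + count
    return sorted(totals.items())


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    Simplified: assumes the lowest bucket starts at 0 and the last bound is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # fall back to the last finite bound seen.
                return prev_bound
            # Linear interpolation inside the bucket containing the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical per-series bucket rates (le in seconds -> cumulative rate).
series_a = {0.1: 10.0, 0.5: 30.0, 1.0: 38.0, math.inf: 40.0}
series_b = {0.1: 0.0, 0.5: 10.0, 1.0: 20.0, math.inf: 20.0}

p75 = histogram_quantile(0.75, sum_by_le([series_a, series_b]))
print(f"p75 = {p75:.3f}s")
```

With two series this is trivial; with hundreds of thousands of series the per-bucket summation is what has to finish within the query timeout.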

It seems this happened in T375496: Temp accounts Grafana Dashboard: Edit rate for anonymous IP editors, named accounts, and temp accounts (part of T357763) which added the labels wiki (x500), platform (x4), and is_mobile (x2) in order to inform the rollout of the Temporary accounts feature.

If this instrumentation is still needed, I suggest we replace this (i.e. revert the above) in favour of adding a separate counter with just these labels. This counter can be something like mediawiki_WikimediaEvents_edits_total and would not have the histogram buckets and performance labels (~85 series), but instead have only the above three labels (500 x 4 x 2) to produce ~4000 series.

These should be able to co-exist without being matrixed into the 340,000 multiple we have today.
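
The arithmetic behind the two figures can be sketched as follows, using the approximate label counts quoted in this task (85 base series; wiki x500, platform x4, is_mobile x2). The variable names are illustrative only.

```python
from math import prod

# Approximate figures quoted in this task.
BASE_SERIES = 85  # histogram buckets x performance labels, before T375496
ROLLOUT_LABELS = {"wiki": 500, "platform": 4, "is_mobile": 2}

rollout_combinations = prod(ROLLOUT_LABELS.values())  # 500 * 4 * 2

# Today: the rollout labels sit on the histogram itself, so they multiply it.
matrixed = BASE_SERIES * rollout_combinations

# Proposed: keep the histogram as it was and add a separate counter,
# so the two series counts add up instead of multiplying.
separate = BASE_SERIES + rollout_combinations

print(f"matrixed: {matrixed} series, separate: {separate} series")
```

Labels on the same metric multiply its series count, while labels on a separate metric only add to the total.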

Change #1136708 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] phpunit: Skip tests that fail locally without CentralAuth or GlobalPref

https://gerrit.wikimedia.org/r/1136708

Change #1136708 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] phpunit: Skip tests that fail locally without CentralAuth or GlobalPref

https://gerrit.wikimedia.org/r/1136708

Change #1137083 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] Remove wiki label from editResponseTime in favor of edits_total

https://gerrit.wikimedia.org/r/1137083

Change #1137083 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Remove wiki label from editResponseTime in favor of edits_total

https://gerrit.wikimedia.org/r/1137083

Remaining work: Next week, after the above has begun to roll out on group0 (and thus the new metrics exist):

For Trust & Safety Product:

  • update the Temp accounts Grafana dashboard to use the or operator in Prometheus and query the new metric so that there is visual continuity rather than data stopping and/or only existing after that date.

For MW Eng:

  • update the "Edit Count" and try to migrate away from the editResponseTime recording rule, to the site_stats_total metric.
  • update the "Backend Save Timing" dashboard and determine if the raw "editResponseTime" now performs well enough to not need the recording rule.
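
As a rough illustration of why the `or` operator gives visual continuity on the Temp accounts dashboard: in PromQL, `a or b` keeps every element of `a` and adds the elements of `b` whose label sets are not present in `a`. The Python sketch below mimics that for one evaluation timestamp at a time. The wiki names and values are made up, and it assumes both sides of the `or` have already been aggregated to the same labels.

```python
def promql_or(left, right):
    """Mimic PromQL `left or right` at a single evaluation timestamp.

    Each argument maps a label set (a tuple of label values) to a sample value.
    """
    result = dict(left)
    for labels, value in right.items():
        # Only add right-hand elements whose label set is missing on the left.
        result.setdefault(labels, value)
    return result


# Hypothetical (new metric, old metric) results at successive timestamps.
steps = [
    ({}, {("testwiki",): 3.0, ("enwiki",): 12.0}),  # before the rollout
    ({("testwiki",): 2.0}, {("enwiki",): 11.0}),  # rolled out to group0 only
    ({("testwiki",): 4.0, ("enwiki",): 10.0}, {}),  # fully rolled out
]

combined = [promql_or(new, old) for new, old in steps]
for point in combined:
    print(point)
```

Every timestamp ends up with a value for both wikis, so the graph does not show a gap at the switch-over.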
Krinkle triaged this task as Medium priority.

It seems the Prometheus version of the Backend Save Timing dashboard only works for the first two rows. The rest of the dashboard has invalid queries that yield no results.

https://grafana-rw.wikimedia.org/d/000000429/backend-save-timing-breakdown

E.g.

Backend save timing p75 by account type

Query:

histogram_quantile(0.75, sum by (le) (le_kubernetes_namespace:mediawiki_WikimediaEvents_editResponseTime_seconds_bucket:rate5m)) * 1000

Legend:

p75.{{user}}

There is no user label on this recording rule, nor on any of the other recording rules. The exception is wiki_user, which aggregates by wiki and user, but we generally don't need the by-wiki breakdown, and that rule lacks the le label required to plot timing metrics.

Idem for p75 by entry point and p75 by page type.

Screenshot 2025-05-12 at 14.44.06.png (266×1 px, 52 KB)

I tried to use plain mediawiki_WikimediaEvents_editResponseTime_seconds_bucket here but this is still timing out for me. Perhaps the label explosion from before is still affecting us? Does something need to be restarted or deleted to "apply" the above cardinality reduction?

Change #1144662 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: add more recording rules around editResponseTime

https://gerrit.wikimedia.org/r/1144662

No restarts needed, though you are correct that if a metric's cardinality exploded at some point and is fine now, querying over that past period will still return the old (many) series. Re: breakdown, I think the patch by @colewhite should do the trick.

Right, but I thought the cardinality fix is now far enough in the past that it should no longer affect us for the default time range, yet it still seems to. Why is that?

I'd prefer to fix that rather than add more recording rules. The cardinality is now low enough that it should just work. Otherwise I'm not sure why I reduced the cardinality.

I imported and modified a dashboard to help us see what's going on. It shows that up until a couple hours ago, we were still at 46k unique timeseries.

I see that the patch to reduce cardinality got merged on 2025-04-23.

It looks like statsd-exporter was restarted at 2025-05-13 15:13Z, but I'm not sure why? Without a restart, TTL would have expired the old metrics after 30d.
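
A quick date calculation using the figures above shows how the restart and the TTL relate. This is a sketch under two assumptions: that the 30d TTL counts from the last update of a series, and that the old label combinations stopped being updated no earlier than the merge date (the actual deployment came later, so the real expiry would be later still).

```python
from datetime import date, timedelta

patch_merged = date(2025, 4, 23)  # cardinality patch merged
exporter_restart = date(2025, 5, 13)  # statsd-exporter restart
ttl = timedelta(days=30)  # metric TTL quoted above

# Earliest date at which the TTL alone could have dropped the old series.
earliest_ttl_expiry = patch_merged + ttl
days_ahead_of_ttl = (earliest_ttl_expiry - exporter_restart).days

print(f"earliest TTL expiry: {earliest_ttl_expiry}")
print(f"restart came {days_ahead_of_ttl} days before that")
```

Under these assumptions the restart cleared the stale series at least ten days before the TTL would have.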

Change #1144662 abandoned by Cwhite:

[operations/puppet@production] prometheus: add more recording rules around editResponseTime

Reason:

in favor of Ib66d70ac8c6e73f4263e6df2dcd353f1526961cc

https://gerrit.wikimedia.org/r/1144662

This works for me now.

I've removed the Graphite-based "break down by (user, entry, page)" panels from Backend Save Timing dashboard and created a new Backend Save Timing breakdown dashboard based on the raw mediawiki_WikimediaEvents_editResponseTime_seconds Prometheus metric.

Screenshot 2025-05-19 at 19.50.16.png (1×2 px, 585 KB)