
Restore support for gauge metrics from MediaWiki PHP (post-Prometheus migration)
Open, Medium, Public

Description

Background

As an application, MediaWiki has numerous metrics that describe its state at a given moment in time. In statsd lingo (shared by both Prometheus and Graphite), these are "gauges".

We used to ingest these into Graphite via a central statsd service (statsd.eqiad.wmnet).

As part of the Prometheus migration (T240685: MediaWiki Prometheus support, T343020: Converting MediaWiki Metrics to StatsLib), we moved to a distributed model with multiple statsd-exporter instances that are scraped by Prometheus. This appears incompatible with gauge metrics, as these require a single representation. Instead, there are now various split-brain reports from different statsd-exporter instances across data centers and server groups.

Problem

As a developer maintaining software, I need a way to instrument code with gauges, and to then plot these gauges to a reasonable degree of accuracy, at a reasonable frequency. Let's say for the sake of argument that these should be visible in Grafana within 1-2 minutes, with at least a 5-min interval for recent data (we generally scrape every ~30 seconds or so, and currently use at least a 1-min interval after all the various layers and proxies, so this should be uncontroversial).

To my knowledge, there is no way to do this post-migration. Affected metrics and dashboards were blindly converted, but they either display "No data" or display data that isn't meaningful in any practical sense. This is because the tools available generally involve max(), min(), sum(), or avg() - which aren't meaningful across echoed scrapes of the same stale data, reported to Grafana as "new", when the data in question is already a complete representation.

E.g.

max (mediawiki_resourceloader_module_transfersize_bytes{component="$component", wiki="$wiki"})

This is hours or weeks out of date, with no indication of whether it is outdated, or when it was collected (because it keeps echoing around the system as "current" data). And, it will randomly become accurate again if/when a new value is reported somewhere in some dc/servergroup that is higher than the last highest, and then become inaccurate again as soon as it goes down.

https://codesearch.wmcloud.org/search/?q=%3EgetGauge&excludeFiles=test

Solutions

Option 1: Solve at query time

Is there a way through Prometheus and Grafana to consolidate these disparate series into one, where for any given interval, the most recently reported value is selected? This would mean that statsd-exporter knows when it last received a message for this metric/label combination (even if the value is the same), and that this timestamp is preserved, sent to Prometheus, and exposed through a public API.
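To make the desired semantics concrete, here is a minimal sketch of the "most recently updated wins" merge this option asks for. Everything here is illustrative: no such last-update timestamp is exposed by statsd-exporter or Prometheus today, which is exactly the gap being described.

```python
# Hypothetical sketch: if each statsd-exporter preserved the wall-clock time
# at which it last *received* a gauge update (as opposed to the scrape time),
# a query layer could merge the per-instance series by picking the value from
# whichever instance saw a genuine update most recently.

def merge_latest(series):
    """Pick the value from the instance with the newest real update.

    series: list of (last_update_ts, value) tuples, one per exporter instance.
    Returns the value whose last_update_ts is greatest, or None if empty.
    """
    if not series:
        return None
    return max(series, key=lambda s: s[0])[1]

# Three exporter instances report the same gauge; two are echoing stale data.
instances = [
    (1714000000, 512_000),  # eqiad mw-web: last real update days ago
    (1714600000, 498_000),  # codfw mw-jobrunner: also stale
    (1714690000, 530_100),  # eqiad mw-cron: updated minutes ago -> wins
]
print(merge_latest(instances))  # -> 530100
```

This is what max()/avg() cannot do: they compare values, whereas the correct selection criterion is recency of the underlying update.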

I found several timestamp-related functions, but these all report a constantly increasing timestamp, as each re-scrape is reported as "new". Understandably.

Option 2: Reduce statsd-exporter TTL

Unless we restart it every 1-4 minutes, this doesn't solve the underlying problem.
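For reference, the upstream statsd-exporter mapping configuration does expose a per-mapping `ttl` field, which is the closest existing knob. A sketch (metric names illustrative; whether a short TTL is acceptable operationally is the open question above):

```yaml
# statsd-exporter mapping config sketch: expire a specific gauge quickly
# so stale values stop being re-exposed, while other metrics keep the
# instance-wide default TTL.
mappings:
  - match: "mediawiki.resourceloader_module_transfersize_bytes.*.*.*"
    name: "mediawiki_resourceloader_module_transfersize_bytes"
    ttl: 2m
    labels:
      wiki: "$1"
      component: "$2"
      module: "$3"
```

Note this is per-mapping, not per-metric-type; a per-type default (all gauges) would still need a patch, as discussed below in the comments.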

Option 3: Pushgateway for mw-cron

This would not solve the issue since:

  1. The code emitting the metric doesn't have to be specific to the maintenance script, but can e.g. also be called from a jobqueue job, or a deferred update after a request to mw-web.
  2. Maintenance scripts can also be run ad-hoc from mwscript-k8s and mwdebug.
  3. Gauges may be used in code that is specific (exclusive) to web requests or jobs.
Option 4: Strip extraneous labels

The high cardinality of MediaWiki Prometheus metrics is significantly amplified by the ~25× label combinations introduced by the statsd-exporter instances themselves:

  • ~25 instances (kubernetes_pod_name="statsd-exporter-prometheus-5696bbc69c-r6zzs", pod_template_hash="5696bbc69c", instance="10.67.145.67:9102")
    • 2 data centers (site: eqiad, codfw)
    • 8 server groups (kubernetes_namespace: mw-web, mw-api-ext, mw-jobrunner, etc)
  • ~3 different k8s hosts
  • × N over time, after restarts and reconfigurations.

Removing these is non-trivial for most metrics, since "counter" metrics really are natively distributed and fragmented today. So unless we use something like a centralised pushgateway, we need to distinguish these instances and sum them together to get an accurate counter total.

Even if Prometheus allowed us to strip these labels somehow and treat them as the same timeseries, it would not solve the problem, because old data continues to be reported as new, with no guarantee that in any given query interval the "last" value selected by Prometheus is the right one.

Reducing cardinality here may be useful for its own sake (e.g. some kind of stable name for a symbolic instance rather than ephemeral names/hashes), but out of scope for this task.

Option 5: Pushgateway for everything?

Functionally this should work, but it remains to be seen whether the Go implementation of statsd-exporter could handle the load. Some kind of load balancing akin to statsite may be needed (https://wikitech.wikimedia.org/wiki/Statsd). The nature of this workload makes it very suitable for splitting up by metric name.

Additionally, it may reduce reliability if metrics need to be transmitted cross-dc. Although this would be no worse than the status quo, and this data isn't canonical Tier 1 data anyway; loss of any individual data point is, and must be, tolerated.

Option 6: Pushgateway for all gauges

Given that only a handful of key metrics are gauges, it may be workable to split up the work right at the source, with a local statsd-exporter for counters and a centralised pushgateway for gauges.

This could be done within MediaWiki core, and would come with the caveat of concentrating the load and cross-dc reliability. The load of only the gauges is many orders of magnitude lower than everything combined, so that may be good enough.
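To show what the gauge push would amount to, here is a sketch that builds a Pushgateway-style PUT (URL path with a grouping key, plus a text exposition body). The job name, labels, and values are illustrative assumptions; a real client would likely use an existing Prometheus client library instead.

```python
# Sketch of a centralised push for gauges, using the Pushgateway text
# exposition format. The grouping key (job + extra labels in the URL path)
# determines which previously-pushed metrics get replaced.

def build_push(job, grouping, metrics):
    """Build (url_path, body) for a Pushgateway PUT.

    job: job name for the grouping key
    grouping: extra grouping-key labels, e.g. {"wiki": "testwiki"}
    metrics: dict of metric name -> (labels_dict, value)
    """
    path = "/metrics/job/" + job
    for k, v in sorted(grouping.items()):
        path += f"/{k}/{v}"
    lines = []
    for name, (labels, value) in metrics.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return path, "\n".join(lines) + "\n"

path, body = build_push(
    "startupregistrystats",           # hypothetical job name
    {"wiki": "testwiki"},
    {"mediawiki_resourceloader_module_transfersize_bytes":
        ({"component": "startup"}, 530100)},
)
print(path)   # /metrics/job/startupregistrystats/wiki/testwiki
print(body)
```

Because each push replaces the previous one for the same grouping key, there is exactly one representation of the gauge, which is the property the distributed statsd-exporter model lacks.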

Option 7: Something else

There is no reason distributed ingestion can't work, but it seems it would require a version of statsd-exporter that understands the concept of a single application running in a distributed fashion, where the metrics relate to the business logic of the application rather than an "instance" of that application. CGI-style applications such as web apps are generally stateless; gauges don't relate to any given "instance" or "server", as these don't exist for the web app itself. The instance/server relates to Apache, or php-fpm, or statsd-exporter, not the web app.

There may be ways to improve this by using a distributed model in combination with a central one, as long as the implementation in question is aware of this concept, and can forward those timestamps accordingly without any added labels.

These would then end up in Prometheus as a single time series, with their order and timestamps naturally working correctly. Do there exist Prometheus clients that make use of this concept?

Prior discussion at T228380: Tech debt: sunsetting of Graphite

I have a question about what to do with gauges in MediaWiki-on-Prometheus.

While Prometheus counters are pretty straight-forward to aggregate, I'm not sure what to do with gauges.

https://grafana.wikimedia.org/d/BvWJlaDWk/startup-manifest-size

Screenshot 2025-04-28 at 21.35.47.png (770×2 px, 112 KB)

[…] the "old" data is still newly scraped every 30 seconds.

[When] a metric has any infrastructure-level labels unrelated to the MediaWiki application, that may alternate or otherwise change over time (i.e. data center, k8s pod template), then we're going to see echos for a while of stale data.

Is there a best practice for how to query these correctly such that when multiple are found, the correct/most recent is returned for any given interval point?

[For example] apply max() as a tie-breaker. This is fine when aggregating/zooming out across multiple valid data points (e.g. zoom out from 5m to 1h and pick the max from that period), however for the above problem it just means data from days or weeks ago effectively overwrites recent data if it happens to be higher.

[…] possibly pushgateway is appropriate if these are batch jobs? Or to switch to use an aggregation-compatible metric type?

@fgiunchedi, any ideas?

Yes in pushgateway you have a "grouping key" say for example job=foo and then can replace all metrics and their labels pushed under that grouping key.

Would statsd-exporter TTL help in this case to avoid metrics lingering around ?

Would statsd-exporter TTL help in this case to avoid metrics lingering around ?

We have the TTL set to 30d across all instances. Related discussion: T359497: StatsD Exporter: gracefully handle metric signature changes

Another example of dashboard and set of metrics that appears to have no way to reliably plot results from Prometheus:

Dashboard: ResourceLoader Bundle size

Code change tracked in T355960: Migrate MediaWiki.resourceloader* metrics to statslib.

Before (Graphite)
MediaWiki.resourceloader_module_transfersize_bytes.$wiki.$component.$module
After (Prometheus)
sum(
    mediawiki_resourceloader_module_transfersize_bytes{component=~"$Component", wiki="$Wiki"}
)

The use of sum() was suggested in the Prometheus draft by @andrea.denisse, but this means that with every relevant statsd-exporter that traffic flows to, the values get multiplied. There doesn't appear to be a reliable way to plot these, since the multiple ingestion pathways will each continue to be re-scraped. While the number of duplicates is somewhat low for mwmaint (i.e. codfw and eqiad), it does change over time (new hostname), and after the k8s-mw-cron migration, multiplication will take off even further, both at any given moment and over time.

I considered changing this to avg(), which would produce a more real-looking timeseries (the values _look_ realistic and are in the right order of magnitude) but remains "fake" and not useful, since it would continue to invisibly incorporate random days/weeks old data (I say random because it isn't consistently biased in any particular direction, or otherwise related to or controlled by MediaWiki).

Example:

Screenshot 2025-05-12 at 20.38.03.png (1×2 px, 180 KB)

The value went up in this case, but it's not clear which one is "correct". Much less how to e.g. reliably alert on gauges like these.

Screenshot 2025-05-12 at 20.38.59.png (1×2 px, 246 KB)

Event Timeline

Krinkle updated the task description. (Show Details)
lmata triaged this task as Medium priority.Jun 4 2025, 2:20 PM
lmata edited projects, added Observability-Metrics; removed SRE Observability.
lmata moved this task from Inbox to Prioritized on the Observability-Metrics board.

Thank you @Krinkle for the very detailed description of the problem and its potential solutions. We (o11y) chatted about this problem and the option I personally like the most is a variation of Option 2: Reduce statsd-exporter TTL.

Namely add the ability to statsd-exporter to override its default TTL on a per-metric-type basis. In other words we can shorten the gauge TTL to be however long we need without impact to other metric types. Bonus points if upstream is interested in said feature, although we have carried patches on top of statsd-exporter in the past and it isn't the end of the world. I'm also wondering what other statsd-exporter users are doing to tackle this exact same problem!

These metrics went missing this past week, but this was not visible in the dashboard, because the gauges keep repeating the last value with no obvious way to detect the lack of new data in Grafana.

See also: T409212: MediaWiki periodic job startupregistrystats-testwiki failed