Background
As an application, MediaWiki has numerous metrics that describe its state at a given moment in time. In statsd lingo (both in Prometheus and in Graphite), these are "gauges".
We used to ingest these into Graphite via a central statsd service (statsd.eqiad.wmnet).
As part of the Prometheus migration (T240685: MediaWiki Prometheus support, T343020: Converting MediaWiki Metrics to StatsLib), we moved to a distributed model with multiple statsd-exporter instances that are scraped by Prometheus. This appears incompatible with gauge metrics, as these require a single representation. Instead, there are now various split-brain reports from different statsd-exporter instances across data centers and server groups.
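To make the split-brain concrete: an instant query for a single logical gauge now returns one series per statsd-exporter instance, each frozen at whatever value that particular instance last happened to receive. A sketch (metric name taken from the example further down; label and sample values illustrative):

```promql
# One logical gauge, many conflicting series (values illustrative):
mediawiki_resourceloader_module_transfersize_bytes

# => {site="eqiad", kubernetes_namespace="mw-web",       instance="10.67.145.67:9102"} 123456
# => {site="codfw", kubernetes_namespace="mw-jobrunner", instance="10.67.146.12:9102"}  98304
```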
Problem
As a developer maintaining software, I need a way to instrument code with gauges, and then to plot these gauges to a reasonable degree of accuracy, at a reasonable frequency. Let's say, for the sake of argument, that these should be visible in Grafana within 1-2 minutes, at a resolution of 5 minutes or finer for recent data (we generally scrape every ~30 seconds or so, and currently use at least a 1-min interval after all the various layers and proxies, so this should be uncontroversial).
To my knowledge, there is no way to do this post-migration. Affected metrics and dashboards were blindly converted, but they either display "No data" or display data that isn't meaningful in any practical sense. This is because the tools available generally involve max(), min(), sum(), or avg(), none of which is meaningful across echoed scrapes of the same stale data, reported to Grafana as "new" when the data in question is already a complete representation.
E.g.
max (mediawiki_resourceloader_module_transfersize_bytes{component="$component", wiki="$wiki"})
This is hours or weeks out of date, with no indication of whether it is outdated, or when it was collected (because it keeps echoing around in the system as "current" data). And it will randomly become accurate again if/when a new value is reported somewhere in some dc/servergroup with a value higher than the last highest, and then become inaccurate again as soon as it goes down.
https://codesearch.wmcloud.org/search/?q=%3EgetGauge&excludeFiles=test
Solutions
Option 1: Solve at query time
Is there a way, through Prometheus and Grafana, to consolidate these disparate series into one, where for any given interval the most recently reported value is selected? This would mean that statsd-exporter knows when it last received a message relating to this metric/label combo (even if the value is the same), and that this timestamp is preserved, sent to Prometheus, and exposed through a public API.
I found several timestamp-related functions, but these all report a constantly increasing timestamp, as they treat each re-scrape as "new". Understandably.
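For instance, a query along these lines (a sketch; `timestamp()` is a standard PromQL function, the grouping labels are illustrative) reports the time each sample was last scraped, not when the gauge was last emitted to statsd-exporter:

```promql
# timestamp() returns each sample's timestamp, which for scraped
# series is the scrape time; it therefore keeps increasing on every
# scrape regardless of whether MediaWiki actually reported a new value.
max by (component, wiki) (
  timestamp(mediawiki_resourceloader_module_transfersize_bytes{component="$component", wiki="$wiki"})
)
```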
Option 2: Reduce statsd-exporter TTL
Unless we restart it every 1-4 minutes, this doesn't solve the underlying problem.
Option 3: Pushgateway for mw-cron
This would not solve the issue since:
- The code emitting the metric doesn't have to be specific to the maintenance script, but can e.g. also be called from a jobqueue job, or a deferred update after a request to mw-web.
- Maintenance scripts can also be run ad-hoc from mwscript-k8s and mwdebug.
- Gauges may be used in code that is specific (exclusive) to web requests or jobs.
Option 4: Strip extraneous labels
The high cardinality of MediaWiki-Prometheus metrics is significantly amplified by the ~25× label combinations from the statsd-exporter instances themselves:
- ~25 instances (kubernetes_pod_name="statsd-exporter-prometheus-5696bbc69c-r6zzs", pod_template_hash="5696bbc69c", instance="10.67.145.67:9102")
- 2 data centers (site: eqiad, codfw)
- 8 server groups (kubernetes_namespace: mw-web, mw-api-ext, mw-jobrunner, etc.)
- ~3 different k8s hosts
- ×N over time, after restarts and reconfigurations.
Removing these is non-trivial for most metrics, since "counter" metrics really are natively distributed and fragmented today. So unless we use something like a centralised pushgateway, we need to keep these series distinct and sum them together to get an accurate counter total.
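For counters, the query-time aggregation that makes the distributed model work looks roughly like this (a sketch; the counter metric name is a placeholder, the label names are the exporter-specific ones listed above):

```promql
# Each exporter instance holds a disjoint share of the total, so
# dropping the exporter-specific labels and summing is correct here.
sum without (kubernetes_pod_name, pod_template_hash, instance) (
  rate(mediawiki_http_requests_total[5m])
)
```

Applying the same `sum` to a gauge would instead add up N echoed copies of the same logical value, which is why stripping labels does not rescue gauges.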
Even if Prometheus allowed us to strip these labels somehow and treat them as the same timeseries, it would not solve the problem, because old data continues to be reported as new, with no guarantee that, in any given query interval, the "last" value selected by Prometheus is the right one.
Reducing cardinality here may be useful for its own sake (e.g. some kind of stable name for a symbolic instance rather than ephemeral names/hashes), but out of scope for this task.
Option 5: Pushgateway for everything?
Functionally this should work, but it remains to be seen whether the Go implementation of statsd-exporter could handle the load. Some kind of load balancing akin to statsite may be needed (https://wikitech.wikimedia.org/wiki/Statsd). The nature of this workload makes it very suitable for splitting up by metric name.
Additionally, it may reduce reliability if these need to be transmitted cross-dc, although this would be no worse than the status quo, and this data isn't canonical Tier 1 data anyway: loss of any individual data point is, and must be, tolerated.
Option 6: Pushgateway for all gauges
Given that only a handful of key metrics are gauges, it may be workable to split up the work right at the source, with local statsd-exporter instances for counters and a centralised pushgateway for gauges.
This could be done within MediaWiki core, and would come with the caveat of concentrating load and depending on cross-dc reliability. The load from only the gauges is many orders of magnitude lower than everything combined, so that may be good enough.
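With gauges pushed to a single central endpoint, the dashboard query would collapse to a plain selector with no cross-instance aggregation needed (a sketch; the `job` label value is hypothetical):

```promql
# One logical gauge, one series; the last pushed value wins.
mediawiki_resourceloader_module_transfersize_bytes{component="$component", wiki="$wiki", job="mediawiki_gauges"}
```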
Option 7: Something else
There is no reason distributed ingestion can't work, but it seems it would require a version of statsd-exporter that understands the concept of a single application running in a distributed fashion, where the metrics relate to the business logic of the application rather than to an "instance" of that application. CGI-style applications such as web apps are generally stateless: gauges don't relate to any given "instance" or "server", as these don't exist; the instance/server relates to Apache, php-fpm, or statsd-exporter, not to the web app itself.
There may be ways to improve this by using a distributed model in combination with a central one, as long as the implementation in question is aware of this concept and can forward those timestamps accordingly, without any added labels.
These would then end up in Prometheus as a single time series, with their order and timestamps naturally working correctly. Do there exist Prometheus clients that make use of this?


