
Enable querying operational (prometheus) metrics via the WMF Data Platform
Open, Needs Triage, Public

Description

The mw.track 'stats.' topic prefix is a MediaWiki JavaScript API used to submit operational metrics from browsers to Prometheus, via a Puppet-managed varnish Kafka producer pipeline called statsv.

Product teams often choose to emit product-related metrics via mw.track and statsv to take advantage of WMF's only supported, publicly available dashboarding tool: Grafana. This metric data is stored in Prometheus, where it can also be used for automated alerting.

When teams do this, the data they emit is not available alongside the 800+ datasets in the WMF Data Lake.

If we were able to query operational metrics via the WMF Data Platform, they would be available for joining with other Data Lake hosted datasets, and for building dashboards in Superset.

Use Cases

For Product Analytics, it would be really helpful if the data was available in the Data Lake and could be accessed/reported on with Superset (which we know how to use).

For various reasons, the Web team decided to use statsv for instrumenting their small-scale experiments, and thus ended up with a Grafana dashboard for the analyst to use (even though we have no expertise with that platform on the team) – ref. T374965#10180679

Possible solutions

Option 1: Produce OpenMetric/Prometheus compatible events via mw.track

As originally proposed in {T355837#10230113}, it was suggested that statsv.py use the OpenMetric event as the canonical data, and then produce metrics to Prometheus from it. This is still an option, but subsequent comments gave reasons why this was not desirable.

Instead, a new mw.trackSubscribe( 'stats.' ) JS handler function could do the following:

  • convert the mw.track('stats.*', ...) params into an OpenTelemetry/Prometheus compatible JSON event, with a data model something like this POC event schema
  • sendBeacon POST /beacon/v2/event (possibly using the usual EventLogging JS API?)
  • /beacon/v2/event -> eventgate-analytics-external -> kafka
  • events will then automatically make it into a Hive table.
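The param-to-event conversion step could be sketched roughly as follows. This is a Python sketch for illustration only (the real handler would be client-side JavaScript), and the event field names and schema/stream names are hypothetical, not the actual POC schema:

```python
from datetime import datetime, timezone

def stats_params_to_event(topic, value=1, labels=None):
    """Map mw.track('stats.<name>', ...) params onto a hypothetical
    OpenTelemetry/Prometheus-style JSON event (field names are illustrative)."""
    name = topic[len("stats."):]  # strip the 'stats.' topic prefix
    return {
        "$schema": "/analytics/metrics/1.0.0",      # hypothetical schema URI
        "meta": {"stream": "analytics.metrics"},    # hypothetical stream name
        "dt": datetime.now(timezone.utc).isoformat(),
        "name": name,
        "value": value,
        "labels": labels or {},
    }

event = stats_params_to_event("stats.mediawiki_example_total", 1, {"wiki": "enwiki"})
print(event["name"], event["value"])  # -> mediawiki_example_total 1
```

The resulting JSON object is what would be POSTed via sendBeacon to /beacon/v2/event.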

Pros

Cons

  • Emits the data twice from browsers: once to /beacon/statsv (dogstatsd format) and once to /beacon/v2/events (JSON event format).
  • Compatibility: Does not support HTML- and CSS-based (non-JavaScript) instrumentation. EventGate does not support HTTP GET. (NOTE: If GET support is a requirement, it could be added to EventGate.)
Option 2: Produce OpenTelemetry event only

The first con could be overcome by relying on the OpenTelemetry event format as the only data emitted, and transforming it in statsv to produce to Prometheus. This was the original proposal in {T355837#10230113}.

Pros

  • data is only emitted once from clients
  • removes CDN (varnishkafka) logic

Cons

  • Interferes with 'Tier 1 telemetry'
Option 3: mw.track('metric.*', ...) topic prefix

Another alternative would be to provide a separate mw.track topic prefix for non 'tier 1 telemetry' metrics, e.g. mw.track('metric.*', ...). Product teams could then use this mechanism instead of the 'stats.' topic prefix when they emit product metric telemetry data.

A service similar to statsv.py would consume these metric events and produce them to Prometheus.
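The core transform in such a consumer could look roughly like this. A minimal sketch, assuming JSON metric events with hypothetical name/value/labels fields, rendered as dogstatsd counter lines:

```python
def event_to_dogstatsd(event):
    """Render a JSON metric event as a dogstatsd counter line,
    e.g. 'name:1|c|#k:v' (the event field names are hypothetical)."""
    tags = ",".join(f"{k}:{v}" for k, v in sorted(event.get("labels", {}).items()))
    line = f"{event['name']}:{event['value']}|c"
    return f"{line}|#{tags}" if tags else line

print(event_to_dogstatsd({"name": "edits_total", "value": 1, "labels": {"wiki": "enwiki"}}))
# -> edits_total:1|c|#wiki:enwiki
```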

Pros

  • data is only emitted once from clients

Cons

  • Existing usages of mw.track('stats.*', ...) would not be available in the Data Lake.
  • A new service (if not in statsv.py) to maintain
Option 4: dogstatsd kafka topic event transformer

Originally proposed in {T355837#10416983}.

After T355837, mw.track('stats.*', ...) ends up producing dogstatsd formatted metrics to Kafka.

Something (statsv.py or another service) could consume this metric data, transform it into OpenMetric/Prometheus compatible events, and produce them to one or more Event Platform streams. These streams would then be automatically ingested into Data Lake tables.
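The dogstatsd-to-event direction could be sketched like this. A minimal parse of the dogstatsd datagram shape (name:value|type|#tag:val,...); the output event field names are illustrative, not a real schema:

```python
import re

DOGSTATSD_RE = re.compile(
    r"^(?P<name>[^:]+):(?P<value>[^|]+)\|(?P<type>\w+)(?:\|#(?P<tags>.*))?$"
)

def dogstatsd_to_event(line):
    """Parse a dogstatsd line like 'edits_total:1|c|#wiki:enwiki' into a
    Prometheus/OpenMetric-style JSON event (output field names are illustrative)."""
    m = DOGSTATSD_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable dogstatsd line: {line!r}")
    labels = {}
    if m.group("tags"):
        labels = dict(tag.split(":", 1) for tag in m.group("tags").split(","))
    return {
        "name": m.group("name"),
        # map common dogstatsd type codes to metric type names
        "type": {"c": "counter", "g": "gauge", "ms": "timer"}.get(
            m.group("type"), m.group("type")
        ),
        "value": float(m.group("value")),
        "labels": labels,
    }

print(dogstatsd_to_event("edits_total:1|c|#wiki:enwiki"))
```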

Pros

  • metric data is only emitted once from browsers.
  • Supports HTML and CSS-based instrumentation without EventGate modifications

Cons

  • A new service (if not in statsv.py) to maintain.
Option 5: Prometheus metric kafka topic event transformer

Originally proposed in {T390328#10854337}

Prometheus has a remote_write configuration that allows it to forward (samples of) metrics elsewhere for long-term storage. It has both HTTP (protobuf) and Kafka (JSON) support.
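For example, a remote_write block with write_relabel_configs could forward only a chosen subset of metrics (the endpoint URL and metric regex below are placeholders):

```yaml
remote_write:
  - url: http://metrics-bridge.example.org/receive   # placeholder endpoint
    write_relabel_configs:
      # keep only MediaWiki metrics; everything else is dropped
      - source_labels: [__name__]
        regex: 'mediawiki_.*'
        action: keep
```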

Once the data is in Kafka, something could consume the metric data, transform it into OpenMetric/Prometheus compatible events, and produce them to one or more Event Platform streams. These streams would then be automatically ingested into Data Lake tables.
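Assuming the Kafka adapter emits JSON samples shaped roughly like {"name", "timestamp", "value", "labels"} (as the commonly used prometheus-kafka-adapter does), the transform could be sketched as:

```python
import json

def remote_write_to_event(payload):
    """Transform a Prometheus Kafka adapter JSON sample into an
    Event Platform style event (output field names are illustrative)."""
    sample = json.loads(payload)
    labels = dict(sample.get("labels", {}))
    labels.pop("__name__", None)  # redundant with the metric name
    return {
        "dt": sample["timestamp"],
        "name": sample["name"],
        "value": float(sample["value"]),
        "labels": labels,
    }

payload = (
    '{"timestamp": "2025-06-01T00:00:00Z", "value": "1.5", '
    '"name": "up", "labels": {"__name__": "up", "instance": "mw1001"}}'
)
print(remote_write_to_event(payload)["name"])  # -> up
```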

Pros

  • Any Prometheus metric could be collected, not just those emitted via mw.track

Cons

  • A new service to (if not in statsv.py) to maintain.
Option 6: T347430: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query

This allows Presto to query metrics directly in Prometheus using SQL.
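A sketch of what using the connector might look like, assuming the Trino/Presto Prometheus connector convention of exposing each metric as a table with labels (a map), timestamp, and value columns; the catalog/table names in the query string are assumptions:

```python
# Hypothetical query against the Prometheus connector (catalog/table names assumed):
QUERY = """
SELECT labels['instance'] AS instance, timestamp, value
FROM prometheus.default.up
WHERE timestamp > now() - INTERVAL '1' HOUR
"""

def mean_by_label(rows):
    """Average the value column per label value, given rows shaped like the
    connector's (label, timestamp, value) output."""
    totals, counts = {}, {}
    for label, _ts, value in rows:
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {k: totals[k] / counts[k] for k in totals}

rows = [("mw1001", "t0", 1.0), ("mw1001", "t1", 0.0), ("mw1002", "t0", 1.0)]
print(mean_by_label(rows))  # -> {'mw1001': 0.5, 'mw1002': 1.0}
```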

Pros

  • Any Prometheus metric can be queried, not just those emitted via mw.track
  • No extra services or data products to maintain
  • As of 2025-06, already done!

Cons

  • Analytical queries could overload Prometheus
  • Operational metrics are not captured as event data; they cannot be used in stream processing or with distributed Data Lake query engines (like Spark).
  • Presto is great for working with small/medium-ish datasets, but it will not be as good for joining large datasets (e.g. webrequest) together with operational metrics.

Event Timeline

Ottomata renamed this task from Produce MediaWiki client emitted operational metrics into Event Platform to Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL. — Mar 28 2025, 8:22 PM

CC @Kappakayala let's talk about this during hypothesis writing for next fiscal year.

lmata moved this task from Inbox to Radar on the observability board.

Another option might be to ingest Prometheus metrics, from Prometheus. Compared to options 1-3, this approach would avoid the would-be analogous issues from T120242 and T249745 by not trying to capture and send everything in the critical path, which inherently adds client-side overhead and creates unavoidable server-side discrepancies given no atomicity between the two outputs.

This is inspired by the "outbox" proposal from Eric and @Ottomata at T120242.

You'd be able to copy, ingest, reconcile, retry at any level and any time, without data loss, and with perfect eventual consistency.

It also means the solution would work for all Prometheus metrics. Remember: Only a tiny subset of MediaWiki Prometheus metrics originate in client-side JavaScript. The vast majority of MediaWiki Prometheus metrics are sent by MediaWiki PHP and don't come through mw.track JS or Statsv at all.

Ingesting everything from prod may be a bit much (low signal-to-noise), so you may want a config file that lists metric names of interest. We already configure Prometheus with various "recording rules" that forward and pre-process a subset of raw metrics for rapid access; this approach would be a bit like that. It is also similar to the refine_sanitize job (event_sanitized_analytics_allowlist.yaml), in that it moves a selected set of data from one place to another for long-term and wider access.
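Such a config file might be sketched as follows. The file name, keys, and metric names here are entirely hypothetical, loosely modeled on the recording-rules / refine_sanitize allowlist pattern:

```yaml
# hypothetical allowlist: metrics of interest for Data Lake ingestion
metrics:
  - mediawiki_edit_attempts_total
  - mediawiki_page_load_seconds
```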

Beyond MediaWiki - this approach would instantly gain you access to metrics from any production service. Once you're on the other side of Prometheus, there's nothing special about MediaWiki. You can ingest metrics from any service, without someone needing to write instrumentation. Plus, ingestion could even be self-serviced by anyone who's interested in the metric, in the same way that anyone can create a dashboard in Grafana and plot something from Prometheus (i.e. without technical changes that would require approval from a service owner).

Hm! Yeah maybe! Do you know if Prometheus supports something like that?

Feel free to edit the task and add more potential options!

What comes to mind is Prometheus' support for remote write: Prometheus pushes a subset of metrics to an HTTP endpoint with a protobuf payload. In that case we would use relabeling to only send a subset of ingested metrics (e.g. mw); changing said subset wouldn't be fully self-service (i.e. a puppet patch is required). To be clear, I'm not endorsing this solution over the others in the task description, but it is certainly possible.

Interesting. There is a Kafka adapter too!

We'd have to transform the Prometheus Kafka adapter JSON format into something Event Platform compatible, similar to Option 4.

I like the fact that any configured metric in Prometheus could be captured, not just the ones emitted by MW via statsv. I'll edit the task with this as Option 5.

Ottomata added subscribers: BTullis, CDanis.

I just added Option 6: T347430: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query to the task description.

I think if we can promote this and show that it works for some use cases in question, this solution is probably sufficient! And it is already done! (Thank you @BTullis, @CDanis, @fgiunchedi !)

Ottomata renamed this task from Produce MediaWiki client emitted operational metrics into Event Platform, allowing them to be queried in the WMF Data Lake with SQL to Enable querying operational (prometheus) metrics via the WMF Data Platform. — Jul 15 2025, 8:05 PM
Ottomata updated the task description.

I've updated the task name and description to avoid presupposing an Event Platform based solution. There are pros to getting this data into the Data Platform, but it is quite likely that just making it queryable alongside other datasets is enough!