
Add Prometheus support to statsd.js via mw.track()
Open, Needs TriagePublic

Description

mw.track() is a frontend JS feature that collects metrics produced in browsers and forwards them to Graphite.

This task is complete when there is a strategy in place for migrating metrics produced via mw.track() into Prometheus.

T354907#9480808

Event Timeline

Krinkle subscribed.

Shared context

mw.track() is a low-level JavaScript utility for connecting any two application components to each other via a topic string (loose coupling, no runtime dependency, no load order concern). The data is kept in an in-memory list for the duration of a page view, and can be consumed at any time by zero or more handlers via mw.trackSubscribe(topic, function).

It's about 20 lines of code, contains no "business logic", and is not where migration logic or facilities would reside (it also emits no stats, metrics, or events by itself).

mw.track is used, for example, to propagate errors from user-interface widgets, to expose event objects (some of which EventLogging then consumes and sends to EventGate/Kafka), and for various abstract APIs within the JavaScript runtime, some of which may eventually produce one or more statistics.
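As a rough illustration of the mechanism described above (this is a simplified sketch, not the actual MediaWiki implementation; names and details are illustrative only):

```javascript
// Minimal sketch of the mw.track()/mw.trackSubscribe() pattern:
// an in-memory queue plus prefix-matched handlers, with replay so
// that producers and consumers have no load-order dependency.
const queue = [];    // kept in memory for the duration of the page view
const handlers = {}; // topic prefix -> array of subscriber callbacks

function track(topic, data) {
    queue.push({ topic, data });
    for (const prefix in handlers) {
        if (topic.indexOf(prefix) === 0) {
            handlers[prefix].forEach((fn) => fn(topic, data));
        }
    }
}

function trackSubscribe(prefix, fn) {
    (handlers[prefix] = handlers[prefix] || []).push(fn);
    // Replay anything already queued, so a late-loading consumer
    // still sees events fired before it subscribed.
    queue.forEach((item) => {
        if (item.topic.indexOf(prefix) === 0) {
            fn(item.topic, item.data);
        }
    });
}
```

The replay step is what removes the load-order concern: a producer can fire events before any consumer module has loaded, and the consumer catches up when it subscribes.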

WikimediaEvents/statsd.js

One user of mw.track() is the WikimediaEvents extension (specifically statsd.js), which comprises about 30 lines of code defining the counter.* and timing.* topics via mw.trackSubscribe(); the handler turns these into an HTTP request to a /beacon/statsv endpoint. You can think of this as the JavaScript analog to the PHP MediaWiki-libs-Stats, except that statsd.js is much simpler: an optional extension (WikimediaEvents) that directly dispatches an HTTP request.

https://gerrit.wikimedia.org/g/mediawiki/extensions/WikimediaEvents/+/72792ef6292bfa664fdbe3b72d9346e4d495933f/modules/ext.wikimediaEvents/statsd.js#31
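In rough outline, the handler behind that link does something like the following (a simplified sketch, not the actual extension code; the `c`/`ms` suffixes follow the statsv query format shown later in this thread):

```javascript
// Sketch of how a statsd.js-style handler could turn counter.* and
// timing.* topics into a /beacon/statsv query string. Illustrative only.
function buildStatsvQuery(topic, value) {
    if (topic.indexOf('counter.') === 0) {
        // counter.MediaWiki.foo -> MediaWiki.foo=<n>c
        return topic.slice('counter.'.length) + '=' + value + 'c';
    }
    if (topic.indexOf('timing.') === 0) {
        // timing.MediaWiki.foo -> MediaWiki.foo=<ms>ms
        return topic.slice('timing.'.length) + '=' + Math.round(value) + 'ms';
    }
    return null; // not a stats topic
}

// In the browser this would be dispatched via navigator.sendBeacon()
// or an image request; here we just build the URL.
function beaconUrl(topic, value) {
    const q = buildStatsvQuery(topic, value);
    return q === null ? null : '/beacon/statsv?' + q;
}
```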

/beacon/statsv

https://wikitech.wikimedia.org/wiki/Graphite#statsv

Requests to this endpoint are in turn captured by varnishkafka and eventually consumed by statsv.py, which runs on the Webperf host (now operated by SRE Observability). It consumes from Kafka and forwards to Statsd/Graphite.

In 2017 we started discussing at T180105 how this might work in a secure way with Prometheus, such that we don't allow the world to create, muddy, or pollute data in unrelated metrics. I imagine this could take the form of a dedicated Prometheus instance (prometheus/ext, this part is already completed by SRE) and something else to avoid abuse causing visible conflicts in Grafana/Thanos. Perhaps something as simple as an enforced prefix like mwjs_.

Migration

As with the migration of server-side stats from MediaWiki PHP, I believe the JS side will be handled per-component. This task is specifically about ensuring we have an alternative in the platform, not about migrating the individual stats themselves.

Todo - Pending questions

  • API design for Prometheus-compatible stats from statsd.js, i.e. what topic name(s) and primitive data structure to use in statsd.js.
  • Transport layer. Do we stick with the current and minimal varnishkafka transport, with (some version of) statsv.py? If yes, do we add something like /beacon/openmetrics or /beacon/dogstatsd and handle it alongside statsv? Or do we want something different?
  • Export. Do we want statsv.py to be its own Prometheus producer, like navtiming.py, or do we want it to merely pass things on (after prefix enforcement) to a statsd-exporter? Or something different?
Krinkle renamed this task from Bring mw.track() metrics into Prometheus to Add Prometheus support to statsd.js via mw.track().Mar 4 2024, 5:16 PM
Krinkle moved this task from Uncategorized to statsv on the Grafana board.

Thank you for the detailed write-up on this @Krinkle! See below for my take:

  • In the context of Graphite deprecation, we have the statsv -> statsd-exporter -> ext prometheus chain in place, e.g. mw_js_deprecated_functions_count_total{cluster="webperf", function="GetCookie", instance="webperf1003:9112", job="statsv", prometheus="ext", site="eqiad"}. The full list of metrics as of today is at P58468. The prefix isn't fully enforced for all metrics, though that should be easy to fix.
    • This is only an interim solution IMHO, since it requires us to maintain statsd-exporter mappings at hieradata/role/common/webperf.yaml to e.g. drop some metrics or rename/adjust as needed.

Of course the proper solution is to have statsd.js / mw.track support for Prometheus metrics, as you mentioned. I'll try to answer some of the open questions from my POV (some braindump ahead).

  • API: in general we'll need to have: metric type, metric name, labels and value
    • We'll need to decide how to encode the above into a query string parameter, and then decode it the same way on the server side
    • Gauge/Counter types are straightforward in terms of requirements.
    • For timings we'll have to decide between Summaries and Histograms, see also Prometheus documentation on their tradeoffs
  • Transport layer: AFAICT mw.track essentially re-uses eventlogging and then statsv.py taps into it by reading the statsv topic. I think we can keep this architecture, i.e. statsv is part of eventlogging and we add e.g. /beacon/openmetrics keeping everything else the same
  • Export: With Prometheus metrics in their own topic we can have a statsv.py version that reads said topic, decodes metrics from it and creates prometheus metrics as needed, adding observations for different metric types (gauge, counter, timer, etc)
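To make the first bullet concrete, here is one hypothetical way to encode the four elements (metric type, name, labels, value) into a single query-string value. The wire format here is invented purely for illustration, not a decided design:

```javascript
// Hypothetical client-side encoder for the four elements listed above.
// The pipe-delimited format is illustrative only.
function encodeMetric(type, name, labels, value) {
    const labelStr = Object.keys(labels)
        .map((k) => k + ':' + encodeURIComponent(labels[k]))
        .join(',');
    return [type, name, labelStr, value].join('|');
}

encodeMetric('c', 'mediawiki_example_this', { x: 'foo' }, 1);
// -> 'c|mediawiki_example_this|x:foo|1'
```

The server side would then split on the same delimiters and validate type, name, and labels before creating any Prometheus metric.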

HTH

  • Transport layer. Do we stick with the current and minimal varnishkafka transport, with (some version of) statsv.py? If yes, do we add something like /beacon/openmetrics or /beacon/dogstatsd and handle it alongside statsv? Or do we want something different?

Using the path as a signal to statsv.py which decoder to use feels reasonable. Effectively: /beacon/<format> which statsv.py (or something) reads and decides which decoder to use and/or what to do next.

  • Export. Do we want statsv.py to be its own Prometheus producer, like navtiming.py, or do we want it to merely pass things on (after prefix enforcement) to a statsd-exporter? Or something different?

We'd have more control and a simpler pipeline if statsv.py did the decoding and Prometheus exporting itself; however, it saves us a fair bit of work if we "outsource" the Prometheus client and state management to a separate exporter. I see pros and cons to both solutions.

I did a bit of looking into OpenMetrics. AFAICT, it's a wire format that would be produced by a /metrics endpoint and decoded by a Prometheus server. AFAICT, it's not intended to be emitted as events.

Have I missed something?

For awareness, see also https://phabricator.wikimedia.org/T359178#9640223 re: statsv in the context of varnishkafka deprecation/removal.

I did a bit of looking into OpenMetrics. AFAICT, it's a wire format that would be produced by a /metrics endpoint and decoded by a Prometheus server. […]

Indeed. We need two things:

  1. An agreed-upon JS signature for mw.track( '<topic_TBD>', string|array|Object data ).
  2. A string format for HTTP beacon query strings.

This JS signature could be something like this:

// Old
mw.track( 'counter.MediaWiki.example_this.foo.bar', 1 );
mw.track( 'timing.MediaWiki.example_that.foo', 42 );

// New?
mw.track( 'stats_counter.mediawiki_example_this', [ 1, { x: 'foo', y: 'bar' } ] );
mw.track( 'stats_timing.mediawiki_example_that', [ 42, { x: 'foo' } ] );

The idea being:

  • Given we need a new topic for the breaking change, prefix it with something like stats_ while at it.
  • Squeeze value and optional labels into an array to keep the format as simple and short as possible.
  • The format is easy to map to either statsd, dogstatsd or Prometheus as desired.
  • For WMF, we'd format it in some way that we ingest via varnishkafka to statsv.py to export to Prometheus.
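To illustrate the "easy to map" bullet above (a sketch with invented helper names; the exact syntax of each target format is only approximated here), the [ value, { labels } ] payload translates mechanically to either backend:

```javascript
// Hypothetical mappings of the proposed [ value, { labels } ] payload.
// Illustrative only; real serialization would need validation/escaping.
function toDogstatsd(name, value, labels) {
    // dogstatsd-style counter line: name:value|c|#tag:val,tag:val
    const tags = Object.keys(labels)
        .map((k) => k + ':' + labels[k]).join(',');
    return name + ':' + value + '|c' + (tags ? '|#' + tags : '');
}

function toPrometheusSample(name, value, labels) {
    // Prometheus exposition-style sample: name{label="val",...} value
    const pairs = Object.keys(labels)
        .map((k) => k + '="' + labels[k] + '"').join(',');
    return name + (pairs ? '{' + pairs + '}' : '') + ' ' + value;
}
```

Either mapping is a few lines because the payload already separates name, value, and labels, which is the point of the proposed structure.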

Optionally, one could add an abstraction layer on top. However, its code would have to be very tiny to be shippable in the critical path. The reason we use mw.track() in the first place is that producing stats is then dependency-free. The code that wires it up and sends it to /beacon can load asynchronously with low priority; that's where we can (within reason) do a lot more. That code is still loaded on every pageview as well, but it loads later and is strongly cached behind its own versioned URL, instead of being baked into the base payload.

mw.stats.counter( name, increment, labels = {} );
mw.stats.counter( 'mediawiki_example_this', 1, { x: 'foo', y: 'bar' } );

The HTTP query string format will need to be something fairly simple that we can trivially create in JavaScript (WikimediaEvents/statsd.js, where we consume the subset of mw.track topics that relate to stats). And then encode, fit within, and transmit over an HTTP beacons' query string (so ideally fairly short and without chars that require verbose encoding). And then easily decoded in statsv.py (and potentially discard invalid stuff), and turn into calls to the prometheus_client object buffer in the Python process, which then offers it up for scraping in the real Prometheus/OpenMetrics format.

The OpenMetrics line format might be suitable for the transport as well, but it would indeed be arbitrary. It wouldn't actually be passed to Prometheus by statsv.py as-is. We'd only pretend to :)

We currently use this format:

StatsV Today
/beacon/statsv?MediaWiki.example_this.foo.bar=1c&MediaWiki.example_that.foo…
OpenMetrics-ish
/beacon/smth?mediawiki_example_this{x="foo",y="bar"} 1\nmediawiki_example_that
# actual fetch(), confirm via browser DevTools/Network/Request/Headers 
# /beacon/smth?mediawiki_example_this{x=%22foo%22,y=%22bar%22}%201%0Amediawiki_example_that

Both double quotes and spaces are illegal in URLs and thus forcibly encoded, even if you set a raw query string. Parsing quoted values would also require a less trivial parser on the other end. Using this protocol might give the wrong impression, as we'd presumably support only a very narrow portion of that spec. Note that line breaks are legal in query strings; they can be encoded via encodeURIComponent('\n') === '%0A'.

Something else?
/beacon/smth?mediawiki_example_this=1;x=foo;y=bar&mediawiki_example_that

This would be fairly trivial to parse (split by ampersand, then by semicolon and equal sign). The user-generated values could be encoded via encodeURIComponent, which naturally percent-encodes any actual = equal sign or ; semicolon, if there were one.
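A sketch of both ends of that format (illustrative only; /beacon/smth and the exact grammar are placeholders, and real code would need to validate names and handle entries without a value):

```javascript
// Illustrative encoder/decoder for the semicolon format sketched above:
// name=value;label=val;label=val, entries joined by ampersand.
function encodeQuery(metrics) {
    // metrics: [ { name, value, labels } ]
    return metrics.map(({ name, value, labels }) => {
        const parts = [name + '=' + value];
        for (const k of Object.keys(labels)) {
            // Percent-encode user values, so literal '=' or ';' can't
            // break the grammar.
            parts.push(k + '=' + encodeURIComponent(labels[k]));
        }
        return parts.join(';');
    }).join('&');
}

function decodeQuery(query) {
    // Split by ampersand, then by semicolon and equal sign.
    return query.split('&').map((entry) => {
        const [first, ...rest] = entry.split(';');
        const [name, value] = first.split('=');
        const labels = {};
        for (const pair of rest) {
            const [k, v] = pair.split('=');
            labels[k] = decodeURIComponent(v);
        }
        return { name, value: Number(value), labels };
    });
}
```

The decoder is deliberately dumb: three split() calls and a percent-decode, which is about as simple as a statsv.py-side parser could get.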

Feel free to use as starting point :)