
RFC: Better interface for generating metrics in MediaWiki
Open, MediumPublic

Description

  • Affected components: MediaWiki core and extensions.
  • Engineer for initial implementation: @colewhite (WMF SRE Foundations)
  • Code stewards: TBD.

Motivation

The metrics interface in MediaWiki is outsourced to a StatsD-specific library, with which it is heavily integrated. This arrangement emits StatsD metrics well enough and has served us well for quite a while, but the current free-for-all approach has some limitations:

  1. There is no room to leverage other metrics backends or protocols.
  2. There is no clear way to infuse orderliness or standards over what metrics are currently generated.
  3. There is little to no documentation describing the metrics MW+Extensions already generate and what they are intended to show.

The Observability team is pushing to deprecate Graphite/StatsD in favor of Prometheus. Prior attempts to use the existing model with tools available have proven difficult and are regarded as unsustainable.

Requirements

Possibly incomplete.

The system overall (from func calls in MediaWiki PHP all the way to being received by Prometheus):

  • Is overall sustainable with no variable or dynamic components between MW and Prometheus that need to be kept up to date with how or what metrics are used.

The MW side of things:

  • Abstract away the backend implementation details of StatsD and Prometheus.
  • Support both StatsD and Prometheus (WMF will use Prometheus).
  • Remain easy to configure (toggle on via one or two simple configuration variables).
  • Remain buffered and emitted post-send.
  • Introduce no new dependencies.

Developer-facing features to keep:

  • Free-form metric names
  • Metric type: Counter.
  • Metric type: Timing.
  • Metric type: Gauge (?)

Developer-facing features to be added:

  • Metrics can be given key-value tags (separate from the canonical metric name).
  • Metrics are documented in a common way and can be browsed/discovered.
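
To make the requirements above more concrete, here is a minimal sketch of what such an interface might look like. It is written in Python purely for illustration (MediaWiki itself is PHP), and every name in it (`Sample`, `MetricsBuffer`, `counter`, `timing`, `gauge`, `flush`) is hypothetical rather than taken from the WIP patch. It shows free-form names, the three metric types, tags kept separate from the name, and buffering until an explicit flush.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One recorded observation; tags stay separate from the metric name."""
    name: str
    kind: str  # "counter", "timing" or "gauge"
    value: float
    tags: Dict[str, str] = field(default_factory=dict)


class MetricsBuffer:
    """Hypothetical facade: callers never see the backend's wire format."""

    def __init__(self, emit: Callable[[List[Sample]], None]) -> None:
        self._emit = emit  # backend-specific sender, injected by configuration
        self._pending: List[Sample] = []

    def counter(self, name: str, value: float = 1, **tags: str) -> None:
        self._pending.append(Sample(name, "counter", value, dict(tags)))

    def timing(self, name: str, millis: float, **tags: str) -> None:
        self._pending.append(Sample(name, "timing", millis, dict(tags)))

    def gauge(self, name: str, value: float, **tags: str) -> None:
        self._pending.append(Sample(name, "gauge", value, dict(tags)))

    def flush(self) -> None:
        """Hand all buffered samples to the backend in one call."""
        pending, self._pending = self._pending, []
        if pending:
            self._emit(pending)


# Usage: nothing reaches the backend until flush() is called,
# which would happen after the response has been sent.
received: List[Sample] = []
metrics = MetricsBuffer(emit=received.extend)
metrics.counter("edits_total", wiki="enwiki")
metrics.timing("parse_duration", 12.5, wiki="enwiki")
metrics.flush()
```

Injecting the sender keeps the calling code identical whether the configured backend speaks StatsD or something Prometheus-friendly.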

Exploration

Open questions

  • How will MediaWiki talk to Prometheus? Given that Prometheus is poll-based, there needs to be some intermediary service that MW pushes to and Prometheus pulls from (e.g. StatsD, Redis, etc.).
  • How will key-value tags be translated into a flat metric name for the plain StatsD use case? Do we need to?
  • How will we document the metrics? (e.g. Doxygen, Markdown, etc.)

Research

The initial proposal was to insert statsd-exporter between MW and StatsD and leverage mapping rules to generate appropriate Prometheus metrics. It was noted that the mapping rules would be difficult to maintain and would introduce a circular dependency on a sidecar service managed by Puppet.
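
To illustrate why hand-maintained mapping rules become a burden, here is a simplified imitation of glob-style mapping in Python. It is not statsd-exporter's actual matching engine or configuration format, and the rule table and metric names are made up for the example. The point is that every new metric shape needs a matching rule, otherwise it falls through untranslated.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical rules: (glob over the dotted StatsD name,
# Prometheus metric name, one label name per "*" wildcard).
RULES: List[Tuple[str, str, List[str]]] = [
    ("MediaWiki.api.*.executeTiming", "mediawiki_api_execute_timing", ["module"]),
    ("MediaWiki.jobqueue.*.*", "mediawiki_jobqueue_events", ["action", "job_type"]),
]


def translate(statsd_name: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (prometheus_name, labels) for the first matching rule, else None."""
    for glob, prom_name, label_names in RULES:
        # Each "*" matches exactly one dot-separated name component.
        pattern = "^" + re.escape(glob).replace(r"\*", r"([^.]+)") + "$"
        match = re.match(pattern, statsd_name)
        if match:
            return prom_name, dict(zip(label_names, match.groups()))
    return None  # no rule yet: someone has to add one and redeploy
```

A metric added by an extension without a corresponding rule returns `None` here, which is the maintenance gap described above.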

After that, we considered what it would take to have MW (and extensions) maintain their own statsd-exporter configuration and coalesce them on deploy. This seemed a fragile, error-prone option and had the drawback of putting an unnecessary burden on developers for a single backend solution.

After that, we explored what adopting an existing Prometheus-specific library would entail. Current options depend on Redis or disk, or make heavy use of APC. None of these options seemed great in terms of reliability, resource utilization, or the current state of library development.

After that, we went back to statsd-exporter and found support for DogStatsD, a StatsD extension that adds key:value tags to metrics and uses the same UDP transport mechanism. This option has no need for a cross-request persistent backend. This is the approach the current implementation demonstrates, but it does have a few drawbacks:

  1. This solution requires a sidecar for translation to Prometheus.
  2. In order for statsd-exporter to automatically generate meaningful Prometheus metrics, the StatsD metrics namespace has to be tightly controlled.
  3. An extra deploy step of restarting the sidecar is strongly recommended.

NOTE: The demonstration implementation does not handle the requirement to document the metrics being generated. Discussion on how to address this point in a sustainable way is welcome and requested.
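
For readers unfamiliar with the wire formats, the sketch below renders one sample both as a DogStatsD line (tags after `|#`) and as a plain StatsD line with the tags folded into the name. It is Python for illustration only. The function names are hypothetical, and the flattening scheme (appending tag values in sorted key order) is just one possible answer to the open question above, not a decision made in this RFC.

```python
from typing import Dict

# StatsD type codes for the three metric types discussed in this RFC.
_TYPE_CODES = {"counter": "c", "timing": "ms", "gauge": "g"}


def to_dogstatsd(name: str, kind: str, value: float, tags: Dict[str, str]) -> str:
    """Render 'name:value|type|#key:value,...' with tags kept as tags."""
    line = f"{name}:{value}|{_TYPE_CODES[kind]}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def to_flat_statsd(name: str, kind: str, value: float, tags: Dict[str, str]) -> str:
    """Assumed scheme: append tag values to the name, ordered by tag key."""
    suffix = "".join(f".{v}" for _, v in sorted(tags.items()))
    return f"{name}{suffix}:{value}|{_TYPE_CODES[kind]}"


tags = {"wiki": "enwiki", "user_type": "anon"}
dog_line = to_dogstatsd("edits", "counter", 1, tags)     # edits:1|c|#user_type:anon,wiki:enwiki
flat_line = to_flat_statsd("edits", "counter", 1, tags)  # edits.anon.enwiki:1|c
```

Note that the flat form loses the tag keys, so the order of name components has to stay stable for existing dashboards to keep working.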

Event Timeline

Krinkle added a subscriber: Krinkle.

I've added some requirements that I believe were implied but worth making explicit:

  • Support both StatsD and Prometheus (WMF will use Prometheus).
  • Remain easy to configure (toggle on via one or two simple configuration variables).
  • Free-form metric names.
  • Metric types: Counter and Timing. (Do we want to keep Gauge as well?)
  • (New) Metrics can be given key-value tags (separate from the canonical metric name).
  • (New) Metrics are documented in a common way and can be browsed/discovered.

@colewhite Does that cover things? Anything unsure or still missing?

@colewhite Per T249164#6041408, check if I missed any requirements. Also for this phase, it's time to look/decide who would look after this component in the long term (e.g. maintenance, triage tasks, address regressions/release blockers).

Krinkle renamed this task from RFC: A more defined interface for generating Metrics in MediaWiki to RFC: Better interface for generating metrics in MediaWiki. May 5 2020, 11:17 PM

For reference here:

Patch set uploaded by @colewhite via T240685:

[mediawiki/core] (WIP) Implement a Metrics interface

https://gerrit.wikimedia.org/r/585032

Following up since it's been a while. Is there anything that we can do to help move this forward?

Per my last comment:

[…] for this phase, it's time to look/decide who would look after this component in the long term (e.g. maintenance, triage tasks, address regressions/release blockers).

I suppose you may want to talk with one of the Product and Technology teams that do backend development in MediaWiki, such as Product Infra, Core Platform, etc. If one of them is willing to take stewardship over this, then we can move forward with further fleshing this out.

@Krinkle @colewhite I think (but cannot guarantee) that this is something that could be resourced to the newly-formed team within Product Infrastructure specifically for handling data (so-called "Product Data Engineering" team). This seems to fit with the part of our mission aimed at providing tools etc. for product instrumentation support, e.g. client error logging and analytics instrumentation. I will raise this for discussion and follow up on this ticket.

Tagging @sdkim and @dcipoletti for visibility and adding this to our tracking board to make sure we follow up.

@jlinehan Awesome. Once filled in, proceed to phase 3 :-)

Should the timing metrics emitted by the new interface be mapped to Prometheus histograms or summaries? By default, statsd_exporter will map timers to summaries, and Prometheus docs warn against using summaries in aggregations,[1][2] which can be a problem when gathering metrics from multiple nodes.

What do you think? Apologies if I got this completely wrong.


[1] https://prometheus.io/docs/practices/histograms/#quantiles
[2] https://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html
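
The concern about aggregating summaries can be shown with a small worked example. The Python sketch below uses made-up latency samples for two hypothetical hosts and an arbitrary set of bucket bounds. It compares the average of per-host 90th percentiles with the 90th percentile of the combined data, and shows that cumulative histogram bucket counts, unlike quantiles, can simply be summed across hosts.

```python
import math
from typing import List, Sequence

BUCKETS = (10.0, 100.0, 1000.0)  # assumed upper bounds in milliseconds


def quantile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank quantile of the given samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def bucket_counts(samples: Sequence[float]) -> List[int]:
    """Cumulative count of samples at or below each bucket bound."""
    return [sum(1 for s in samples if s <= bound) for bound in BUCKETS]


node_a = [5.0] * 99 + [500.0]       # healthy host with one slow request
node_b = [5.0] * 50 + [500.0] * 50  # struggling host

# Averaging per-host quantiles gives a number that matches neither host
# nor the fleet as a whole.
avg_of_p90 = (quantile(node_a, 0.9) + quantile(node_b, 0.9)) / 2
true_p90 = quantile(node_a + node_b, 0.9)

# Bucket counts add up exactly, so a fleet-wide quantile can still be
# estimated after aggregation.
merged = [a + b for a, b in zip(bucket_counts(node_a), bucket_counts(node_b))]
```

With these samples the averaged value is 252.5 ms while the true fleet-wide p90 is 500 ms. The merged buckets are identical to those computed from the combined data.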

My problem is that Prometheus has a completely different paradigm than StatsD (pull vs. push), and just using an adapter seems wrong to me. Some metrics will have to follow the push model given their nature, but there are some metrics (like the ones I need to get for Wikibase change dispatching scripts to jobs) that would work much better with the pull model. I just need a metric every minute. For example, I want to know what the lowest id in a certain table is. I don't care which host the request hits, and I certainly can't rely on StatsD in appservers, since metrics there are only emitted when a request comes in, so that approach is wrong altogether.

My suggestion is to introduce an endpoint in MediaWiki that produces these metrics and have Prometheus exporters pull them every minute by making a request to that endpoint.
It can be any of these:

  • a PHP file, like metric.php or prometheus.php
  • an action API endpoint like api.php?action=metric&format=prometheus
  • a REST API endpoint like rest.php/metric/foo.bar.dat/
  • a format in the API (on par with json/xml)

I don't know which one is better here but I'm happy either way.

After introducing that endpoint, we can slowly migrate some metrics to use the Prometheus endpoint and then use the adapter (with sidecar, DogStatsD, etc.) to push the rest. Does that make sense?
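
Whichever entry point is chosen, the response body would need to be in the Prometheus text exposition format. The Python sketch below shows roughly what rendering a single gauge could look like. The function names, the metric name and the label are hypothetical, and the value is a placeholder for what would really come from a database query.

```python
from typing import Dict, Optional


def _escape(value: str) -> str:
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Render one gauge as HELP, TYPE and sample lines."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


# Hypothetical metric for the "lowest id in a table" use case.
body = render_gauge(
    "wikibase_dispatch_lowest_change_id",
    "Lowest change id still awaiting dispatch.",
    12345,
    {"wiki": "wikidatawiki"},
)
```

Because the value is computed when the scrape arrives, this suits metrics that describe current state rather than events counted across many requests.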

I will defer to others on which option of these best handles your use case, but I can provide context for this RFC:

The purpose of this RFC is to take the first steps to sunset Graphite. Since the only metrics interface in MediaWiki (that I am aware of) is StatsD, the first step is to define a more formal interface to generate metrics. Once the interface is formalized and adopted, the underlying technology can be abstracted so as to give us a path out of relying on StatsD and StatsD-compatible technologies.

I spent some time on this early last year so hopefully what I recall of that evaluation can help.

Providing a Prometheus endpoint which aggregates stats across requests requires a persistent cache that all requests can store metrics in. Several options are available, but the most prominent ones were database-backed, Redis-backed, or APC-backed. None of these options seemed appealing due to the additional network calls or the greatly increased reliance on APC for metrics. The DogStatsD solution appeared to be the best of all worlds because it:

  1. Behaves identically to StatsD with regard to transport and request-constrained lifetime.
  2. Requires no additional dependencies.
  3. Does not require incredible regex heavy lifting to translate metrics into a Prometheus-compatible format.
  4. Preserves and enables access to Prometheus' multidimensional features (labels).
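
The transport point can be sketched briefly. The Python below is illustrative only and `flush_udp` is a hypothetical name. It sends a request's buffered metric lines as one newline-separated UDP datagram and keeps no state afterwards, which is why no cross-request cache is needed on the MediaWiki side.

```python
import socket
from typing import Iterable


def flush_udp(lines: Iterable[str], host: str, port: int) -> int:
    """Send buffered metric lines in a single datagram; return bytes sent."""
    payload = "\n".join(lines).encode("utf-8")
    if not payload:
        return 0  # nothing buffered during this request
    # Fire and forget: no connection, no reply, no state kept afterwards.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

In practice a large buffer would need to be split so that each datagram stays within the network's packet size limits, which this sketch does not attempt.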

[…] but there are some metrics (like the ones I need to get for Wikibase change dispatching scripts to jobs) that would work much better with pull model.

The pull model is sometimes used to generate metrics on request, but take care to keep latency to a minimum. Prometheus has a fairly aggressive timeout by default.

Removing inactive assignee from this open task. (Please update assignees on open tasks after offboarding. Thanks.)