
RFC: Better interface for generating metrics in MediaWiki
Status: Open · Priority: Medium · Visibility: Public

Description

  • Affected components: MediaWiki core and extensions.
  • Engineer for initial implementation: @colewhite (WMF SRE Foundations)
  • Code stewards: TBD.

Motivation

The metrics interface in MediaWiki is outsourced to, and heavily integrated with, a StatsD-specific library. This arrangement renders StatsD metrics well enough and has served us for quite a while, but the current free-for-all approach has some limitations.

  1. There is no room to leverage other metrics backends or protocols.
  2. There is no clear way to impose order or standards on the metrics currently generated.
  3. There is little to no documentation describing the metrics MW+Extensions already generate and what they are intended to show.

The Observability team is pushing to deprecate Graphite/StatsD in favor of Prometheus. Prior attempts to fit the existing model to the available tools have proven difficult and are regarded as unsustainable.

Requirements

Possibly incomplete.

The system overall (from func calls in MediaWiki PHP all the way to being received by Prometheus):

  • Is sustainable overall, with no variable or dynamic components between MW and Prometheus that need to be kept up to date with which metrics are used or how.

The MW side of things:

  • Abstract away the backend implementation details of StatsD and Prometheus.
  • Support both StatsD and Prometheus (WMF will use Prometheus).
  • Remain easy to configure (toggle on via one or two simple configuration variables).
  • Remain buffered, with metrics emitted post-send (after the response has been sent).
  • Introduce no new dependencies.

Developer-facing features to keep:

  • Free-form metric names
  • Metric type: Counter.
  • Metric type: Timing.
  • Metric type: Gauge (?)

Developer-facing features to be added:

  • Metrics can be given key-value tags (separate from the canonical metric name).
  • Metrics are documented in a common way and can be browsed/discovered.
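To make the tagging requirement concrete, here is a minimal sketch of what a backend-agnostic metric object with key-value tags might look like. It is written in Python for brevity (the real interface would be PHP in MediaWiki core), and every name in it (`MetricsFactory`, `Counter`, `increment`) is an illustrative assumption, not the proposed API.

```python
# Hypothetical sketch of a backend-agnostic metrics interface with
# key-value tags. Names and signatures are illustrative only.
from collections import defaultdict


class Counter:
    """A named counter that accepts key-value tags per observation."""

    def __init__(self, name):
        self.name = name
        # Samples keyed by a stable, hashable view of the tag set.
        self.samples = defaultdict(int)

    def increment(self, value=1, tags=None):
        key = tuple(sorted((tags or {}).items()))
        self.samples[key] += value


class MetricsFactory:
    """Hands out metric objects. A real implementation would also
    buffer samples and emit them post-send to a configured backend."""

    def __init__(self):
        self._metrics = {}

    def counter(self, name):
        return self._metrics.setdefault(name, Counter(name))


stats = MetricsFactory()
stats.counter('edits_total').increment(tags={'wiki': 'enwiki'})
stats.counter('edits_total').increment(tags={'wiki': 'dewiki'})
stats.counter('edits_total').increment(tags={'wiki': 'enwiki'})
```

The key point of the sketch is that the canonical metric name stays free-form while tags travel separately, so each backend can decide how to render them.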

Exploration

Open questions
  • How will MediaWiki talk to Prometheus? Given Prometheus is poll-based, there needs to be some intermediary service that MW pushes to and Prometheus pulls from (e.g. StatsD, Redis).
  • How will key-value tags be translated into a flat metric name for the plain StatsD use case? Do we need to?
  • How will we document the metrics? (e.g. Doxygen, Markdown, etc.)
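One possible answer to the tag-flattening question above, sketched in Python for illustration (the naming scheme and ordering here are assumptions, not a decided convention): fold tag keys and values into the dotted name in a stable order, so a plain StatsD backend always receives the same flat name for the same tag set.

```python
def flatten(name, tags):
    """Flatten key-value tags into a dotted StatsD metric name.

    Tags are sorted by key so the same tag set always yields the
    same flat name, e.g. ('edits', {'wiki': 'enwiki'}) becomes
    'edits.wiki.enwiki'. Scheme is illustrative only.
    """
    parts = [name]
    for key in sorted(tags):
        parts.append(key)
        parts.append(str(tags[key]))
    return '.'.join(parts)
```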
Research

The initial proposal was to insert statsd-exporter between MW and StatsD and leverage mapping rules to generate appropriate Prometheus metrics. It was noted that the mapping rules would be difficult to maintain and introduced a circular dependency on a sidecar service managed by Puppet.
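For illustration, a statsd-exporter mapping rule looks roughly like the following (the metric names are invented for this example). Every distinct StatsD naming pattern needs its own rule, which is what makes this approach costly to keep up to date:

```yaml
# Hypothetical statsd-exporter mapping rule: extract components of a
# dotted StatsD name into Prometheus labels. One rule is needed per
# naming pattern in use.
mappings:
  - match: "MediaWiki.*.edits"
    name: "mediawiki_edits_total"
    labels:
      wiki: "$1"
```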

After that, we considered what it would take to have MW (and extensions) maintain their own statsd-exporter configuration and coalesce them on deploy. This seemed a fragile, error-prone option and had the drawback of putting an unnecessary burden on developers for a single backend solution.

After that, we explored what adopting an existing Prometheus-specific library would entail. Current options depend on Redis or disk, or make heavy use of APC. None of them seemed great in terms of reliability, resource utilization, or the current state of library development.

After that, we went back to statsd-exporter and found support for DogStatsD, a StatsD extension that adds key:value tags to metrics and uses the same UDP transport mechanism. This option has no need for a cross-request persistent backend. It is what the current implementation demonstrates, but it does have a few drawbacks:

  1. This solution requires a sidecar for translation to Prometheus.
  2. In order for statsd-exporter to automatically generate meaningful Prometheus metrics, the StatsD metrics namespace has to be tightly controlled.
  3. An extra deploy step of restarting the sidecar is strongly recommended.
NOTE: The demonstration implementation does not handle the requirement to document the metrics being generated. Discussion on how to address this point in a sustainable way is welcome and requested.
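For reference, DogStatsD carries tags inside the datagram itself after a `|#` marker, which is why no per-metric mapping rules are needed. A minimal formatter, sketched in Python (this is an illustration of the wire format, not the actual MW implementation):

```python
def dogstatsd_datagram(name, value, metric_type, tags=None):
    """Render a DogStatsD datagram string, e.g.
    'mediawiki.edits:1|c|#wiki:enwiki'.

    The datagram travels over the same UDP transport as plain
    StatsD; statsd-exporter turns the trailing tags into
    Prometheus labels automatically.
    """
    datagram = '{}:{}|{}'.format(name, value, metric_type)
    if tags:
        pairs = ','.join(
            '{}:{}'.format(k, v) for k, v in sorted(tags.items()))
        datagram += '|#' + pairs
    return datagram
```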

Event Timeline

colewhite created this task. · Apr 1 2020, 7:50 PM
Restricted Application added a subscriber: Aklapper. · Apr 1 2020, 7:50 PM
Krinkle updated the task description. · Apr 1 2020, 10:14 PM
CDanis added a subscriber: CDanis. · Apr 8 2020, 3:23 PM
Joe added a subscriber: Joe. · Apr 8 2020, 3:23 PM
Krinkle updated the task description. (Edited) · Apr 8 2020, 7:40 PM
Krinkle added a subscriber: Krinkle.

I've added some requirements that I believe were implied but worth making explicit:

  • Support both StatsD and Prometheus (WMF will use Prometheus).
  • Remain easy to configure (toggle on via one or two simple configuration variables).
  • Free-form metric names.
  • Metric types: Counter and Timing. (Do we want to keep Gauge as well?)
  • (New) Metrics can be given key-value tags (separate from the canonical metric name).
  • (New) Metrics are documented in a common way and can be browsed/discovered.

@colewhite Does that cover things? Anything unsure or still missing?

Krinkle updated the task description. · Apr 8 2020, 7:46 PM
TK-999 added a subscriber: TK-999. · Apr 10 2020, 11:29 PM
dpifke added a subscriber: dpifke. · Apr 14 2020, 8:33 PM
Krinkle moved this task from P1: Define to P2: Resource on the TechCom-RFC board. · May 3 2020, 11:16 PM

@colewhite Per T249164#6041408, check if I missed any requirements. Also for this phase, it's time to look/decide who would look after this component in the long term (e.g. maintenance, triage tasks, address regressions/release blockers).

@Krinkle that list looks right to me.

Krinkle renamed this task from "RFC: A more defined interface for generating Metrics in MediaWiki" to "RFC: Better interface for generating metrics in MediaWiki". · May 5 2020, 11:17 PM

For reference here:

Patch set uploaded by @colewhite via T240685:

[mediawiki/core] (WIP) Implement a Metrics interface

https://gerrit.wikimedia.org/r/585032

Following up since it's been a while. Is there anything that we can do to help move this forward?

> Following up since it's been a while. Is there anything that we can do to help move this forward?

Per my last comment:

> […] for this phase, it's time to look/decide who would look after this component in the long term (e.g. maintenance, triage tasks, address regressions/release blockers).

I suppose you may want to talk with one of the Product and Technology teams that do backend development in MediaWiki, such as Product Infra or Core Platform. If one of them is willing to take stewardship over this, then we can move forward with further fleshing this out.

jlinehan added a comment. (Edited) · Aug 20 2020, 3:00 PM

> […] for this phase, it's time to look/decide who would look after this component in the long term (e.g. maintenance, triage tasks, address regressions/release blockers).
>
> I suppose you may want to talk with one of the Product and Technology teams that do backend development in MediaWiki, such as Product Infra or Core Platform. If one of them is willing to take stewardship over this, then we can move forward with further fleshing this out.

@Krinkle @colewhite I think (but cannot guarantee) that this is something that could be resourced to the newly-formed team within Product Infrastructure specifically for handling data (so-called "Product Data Engineering" team). This seems to fit with the part of our mission aimed at providing tools etc. for product instrumentation support, e.g. client error logging and analytics instrumentation. I will raise this for discussion and follow up on this ticket.

Tagging @sdkim and @dcipoletti for visibility and adding this to our tracking board to make sure we follow up.

mpopov added a subscriber: mpopov. · Aug 21 2020, 8:31 PM
Krinkle assigned this task to jlinehan. · Sep 18 2020, 3:00 AM

@jlinehan Awesome. Once filled in, proceed to phase 3 :-)