* Affected components: MediaWiki core and extensions.
* Engineer for initial implementation: @colewhite (WMF SRE Foundations)
* Code stewards: TBD.
### Motivation
The metrics interface in MediaWiki is delegated to an external, StatsD-specific library and is heavily integrated with it. This produces StatsD metrics well enough and has served us well for quite a while, but the current free-for-all approach has some limitations:
1. There is no room to leverage other metrics backends or protocols.
1. There is no clear way to impose order or standards on the metrics that are currently generated.
1. There is little to no documentation describing the metrics MW+Extensions already generate and what they are intended to show.
The Observability team is pushing to deprecate Graphite/StatsD in favor of Prometheus. [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/481110 | Prior attempts ]] to use the existing model with the available tools have proven difficult and are regarded as unsustainable.
##### Requirements
//Possibly incomplete.//
The system overall (from function calls in MediaWiki PHP all the way to being received by Prometheus):
* Is sustainable, with no variable or dynamic components between MW and Prometheus that need to be kept up to date with how or what metrics are used.
The MW side of things:
* Abstract away the backend implementation details of StatsD and Prometheus.
* Support both StatsD and Prometheus (WMF will use Prometheus).
* Remain easy to configure (toggle on via one or two simple configuration variables).
* Remain buffered, with metrics emitted post-send (after the response has been sent to the client).
* Introduce no new dependencies.
Developer-facing features to keep:
* Free-form metric names
* Metric type: Counter.
* Metric type: Timing.
* Metric type: Gauge (?)
Developer-facing features to be added:
* Metrics can be given key-value tags (separate from the canonical metric name).
* Metrics are documented in a common way and can be browsed/discovered.
-------
### Exploration
##### Open questions
* How will MediaWiki talk to Prometheus? Given that Prometheus is poll-based, there needs to be some intermediary service that MW pushes to and Prometheus pulls from (e.g. StatsD, Redis, etc.)
* How will key-value tags be translated into a flat metric name for the plain StatsD use case? Do we need to?
* How will we document the metrics? (e.g. Doxygen, Markdown, etc.)
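
To make the tag-flattening question concrete, here is a minimal sketch of one possible answer, not a decision. It assumes tag values are appended to the metric name in sorted key order. The function names, the sanitizing rule, and the example metric are hypothetical, and Python is used only for illustration (the real implementation would live in MediaWiki PHP).

```python
import re


def sanitize(component):
    """Replace characters that would be ambiguous in a dotted StatsD name."""
    return re.sub(r"[^A-Za-z0-9_]", "_", component)


def flatten_metric(name, tags):
    """Fold key-value tags into a flat, dot-separated StatsD metric name.

    Assumption: tag values are appended in sorted key order so that the same
    tag set always yields the same flat name. Tag keys themselves are dropped.
    """
    parts = [name] + [sanitize(str(tags[key])) for key in sorted(tags)]
    return ".".join(parts)


# Hypothetical metric and tags, for illustration only.
flat = flatten_metric("mediawiki.edit_total", {"wiki": "enwiki", "result": "success"})
print(flat)  # mediawiki.edit_total.success.enwiki
```

One trade-off of this approach is that the flat name no longer records which tag each component came from, so the ordering convention would itself need to be documented.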
##### Research
The initial proposal was to insert statsd-exporter between MW and StatsD and leverage mapping rules to generate appropriate Prometheus metrics. It was noted that the mapping rules would be difficult to maintain and would introduce a circular dependency on a sidecar service managed by Puppet.
After that, we considered what it would take to have MW (and extensions) maintain their own statsd-exporter configuration and coalesce them on deploy. This seemed a fragile, error-prone option and had the drawback of putting an unnecessary burden on developers for a single backend solution.
After that, we explored what adopting an existing Prometheus-specific library would entail. Current options depend on Redis or disk, or make heavy use of APC. None of these options seemed great in terms of reliability, resource utilization, or the current state of library development.
After that, we went back to statsd-exporter and found support for DogStatsD, a StatsD extension that adds key:value tags to metrics and uses the same UDP transport mechanism. This option has no need for a cross-request persistent backend. This is what the current implementation demonstrates, but it does have a few drawbacks:
1. This solution requires a sidecar for translation to Prometheus.
1. In order for statsd-exporter to automatically generate meaningful Prometheus metrics, the StatsD metrics namespace has to be tightly controlled.
1. An extra deploy step of restarting the sidecar is strongly recommended.
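
For reference, a minimal sketch of what a DogStatsD datagram looks like on the wire. DogStatsD extends the plain StatsD line `name:value|type` with an optional `|#key:value,...` tag section. The function name and the example metrics are hypothetical, and Python is used only for illustration; this is not the demonstration implementation.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Render one metric as a DogStatsD line.

    metric_type uses the usual StatsD codes: "c" (counter), "ms" (timing),
    "g" (gauge). Tags are sorted by key here only to make output deterministic.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        tag_section = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_section}"
    return line


# Hypothetical metrics, for illustration only.
counter = format_dogstatsd("mediawiki.edit_total", 1, "c", {"wiki": "enwiki", "result": "success"})
timing = format_dogstatsd("mediawiki.parse_seconds", 320, "ms")

print(counter)  # mediawiki.edit_total:1|c|#result:success,wiki:enwiki
print(timing)   # mediawiki.parse_seconds:320|ms
```

Because the tags travel alongside the metric name rather than inside it, statsd-exporter can turn them into Prometheus labels without per-metric mapping rules, provided the metric names themselves stay well-formed.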
NOTE: The demonstration implementation does not handle the requirement to document the metrics being generated. Discussion on how to address this point in a sustainable way is welcome and requested.