MediaWiki Prometheus support
Open, High, Public

Description

As part of T228380, MediaWiki metrics need to make it into Prometheus.

This task is complete when:

  • MediaWiki is capable of exporting Prometheus-style metrics
  • These metrics are consumed by the production Prometheus
  • All MediaWiki metrics and module metrics are shipped in this manner
  • All dashboards are updated to use these new metrics
  • All previous code changes needed for this are merged and deployed in production.

High level checklist:

Details

Repo                          Branch      Lines +/-
operations/mediawiki-config   master      +1 -1
operations/mediawiki-config   master      +9 -2
operations/deployment-charts  master      +36 -0
operations/deployment-charts  master      +39 -0
operations/puppet             production  +8 -1
mediawiki/core                master      +14 -4
mediawiki/core                master      +124 -101
mediawiki/core                master      +18 -27
mediawiki/core                master      +106 -3
mediawiki/core                master      +132 -29
mediawiki/core                master      +14 -15
mediawiki/core                master      +33 -0
mediawiki/core                master      +128 -306
mediawiki/core                master      +207 -100
mediawiki/core                master      +887 -401
mediawiki/core                master      +594 -139
mediawiki/core                master      +0 -0
mediawiki/core                master      +6 -6
mediawiki/core                master      +436 -431
mediawiki/core                master      +347 -495
mediawiki/core                master      +64 -14
mediawiki/core                master      +92 -0
mediawiki/core                master      +32 -16
mediawiki/core                master      +65 -65
mediawiki/core                master      +40 -34
mediawiki/core                master      +75 -69
operations/puppet             production  +8 -1
operations/puppet             production  +4 -0
mediawiki/core                master      +19 -4
mediawiki/core                master      +60 -0
mediawiki/core                master      +22 -0
mediawiki/core                master      +12 -2
mediawiki/core                master      +1 K -0
mediawiki/core                master      +996 -7

Event Timeline

Just trying to figure out the status of this feature from the comments.
Would anyone in the know be able to write a summary?

Would anyone in the know be able to write a summary?

Yes!

As I understand it, the next steps are:

  1. Configure and deploy StatsD Exporter on MediaWiki instances.
    1. Need direction on where to start. Testwiki? Beta? Canaries?
  2. Configure Prometheus to pull metrics from new StatsD Exporter instances.
  3. Update MediaWiki config to enable the library.
  4. Begin updating metrics calls to use the new library.
    1. Where to start? Maintenance jobs? MW-core?
  5. As needed: update dashboards.

I'm having trouble understanding what DogStatsD does and how it fits in. It is described on the linked webpage as "The easiest way to get your custom application metrics into Datadog" which is not a problem that I thought we had. Is it open source? Where is the source of it? Do they support this kind of use case? How much value does it provide compared to a DIY solution?

I reviewed the interface, and I have thoughts.

Why do names have dots on the wire, e.g. MediaWiki.SomeExtension.name_of_metric? Does DogStatsD convert the dots to underscores for Prometheus? If so, what is the point of preventing dots in MetricUtils::validateConfig()? The name policy is apparently the only thing preventing the old interface from becoming a backwards compatible wrapper around the new interface. I know it's not necessary with your migration plan, but allowing it would open up some more options for migration.

The old interface has getPerDbNameStatsdDataFactory(), and the new interface requires you to provide an associative array with "extension" provided every time a metric is updated. Both of these imply that the migrated call site is going to be verbose in the new interface and that wrappers will be required.

The term "extension" is awkward because there are callers that aren't extensions. In wfDeprecated() we use the term "component" which might be a better fit here.

I think it overuses associative arrays. Formal parameters can carry types both in static analysis and at runtime, and have better IDE integration. The flexibility advantage of associative arrays will mostly disappear in PHP 8.0 with the introduction of named parameters.

TimingMetric could be improved by having it actually measure time. Most callers don't want to be fiddling with microtime(true).

I think I would add a MediaWiki-specific stateful factory which would carry name components and labels, for the caller's convenience.

Something like

// ServiceWiring.php
$serviceWiring = [
    'MyExtension.MetricFactory' => static function ( $services ) {
        // chainable + immutable, mutator-like methods clone the object
        return $services->getMetricFactory()->extension( 'MyExtension' );
    }
];

// SpecialThing.php (separate file)
namespace MediaWiki\Extension\MyExtension;

class SpecialThing {
    /** Factory pre-populated with the wiki, name component and labels */
    private $metricFactory;

    public function __construct( $myExtensionMetricFactory ) {
        $this->metricFactory = $myExtensionMetricFactory
            ->wiki( /* default = current wiki */ )
            ->name( 'Thing' )
            ->label( 'color', 'green' );
    }

    public function edit() {
        $timer = $this->metricFactory->startTimer( 'edit' );
        // ... do the work being timed ...
        // emit statsd metric MediaWiki.$wiki.MyExtension.Thing.green.edit
        // or prometheus metric MediaWiki_MyExtension_Thing_edit{wiki=$wiki, color=green}
        $timer->stop();
    }
}
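
To illustrate the "chainable + immutable" idea from the proposal above, here is a rough sketch in Python rather than PHP. The method names mirror the proposed ones (extension, wiki, name, label), but the class, its fields, and the two rendering helpers are hypothetical and only meant to show how each call returns a modified copy instead of mutating shared state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MetricFactory:
    """Hypothetical immutable factory; every 'mutator' returns a clone."""
    prefix: str = "MediaWiki"
    component: str = ""
    wiki_id: str = ""
    name_parts: tuple = ()
    labels: tuple = ()  # tuple of (key, value) pairs, in insertion order

    def extension(self, component):
        return replace(self, component=component)

    def wiki(self, wiki_id):
        return replace(self, wiki_id=wiki_id)

    def name(self, part):
        return replace(self, name_parts=self.name_parts + (part,))

    def label(self, key, value):
        return replace(self, labels=self.labels + ((key, value),))

    def statsd_name(self, metric):
        # StatsD has no labels, so everything is flattened into one dotted path.
        segments = [self.prefix, self.wiki_id, self.component, *self.name_parts]
        segments += [value for _, value in self.labels]
        segments.append(metric)
        return ".".join(s for s in segments if s)

    def prometheus_name(self, metric):
        # Prometheus keeps the name and the labels separate.
        parts = [self.prefix, self.component, *self.name_parts, metric]
        base = "_".join(s for s in parts if s)
        pairs = [("wiki", self.wiki_id), *self.labels]
        rendered = ", ".join(f'{k}="{v}"' for k, v in pairs if v)
        return f"{base}{{{rendered}}}"


base = MetricFactory().extension("MyExtension")
thing = base.wiki("testwiki").name("Thing").label("color", "green")
print(thing.statsd_name("edit"))
print(thing.prometheus_name("edit"))
```

Because the dataclass is frozen, `base` can be registered once as a shared service and handed to many callers; none of them can change the name components or labels another caller sees.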

Hey, thanks for following up! These are great questions.

I'm having trouble understanding what DogStatsD does and how it fits in. It is described on the linked webpage as "The easiest way to get your custom application metrics into Datadog" which is not a problem that I thought we had.

You are correct, integration with Datadog is not the end goal. DogStatsD has the benefit of looking and behaving like StatsD on the wire, but it extends the StatsD schema with tags, which turn into Prometheus labels. This buys us the ability to deploy a statsd_exporter sidecar with no custom parsing configuration and get Prometheus metrics out of it. We explored configuring the statsd_exporter sidecar to parse the currently generated metrics, but found the configuration quite lengthy and unsustainable as metrics are added, removed, or changed.
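
To make the tagging extension concrete, here is a minimal Python sketch of how a DogStatsD datagram differs from a plain StatsD one. The function name `format_dogstatsd` and the sample metric and tag names are made up for illustration; only the `name:value|type|#key:value,...` layout is taken from the DogStatsD datagram format.

```python
def format_dogstatsd(name, value, metric_type, tags=None):
    """Render one DogStatsD datagram line.

    Plain StatsD stops at `name:value|type`; DogStatsD appends an
    optional `|#key:value,...` section that carries the tags.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sorted only to make the rendering deterministic.
        rendered = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{rendered}"
    return line


plain = format_dogstatsd("MediaWiki.edit_total", 1, "c")
tagged = format_dogstatsd(
    "MediaWiki.edit_total", 1, "c", {"wiki": "testwiki", "component": "core"}
)
print(plain)   # plain StatsD counter
print(tagged)  # same counter with DogStatsD tags attached
```

The tagged form is what lets the exporter produce labels without per-metric mapping rules: the metric name and the key/value pairs arrive already separated.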

Is it open source? Where is the source of it? Do they support this kind of use case?

DogStatsD is a datagram format that extends StatsD with tagging. The references to code seem to point to the Datadog client, which is Apache 2.0 licensed.

How much value does it provide compared to a DIY solution?

This answer is fairly lengthy. In short, we already use statsd_exporter elsewhere, it understands DogStatsD, and it reliably converts these datagrams into Prometheus metrics.

I reviewed the interface, and I have thoughts.

Why do names have dots on the wire, e.g. MediaWiki.SomeExtension.name_of_metric? Does DogStatsD convert the dots to underscores for Prometheus? If so, what is the point of preventing dots in MetricUtils::validateConfig()?

Yes, the dots are converted to underscores by statsd_exporter. Dots are not supported in Prometheus metric or label names. This validation step maintains label integrity when StatsD is rendered:

# Prometheus
MediaWiki_SomeExtension_name_of_metric{label_1="value_1", label_2="value_2"}
# StatsD
MediaWiki.SomeExtension.name_of_metric.label_1.value_1.label_2.value_2

Without maintaining name and label integrity, it is possible these metrics could be generated by the same call:

# Prometheus
MediaWiki_SomeExtension_name_of_metric{label_1="value_1", label_2="value_2"}
# StatsD
MediaWiki.SomeExtension.name.of.metric.label.1.value.1.label.2.value.2
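
The ambiguity shown above can be sketched in a few lines of Python. This is not the library's code: the function names, the regular expression, and the choice to validate label values only on the StatsD side are assumptions made for illustration. The point is that once dots are rejected in every component, the dotted StatsD path can only be split one way.

```python
import re

# Assumption for this sketch: letters, digits and underscores only,
# so a component can never contain the StatsD separator (a dot).
_VALID = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate(part):
    if not _VALID.match(part):
        raise ValueError(f"invalid name or label: {part!r}")
    return part


def render_prometheus(prefix, component, name, labels):
    base = "_".join(validate(p) for p in (prefix, component, name))
    rendered = ", ".join(f'{validate(k)}="{v}"' for k, v in labels.items())
    return f"{base}{{{rendered}}}"


def render_statsd(prefix, component, name, labels):
    segments = [validate(p) for p in (prefix, component, name)]
    for key, value in labels.items():
        # Label values become path segments here, so they are validated too.
        segments += [validate(key), validate(value)]
    return ".".join(segments)


labels = {"label_1": "value_1", "label_2": "value_2"}
print(render_prometheus("MediaWiki", "SomeExtension", "name_of_metric", labels))
print(render_statsd("MediaWiki", "SomeExtension", "name_of_metric", labels))

try:
    render_statsd("MediaWiki", "SomeExtension", "name.of.metric", labels)
except ValueError as err:
    print("rejected:", err)
```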

The name policy is apparently the only thing preventing the old interface from becoming a backwards compatible wrapper around the new interface. I know it's not necessary with your migration plan, but allowing it would open up some more options for migration.

The rationale for the metric library behavior is that Prometheus needs separation between metric name and labels, whereas StatsD has no such need. The new library collects the data in the way Prometheus expects it and tries its best to digest it into a bare StatsD-compatible format. I think it's possible to provide an override setting that would allow users to set arbitrary metric namespaces to be used only by StatsD, but it feels fairly hacky.

The old interface has getPerDbNameStatsdDataFactory(), and the new interface requires you to provide an associative array with "extension" provided every time a metric is updated. Both of these imply that the migrated call site is going to be verbose in the new interface and that wrappers will be required.

Oof. Thanks for pointing this out. I was unaware of this abstraction. Seems like we would have to create a service that mirrors this pattern since it is config data being incorporated by ServiceWiring.

The term "extension" is awkward because there are callers that aren't extensions. In wfDeprecated() we use the term "component" which might be a better fit here.

This is a determination I think a core dev should make. The whole point of "extension" was to enforce uniqueness early on in the metric names to stem the possibility of naming conflicts. Metric naming conflicts are still possible in both formats due to the lack of a namespace validation feedback loop.

TimingMetric could be improved by having it actually measure time. Most callers don't want to be fiddling with microtime(true).

Good point! This seems like it would be an easy feature to add.
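
A rough sketch of what such a helper could look like, written in Python for brevity. The class below is hypothetical and is not the MediaWiki TimingMetric; the `clock` parameter and the `samples_ms` list are assumptions added so the behavior can be shown and tested without real delays.

```python
import time


class TimingMetric:
    """Hypothetical timing metric that measures elapsed time itself."""

    def __init__(self, name, clock=time.perf_counter):
        self.name = name
        self.samples_ms = []   # observed durations, in milliseconds
        self._clock = clock    # injectable so tests can use a fake clock
        self._started = None

    def start(self):
        self._started = self._clock()
        return self

    def stop(self):
        if self._started is None:
            raise RuntimeError("stop() called before start()")
        elapsed_ms = (self._clock() - self._started) * 1000.0
        self._started = None
        self.samples_ms.append(elapsed_ms)
        return elapsed_ms

    # Context-manager support, so `with metric:` times a block.
    def __enter__(self):
        return self.start()

    def __exit__(self, exc_type, exc, tb):
        self.stop()
        return False  # never swallow exceptions from the timed block


timer = TimingMetric("edit_duration")
with timer:
    sum(range(1000))  # stand-in for the work being measured
print(len(timer.samples_ms))
```

Callers then never touch a timestamp: they either pair start() with stop() or wrap the work in a `with` block.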

I think it overuses associative arrays. Formal parameters can carry types both in static analysis and at runtime, and have better IDE integration. The flexibility advantage of associative arrays will mostly disappear in PHP 8.0 with the introduction of named parameters.

The overuse of associative arrays was an attempt to keep us from getting painted into a corner. I am very much interested in ways to improve the interface.

TIL extensions can register services. I like this.

From @tstarling's comment, I see a few action items:

  1. Create a service that can be a drop-in replacement for getPerDbNameStatsdDataFactory().
  2. Add timing helper functions so that users do not need to supply their own timestamps to TimingMetric.
  3. Explore the possibility of a Builder pattern to reduce dependence on associative arrays and enable extensions to register their own services with prepopulated settings and base labels.

Questions to answer:

  1. Do we want to explore the option of an override parameter so that the new metrics library can emit StatsD metrics at the old namespace name?
  2. Should "extension" as a uniqueness constraint be renamed to "component"?

From @tstarling's comment, I see a few action items:

  1. Create a service that can be a drop-in replacement for getPerDbNameStatsdDataFactory().

I think more specifically services->getStatsdDataFactory(): BufferingStatsdDataFactory, which is what the vast majority of MW production code uses. The per-dbname wrapper is fairly rarely used in practice (only by Wikibase and GrowthExperiments). Supporting the per-dbname wrapper would be nice, but imho less urgent.

Change 854630 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: rename extension to component

https://gerrit.wikimedia.org/r/854630

Change 854631 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: expand Sample instance parameters

https://gerrit.wikimedia.org/r/854631

Change 854632 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: move metric implementations to subdirectory

https://gerrit.wikimedia.org/r/854632

Change 855127 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: add metric property accessors

https://gerrit.wikimedia.org/r/855127

Change 855128 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: implement MetricsCache

https://gerrit.wikimedia.org/r/855128

Change 855129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: move format configuration to separate class

https://gerrit.wikimedia.org/r/855129

Change 855130 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: refactor rendering interface

https://gerrit.wikimedia.org/r/855130

Change 855626 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: simplify MetricUtils, introduce BaseMetricInterface

https://gerrit.wikimedia.org/r/855626

Change 855627 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: wire up MetricsUDPEmitter

https://gerrit.wikimedia.org/r/855627

Change 855633 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: simplify metrics configuration, enforce builder pattern

https://gerrit.wikimedia.org/r/855633

Change 857057 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: use MetricsInterface where needed

https://gerrit.wikimedia.org/r/857057

Change 857060 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: add timing start and stop helper functions

https://gerrit.wikimedia.org/r/857060

Change 857061 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: add static labels feature

https://gerrit.wikimedia.org/r/857061

Change 857062 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: add copy to statsd feature

https://gerrit.wikimedia.org/r/857062

Change 854630 merged by jenkins-bot:

[mediawiki/core@master] Metrics: rename extension to component

https://gerrit.wikimedia.org/r/854630

Change 854631 merged by jenkins-bot:

[mediawiki/core@master] Metrics: expand Sample instance parameters

https://gerrit.wikimedia.org/r/854631

Change 854632 merged by jenkins-bot:

[mediawiki/core@master] Metrics: move metric implementations to subdirectory

https://gerrit.wikimedia.org/r/854632

Change 855127 merged by jenkins-bot:

[mediawiki/core@master] Metrics: add metric property accessors

https://gerrit.wikimedia.org/r/855127

Change 855128 merged by jenkins-bot:

[mediawiki/core@master] Metrics: implement MetricsCache class

https://gerrit.wikimedia.org/r/855128

Change 855129 merged by jenkins-bot:

[mediawiki/core@master] Metrics: move format configuration to separate class

https://gerrit.wikimedia.org/r/855129

Change 855130 merged by jenkins-bot:

[mediawiki/core@master] Metrics: refactor rendering interface

https://gerrit.wikimedia.org/r/855130

Change 855626 merged by jenkins-bot:

[mediawiki/core@master] Metrics: simplify MetricUtils, introduce BaseMetricInterface

https://gerrit.wikimedia.org/r/855626

Change 891868 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: rename Metrics lib to Stats

https://gerrit.wikimedia.org/r/891868

Change 855627 merged by jenkins-bot:

[mediawiki/core@master] Metrics: refactor emitter instantiation

https://gerrit.wikimedia.org/r/855627

Change 891868 merged by jenkins-bot:

[mediawiki/core@master] Metrics: rename Metrics lib to Stats

https://gerrit.wikimedia.org/r/891868

Change 891882 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Stats: move stats library into Stats folder

https://gerrit.wikimedia.org/r/891882

Change 891882 merged by jenkins-bot:

[mediawiki/core@master] Stats: move stats library into Stats folder

https://gerrit.wikimedia.org/r/891882

Change 893056 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] tests: Move stats library into Stats folder

https://gerrit.wikimedia.org/r/893056

Change 893056 merged by jenkins-bot:

[mediawiki/core@master] tests: Move stats library into Stats folder

https://gerrit.wikimedia.org/r/893056

Change 894577 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Stats: make component optional

https://gerrit.wikimedia.org/r/894577

Change 855633 merged by jenkins-bot:

[mediawiki/core@master] Stats: simplify metrics configuration, enforce builder pattern

https://gerrit.wikimedia.org/r/855633

Change 857057 merged by jenkins-bot:

[mediawiki/core@master] Stats: use MetricsInterface where needed

https://gerrit.wikimedia.org/r/857057

Change 857060 merged by jenkins-bot:

[mediawiki/core@master] Stats: add timing start and stop helper functions

https://gerrit.wikimedia.org/r/857060

Change 857061 merged by jenkins-bot:

[mediawiki/core@master] Stats: add static labels feature

https://gerrit.wikimedia.org/r/857061

Change 857062 merged by jenkins-bot:

[mediawiki/core@master] Stats: add copy to statsd feature

https://gerrit.wikimedia.org/r/857062

Change 900706 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] mediawiki: provision statsd_exporter on canary_appserver

https://gerrit.wikimedia.org/r/900706

Change 894577 merged by jenkins-bot:

[mediawiki/core@master] Stats: make component optional

https://gerrit.wikimedia.org/r/894577

Change 955015 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/mediawiki-config@master] Add StatsLib settings for Test env

https://gerrit.wikimedia.org/r/955015

Change 955016 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Convert executeTiming to StatsLib

https://gerrit.wikimedia.org/r/955016

Change 955016 merged by jenkins-bot:

[mediawiki/core@master] MediaWiki.php: Convert executeTiming metric to new Stats library

https://gerrit.wikimedia.org/r/955016

Change 900706 abandoned by Cwhite:

[operations/puppet@production] profile: provision statsd_exporter on canary_appserver

Reason:

superseded by T345377

https://gerrit.wikimedia.org/r/900706

Change 972342 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-debug: add statsd-exporter

https://gerrit.wikimedia.org/r/972342

Change 972343 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: add statsd exporter

https://gerrit.wikimedia.org/r/972343

Change 972342 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: add statsd-exporter

https://gerrit.wikimedia.org/r/972342

Change 972343 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: add statsd exporter

https://gerrit.wikimedia.org/r/972343

Change 955015 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgStatsTarget for requests to mwdebug

https://gerrit.wikimedia.org/r/955015

Mentioned in SAL (#wikimedia-operations) [2023-12-13T22:05:49Z] <jhuneidi@deploy2002> Started scap: Backport for [[gerrit:982857|Partially undeploy Reader Demographics 2 survey (T344393)]], [[gerrit:955015|Enable $wgStatsTarget for requests to mwdebug (T240685)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-13T22:07:13Z] <jhuneidi@deploy2002> dani and jhuneidi and cwhite: Backport for [[gerrit:982857|Partially undeploy Reader Demographics 2 survey (T344393)]], [[gerrit:955015|Enable $wgStatsTarget for requests to mwdebug (T240685)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-13T22:18:22Z] <jhuneidi@deploy2002> Finished scap: Backport for [[gerrit:982857|Partially undeploy Reader Demographics 2 survey (T344393)]], [[gerrit:955015|Enable $wgStatsTarget for requests to mwdebug (T240685)]] (duration: 12m 33s)

Change 982867 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/mediawiki-config@master] Update wgStatsTarget to port 9125

https://gerrit.wikimedia.org/r/982867

Change 982867 merged by jenkins-bot:

[operations/mediawiki-config@master] Update wgStatsTarget to port 9125

https://gerrit.wikimedia.org/r/982867

Mentioned in SAL (#wikimedia-operations) [2023-12-13T22:48:01Z] <jhuneidi@deploy2002> Started scap: Backport for [[gerrit:982867|Update wgStatsTarget to port 9125 (T240685)]], [[gerrit:982925|[BC] Enable desktop diff and history pages on mobile (T350181 T353388)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-13T22:49:33Z] <jhuneidi@deploy2002> jhuneidi and jdlrobson and cwhite: Backport for [[gerrit:982867|Update wgStatsTarget to port 9125 (T240685)]], [[gerrit:982925|[BC] Enable desktop diff and history pages on mobile (T350181 T353388)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-13T22:57:43Z] <jhuneidi@deploy2002> Finished scap: Backport for [[gerrit:982867|Update wgStatsTarget to port 9125 (T240685)]], [[gerrit:982925|[BC] Enable desktop diff and history pages on mobile (T350181 T353388)]] (duration: 09m 42s)