MediaWiki Prometheus support
Open, High, Public

Description

As part of T228380, MediaWiki metrics need to make it into Prometheus somehow. There are a couple of ways of doing this.

  1. Statsd Exporter -- This approach was attempted here.
  2. Native support -- This client library might help.

This task is complete when MediaWiki metrics are available in Prometheus.

Event Timeline

My two cents re: prometheus_client_php's current adapters (APCu and Redis), since AIUI neither is optimal/desirable: there might be another way, inspired by what the Ruby client does (see also this PromCon 2019 talk and video). Very broadly, the idea IIRC is for each process to mmap a separate file with its own metrics; at collection time the metrics are read from all the files and merged.

We recently had a conversation about this.

  • There's no clear way to translate an arbitrary string into a more Prometheus-friendly metric. Usage within MediaWiki is similar to StatsBuffer->{increment,gauge,timer}( string name, float|int val ) (example).
    • This assumes we do no more than put statsd_exporter in front of the output stream and translate each metric from .-separated to _-separated, which loses the richness that labels give us.

@Krinkle added:

a convention like metric_name.labels.go.here isn't enough as you wouldn't want prometheus to make them key_0, key_1 or something like that

  • Code will have to be written. Likely in the form of a formal metrics interface within MediaWiki Core.

@CDanis added:

we probably want the new API to encode some more semantic information somehow? probably, have some notion of what the 'base' metric name is, and what the labels are -- and some default way to sensibly flatten that into a statsd key. I think that makes more sense than starting from a statsd key and figuring out how to split out labels
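To make that concrete, here is a rough sketch of what such a label-aware call could look like (hypothetical names only, not an existing MediaWiki API), along with one possible default flattening into a statsd key:

// Hypothetical shape only: the caller supplies a base metric name plus labels.
$metric = [
    'name' => 'edit_failures_total',
    'labels' => [ 'wiki' => 'enwiki', 'reason' => 'readonly' ],
];

// Prometheus form: mediawiki_edit_failures_total{wiki="enwiki", reason="readonly"}
// One sensible default flattening into a statsd key:
$statsdKey = 'MediaWiki.' . implode( '.', $metric['labels'] ) . '.' . $metric['name'];
// => MediaWiki.enwiki.readonly.edit_failures_total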

One idea proposed was to have some system discover the metrics MediaWiki would generate and dynamically create a statsd_exporter config file.

Per @fgiunchedi's recommendation, I put together a very basic mockup of how a DirectFileStore might look in prometheus_client_php.

As I think more about the cross-request persistence problem when using the client library, it's worth documenting my thoughts.

  1. Redis
    1. Pros
      1. Can use client library out of the box.
      2. No need for centralized Redis.
    2. Cons
      1. Have to maintain instance-local Redis instances.
  2. DirectFileStore
    1. Pros
      1. No external dependencies, just a writable directory.
    2. Cons
      1. Invented here.
      2. Risk of the disk filling up with many small files.
  3. APC
    1. Pros:
      1. No external dependencies.
      2. Can use client library out of the box.
    2. Cons:
      1. Creates 2-3 keys per metric tracking the various parts of a Prometheus metric.

All of these have the con of having to implement the client library in the first place.
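To make the APCu option above concrete, here is a minimal sketch of cross-request persistence in a shared-nothing PHP setup (the key layout is purely illustrative and not how prometheus_client_php stores its data):

$key = 'prom:counter:mediawiki_edit_total:wiki=enwiki';
apcu_add( $key, 0 );  // create the slot if it does not exist yet
apcu_inc( $key );     // atomic increment, visible to every worker on the host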

As I have said repeatedly, the big issue here is that Prometheus has a pull model that really doesn't work well with PHP's shared-nothing request model.

My first requirement for any observability framework is that it does not put the availability or performance of the application at risk.

One of the good things about statsd is that it's a fire-and-forget protocol: we just emit a UDP packet, hoping the receiver will get it.
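To illustrate, the whole statsd emit is a single UDP datagram (host, port, and metric name here are just examples); if nothing is listening, the packet is silently dropped and the request is never blocked:

$payload = 'MediaWiki.edit.success:1|c';
$socket = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
socket_sendto( $socket, $payload, strlen( $payload ), 0, '127.0.0.1', 8125 );
socket_close( $socket );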

Native Prometheus support poses some challenges, in particular when we emit a lot of metrics with high concurrency - which is the case for MediaWiki.

Let's get into the specific concerns I have for those three models:

  1. Redis: What happens if redis is overwhelmed/down? How can we control timeouts? We've already had cases where a slow-responding redis (for log reception) got us into a huge outage. Having localized outages on single servers is not exactly an ideal situation either. Also: how does this library operate? Can we concentrate sending data to redis in post-send, so that user-perceived latency is not impacted anyway? To give a simple example, say one redis server is being slow (from resource starvation of some kind) and responds to a SET request in 1 ms. If we send 50 metrics per request to prometheus this would add 50 ms of latency. Clearly unacceptable IMHO.
  2. DirectFileStore: I see a handful of problems with this approach - specifically, I don't think you'd be able to do what the Ruby client does: you can't easily mmap a file from a request and leave it available for other processes. I think it would also be very expensive to open and close a file repeatedly (as all memory and file descriptors are request-local).
  3. APCu: We make such heavy use of APCu that I'm wary of interfering with the normal operations of the wikis, as we've seen scalability problems related to both APC usage and fragmentation. We'd need to see some numbers on the expected rate of writes (basically, how many writes we can expect per MediaWiki request) before we can consider it.

Overall, I think the most promising route of the three is using APCu, but it comes with significant unknowns and risks of being rejected once we test it in production.

Now, AIUI switching from reporting metrics to statsd to reporting metrics to prometheus would require a lot of work in MediaWiki, and I'm worried that we could be doing a lot of work to get ourselves into a dead end.

As I have said before, I'd rather focus on what @Krinkle suggested in the patchset:

Metric reporting from MediaWiki more generally is an area I think that needs much improving and should be much more consistent and streamlined than it is today. Perhaps we could come up with a generic pattern that MediaWiki can control from its side, and we'd gradually migrate things towards that.

and do that in a way that allows easy, automatic translation of metrics from the statsd format to the prometheus one.

One alternative is to adopt a sidecar in the form of statsd_exporter and have it do the heavy lifting of translating MediaWiki and MW Extension metrics into a Prometheus-compatible format. I see two major pain points with this solution. The first is settling on a pattern for mapping metrics to Prometheus metrics, and the second is managing change over time.

To the first, there are a few options.

  1. Map the metrics with one-to-one rules. This has the problem of needing a rule for each of the 44,343 metrics (last measured Dec. 2018). Doing this task with code is onerous because there is no discernible pattern.
  2. Map the metrics with regex matching rules. This was attempted and resulted in a large changeset that was onerous to generate (a sample mapping rule is sketched after this list).
  3. Create a "one-size-fits-all" pattern and migrate to that pattern. This would cost a lot for a rigid solution. The primary use of metrics is to give humans a window into what the application or tool is doing. If statsd_exporter imposes a generic and strict format that is too rigid for the problem maintainers want to solve, maintainers will be forced to interface with the second pain point.

As to the second, the configuration has to be stored somewhere and updated when appropriate. There are a few options.

  1. If the configuration remains in the domain of infrastructure, then metrics changes to MediaWiki and MW Extensions create a circular dependency on Puppet changes.
  2. If the configuration moves to the domain of MediaWiki, then it or mediawiki-config will need to manage the mapping config for itself and all installed MW Extensions. For placement in mediawiki-config, this solution assumes simultaneous deployment with MediaWiki and MW Extensions.
  3. If the configuration makes MediaWiki and MW Extensions individually responsible for their statsd_exporter config, then maintainers are forced to do three things:
    1. Update the source code with metrics changes.
    2. Update the statsd_exporter configuration with StatsD->Prometheus mapping rules.
    3. Check to make sure these configuration changes do not conflict with any other mapping rule in MediaWiki and all possible MW Extensions.

On the statsd_exporter side, a little Bash could concatenate these config files together into one config and validate the YAML before startup.

As I think about it more, the core problem is that the StatsD wire format is wholly incompatible with the Prometheus format. Making it work requires a lot of configuration to convert adequately, and managing that configuration will be burdensome.

If we discard StatsD as a wire format and have MediaWiki emit something more flexible, then the need for configuration can be greatly diminished.

In response to @Joe's concerns:

What happens if redis is overwhelmed/down? How can we control timeouts?

The library has a configurable timeout with a default of 0.1s. When Redis is down, ext-redis cannot establish a connection and throws an exception.

We've already had cases where a slow-responding redis (for log reception) got us in a huge outage. Having localized outages on single servers is not exactly an ideal situation either.

This is an appeal to fear. Many other things cause both local and cluster-wide outages. The risk of using any tool has to be engineered down to an acceptable level. If the risk cannot be mitigated to an appropriate level, other options must be considered.

Also: how does this library operate? can we concentrate sending data to redis in post-send, so that user-perceived latency is not impacted anyways?

This is a great question. As it stands now, no. It would have to be augmented to do this, which eliminates the "use the client library out of the box" pro. I would even go so far as to say the lack of this feature disqualifies out-of-the-box use in favor of other options.

DirectFileStore: I see a handful of problems with this approach - specifically, I don't think you'd be able to do what the Ruby client does: you can't easily mmap a file from a request and leave it available for other processes. I think it would also be very expensive to open and close a file repeatedly (as all memory and file descriptors are request-local).

The Ruby client was only inspiration, not something to clone directly. The PHP DirectFileStore implementation opens and closes one file per request. When asked to render the metrics, the per-request files are aggregated together and added to the state file. Writing the per-request metrics file can easily be deferred to the end of the request, but rendering the metrics may be a bit slow.
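A rough sketch of the mechanism (illustrative only, not the mockup code; file paths and naming are made up): each request appends samples to its own small file, and the render step folds those files into the aggregate.

// During the request: cheap, append-only write to a per-request file.
$file = '/var/cache/metrics/' . uniqid( 'req-', true ) . '.metrics';
file_put_contents( $file, "mediawiki_edit_total 1\n", FILE_APPEND | LOCK_EX );

// At render time: fold all per-request files into the aggregate totals.
$totals = [];
foreach ( glob( '/var/cache/metrics/req-*.metrics' ) as $f ) {
    foreach ( file( $f, FILE_IGNORE_NEW_LINES ) as $line ) {
        [ $name, $value ] = explode( ' ', $line );
        $totals[$name] = ( $totals[$name] ?? 0 ) + (float)$value;
    }
    unlink( $f );
}
// $totals would then be merged into the persistent state file.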

APCu: We make such heavy use of APCu that I'm wary of interfering with the normal operations of the wikis, as we've seen scalability problems related to both APC usage and fragmentation. We'd need to see some numbers on the expected rate of writes (basically, how many writes we can expect per MediaWiki request) before we can consider it.

Others and I share your concern about using APCu: not only the rate of use, but also the way the library uses it doesn't seem to be a great fit. I do not think we should use it for this purpose.

DogStatsD shows some promise here. It's a StatsD extension that statsd_exporter supports and that enables dynamic labels. In testing, the statsd proxy doesn't support the extension, but translation is trivial if necessary.

If we proceed with DogStatsD, I could see statsd output templated as mediawiki.<extension>(.<labelValue>)+.<metric> and Prometheus metrics as mediawiki_<extension>_<name> with labels and values in the appropriate places.
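For comparison, here is how the same (made-up) timing metric would look on the wire in plain StatsD versus DogStatsD; the |#key:value suffix is what statsd_exporter turns into Prometheus labels:

# StatsD (label value embedded in the name)
mediawiki.SomeExtension.enwiki.edit_time:25|ms
# DogStatsD (label carried as a tag)
mediawiki.SomeExtension.edit_time:25|ms|#wiki:enwiki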

If we want to do it, we will still need to adapt all of the current usage of statsd in MediaWiki and decide on some standard labels to apply across the code.

I think what we need is a precise survey of what MediaWiki emits now, and from where.

Change 585032 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/585032

lmata added subscribers: AMooney, lmata.

Hi @AMooney, I'd like to present this patch as the second of the two I was hoping to bring to your attention for the next clinic duty... Please let me know if/how to proceed. Thanks!

AMooney raised the priority of this task from Medium to High. Mar 18 2021, 6:56 PM

@lmata, this needs PET code review, correct?

We would consider using this in some upcoming work if it gets merged.

Change 721626 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 721627 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Add metrics configuration options

https://gerrit.wikimedia.org/r/721627

Change 721628 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

https://gerrit.wikimedia.org/r/721628

Change 721629 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

https://gerrit.wikimedia.org/r/721629

Change 721630 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: send MetricsFactory to emit step

https://gerrit.wikimedia.org/r/721630

Change 724129 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] pass MetricsFactory instance to emitBufferedStatsdData in MWLBFactory

https://gerrit.wikimedia.org/r/724129

Change 721626 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Implement statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/721626

Change 585032 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

Reason:

in favor of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/721626/

https://gerrit.wikimedia.org/r/585032

A few things to do, given this was merged and will thus be deployed and in a stable release:

  • Add or update entry for this feature in https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_core, with a non-empty steward team, and ideally one or two named individuals who know the code well.
  • Ensure the entry links to a documentation page (or stub).
  • Ensure the entry has a Phabricator workboard for issues to go to (or create a new one). A list of all core phab tags exists at MediaWiki-General.
  • Ideally, tasks for the chosen tag automatically arrive in the triage process of at least one engineering team, e.g. through a Herald rule, so that security issues and production errors are swiftly discovered without relying on individuals to connect the dots.

Change 721629 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Perform MetricsFactory->flush() in emitBufferedStatsdData()

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721629

Change 721628 abandoned by Cwhite:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring

Reason:

in favor of I46f0a09f4dab38fa4c9495aa2da9ecab60376ca7

https://gerrit.wikimedia.org/r/721628

Change 725981 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Metrics: reduce the number of packets sent by MetricsFactory->flush()

https://gerrit.wikimedia.org/r/725981

Change 721627 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Wire up MetricsFactory into ServiceWiring and emit steps

https://gerrit.wikimedia.org/r/721627

Change 725981 merged by jenkins-bot:

[mediawiki/core@master] Metrics: reduce the number of packets sent by MetricsFactory->flush()

https://gerrit.wikimedia.org/r/725981

Change 732827 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: add enable_relay flag to statsd exporter profile

https://gerrit.wikimedia.org/r/732827

Change 732828 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] role: install statsd_exporter on canary appservers

https://gerrit.wikimedia.org/r/732828

Change 733036 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] Metrics: Add test coverage for getRenderedSamples() and send()

https://gerrit.wikimedia.org/r/733036

Change 732827 merged by Cwhite:

[operations/puppet@production] profile: add enable_relay flag to statsd exporter profile

https://gerrit.wikimedia.org/r/732827

Change 733036 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Add test coverage for getRenderedSamples() and send()

https://gerrit.wikimedia.org/r/733036

Change 725372 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] Metrics: Minor doc improvements and announce feature in release notes

https://gerrit.wikimedia.org/r/725372

Change 725372 merged by jenkins-bot:

[mediawiki/core@master] Metrics: Minor doc improvements and announce feature in release notes

https://gerrit.wikimedia.org/r/725372

Just trying to figure out the status of this feature from the comments.
Would anyone in the know be able to write a summary?

Would anyone in the know be able to write a summary?

Yes!

As I understand it, the next steps are:

  1. Configure and deploy StatsD Exporter on MediaWiki instances.
    1. Need direction on where to start. Testwiki? Beta? Canaries?
  2. Configure Prometheus to pull metrics from new StatsD Exporter instances.
  3. Update MediaWiki config to enable the library.
  4. Begin updating metrics calls to use the new library.
    1. Where to start? Maintenance jobs? MW-core?
  5. As needed: update dashboards.

I'm having trouble understanding what DogStatsD does and how it fits in. It is described on the linked webpage as "The easiest way to get your custom application metrics into Datadog" which is not a problem that I thought we had. Is it open source? Where is the source of it? Do they support this kind of use case? How much value does it provide compared to a DIY solution?

I reviewed the interface, and I have thoughts.

Why do names have dots on the wire, e.g. MediaWiki.SomeExtension.name_of_metric? Does DogStatsD convert the dots to underscores for Prometheus? If so, what is the point of preventing dots in MetricUtils::validateConfig()? The name policy is apparently the only thing preventing the old interface from becoming a backwards compatible wrapper around the new interface. I know it's not necessary with your migration plan, but allowing it would open up some more options for migration.

The old interface has getPerDbNameStatsdDataFactory(), and the new interface requires you to provide an associative array with "extension" provided every time a metric is updated. Both of these imply that the migrated call site is going to be verbose in the new interface and that wrappers will be required.

The term "extension" is awkward because there are callers that aren't extensions. In wfDeprecated() we use the term "component" which might be a better fit here.

I think it overuses associative arrays. Formal parameters can carry types both in static analysis and at runtime, and have better IDE integration. The flexibility advantage of associative arrays will mostly disappear in PHP 8.0 with the introduction of named parameters.

TimingMetric could be improved by having it actually measure time. Most callers don't want to be fiddling with microtime(true).

I think I would add a MediaWiki-specific stateful factory which would carry name components and labels, for the caller's convenience.

Something like

$serviceWiring = [
    'MyExtension.MetricFactory' => static function ( $services ) {
         // chainable + immutable, mutator-like methods clone the object
        return $services->getMetricFactory()->extension( 'MyExtension' );
    }
];

namespace MediaWiki\Extension\MyExtension;

class SpecialThing {
    public function __construct( $myExtensionMetricFactory ) {
        $this->metricFactory = $myExtensionMetricFactory
            ->wiki( /* default = current wiki */ )
            ->name( 'Thing' )
            ->label( 'color', 'green' );
    }

    public function edit() {
        $timer = $this->metricFactory->startTimer( 'edit' );
        ...
        // emit statsd metric MediaWiki.$wiki.MyExtension.Thing.green.edit
        // or prometheus metric MediaWiki_MyExtension_Thing_edit{wiki=$wiki, color=green}
        $timer->stop();
    }
}

Hey, thanks for following up! These are great questions.

I'm having trouble understanding what DogStatsD does and how it fits in. It is described on the linked webpage as "The easiest way to get your custom application metrics into Datadog" which is not a problem that I thought we had.

You are correct, integration with Datadog is not the end goal. DogStatsD has the benefit of looking and behaving like StatsD on the wire, but extends the StatsD schema to include tags, which turn into Prometheus labels. This buys us the ability to deploy a statsd_exporter sidecar with no custom parsing configuration and get Prometheus metrics out of it. We explored configuring the statsd_exporter sidecar to parse the currently-generated metrics, but found the configuration quite lengthy and unsustainable as metrics are added/removed/changed.

Is it open source? Where is the source of it? Do they support this kind of use case?

DogStatsD is a datagram format that extends StatsD with tagging. References to code seem to point to the Datadog client, which is Apache-2.0 licensed.

How much value does it provide compared to a DIY solution?

This answer is fairly lengthy. In short, we already use statsd_exporter elsewhere, it understands dogstatsd, and reliably converts these datagram packets into Prometheus metrics.

I reviewed the interface, and I have thoughts.

Why do names have dots on the wire, e.g. MediaWiki.SomeExtension.name_of_metric? Does DogStatsD convert the dots to underscores for Prometheus? If so, what is the point of preventing dots in MetricUtils::validateConfig()?

Yes, the dots are converted to underscores by statsd_exporter. Dots are not supported in Prometheus metric or label names. This validation step is to maintain label integrity when the StatsD form is rendered:

# Prometheus
MediaWiki_SomeExtension_name_of_metric{label_1="value_1", label_2="value_2"}
# StatsD
MediaWiki.SomeExtension.name_of_metric.label_1.value_1.label_2.value_2

Without maintaining name and label integrity, it is possible these metrics could be generated by the same call:

# Prometheus
MediaWiki_SomeExtension_name_of_metric{label_1="value_1", label_2="value_2"}
# StatsD
MediaWiki.SomeExtension.name.of.metric.label.1.value.1.label.2.value.2

The name policy is apparently the only thing preventing the old interface from becoming a backwards compatible wrapper around the new interface. I know it's not necessary with your migration plan, but allowing it would open up some more options for migration.

The rationale for the metric library's behavior is that Prometheus needs separation between metric name and labels, where StatsD has no such need. The new library collects the data in the way Prometheus expects it and does its best to digest it into a bare StatsD-compatible format. I think it's possible to provide an override setting that would allow users to set arbitrary metric namespaces to be used only by StatsD, but it feels fairly hacky.

The old interface has getPerDbNameStatsdDataFactory(), and the new interface requires you to provide an associative array with "extension" provided every time a metric is updated. Both of these imply that the migrated call site is going to be verbose in the new interface and that wrappers will be required.

Oof. Thanks for pointing this out. I was unaware of this abstraction. Seems like we would have to create a service that mirrors this pattern since it is config data being incorporated by ServiceWiring.
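A hypothetical sketch of such a service, building on the chainable factory idea above (none of these service or method names exist yet):

$serviceWiring = [
    'PerDbNameMetricsFactory' => static function ( $services ) {
        // pre-bind the current wiki, mirroring getPerDbNameStatsdDataFactory()
        return $services->getMetricFactory()
            ->extension( 'MediaWiki' )
            ->wiki( WikiMap::getCurrentWikiId() );
    },
];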

The term "extension" is awkward because there are callers that aren't extensions. In wfDeprecated() we use the term "component" which might be a better fit here.

This is a determination I think a core dev should make. The whole point of "extension" was to enforce uniqueness early on in the metric names to stem the possibility of naming conflicts. Metric naming conflicts are still possible in both formats due to the lack of a namespace validation feedback loop.

TimingMetric could be improved by having it actually measure time. Most callers don't want to be fiddling with microtime(true).

Good point! This seems like it would be an easy feature to add.
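A minimal sketch of what measuring time internally could look like (hypothetical method names, not the merged TimingMetric API):

class TimingMetric {
    private $startTime;

    public function start(): void {
        $this->startTime = microtime( true );
    }

    public function stop(): void {
        $elapsedMs = ( microtime( true ) - $this->startTime ) * 1000;
        $this->observe( $elapsedMs ); // record the sample as callers do today
    }

    public function observe( float $milliseconds ): void {
        // existing code path that records a timing sample
    }
}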

I think it overuses associative arrays. Formal parameters can carry types both in static analysis and at runtime, and have better IDE integration. The flexibility advantage of associative arrays will mostly disappear in PHP 8.0 with the introduction of named parameters.

I think I would add a MediaWiki-specific stateful factory which would carry name components and labels, for the caller's convenience.

Something like

$serviceWiring = [
    'MyExtension.MetricFactory' => static function ( $services ) {
         // chainable + immutable, mutator-like methods clone the object
        return $services->getMetricFactory()->extension( 'MyExtension' );
    }
];

namespace MediaWiki\Extension\MyExtension;

class SpecialThing {
    public function __construct( $myExtensionMetricFactory ) {
        $this->metricFactory = $myExtensionMetricFactory
            ->wiki( /* default = current wiki */ )
            ->name( 'Thing' )
            ->label( 'color', 'green' );
    }

    public function edit() {
        $timer = $this->metricFactory->startTimer( 'edit' );
        ...
        // emit statsd metric MediaWiki.$wiki.MyExtension.Thing.green.edit
        // or prometheus metric MediaWiki_MyExtension_Thing_edit{wiki=$wiki, color=green}
        $timer->stop();
    }
}

The overuse of associative arrays was an attempt to keep us from getting painted into a corner. I am very much interested in ways to improve the interface.

TIL extensions can register services. I like this.

From @tstarling's comment, I see a few action items:

  1. Create a service that can be a drop-in replacement for getPerDbNameStatsdDataFactory().
  2. Add timing helper functions so that users do not need to supply their own timestamps to TimingMetric.
  3. Explore the possibility of a Builder pattern to reduce dependence on associative arrays and enable extensions to register their own services with prepopulated settings and base labels.

Questions to answer:

  1. Do we want to explore the option of an override parameter so that the new metrics library can emit StatsD metrics at the old namespace name?
  2. Should "extension" as a uniqueness constraint be renamed to "component"?

From @tstarling's comment, I see a few action items:

  1. Create a service that can be a drop-in replacement for getPerDbNameStatsdDataFactory().

I think more specifically services->getStatsdDataFactory(): BufferingStatsdDataFactory, which is what the vast majority of MW production code uses. The per-dbname wrapper is fairly rarely used in practice (only Wikibase and GrowthExperiments). Supporting the per-dbname wrapper would be nice, but imho less urgent.