
MediaWiki Prometheus support
Open, High, Public

Description

As part of T228380, MediaWiki metrics need to make it into Prometheus somehow. There are a couple of ways of doing this.

  1. Statsd Exporter -- This approach was attempted here.
  2. Native support -- This client library might help.

This task is complete when MediaWiki metrics are available in Prometheus.

Event Timeline

My two cents re: prometheus_client_php's current adapters, APCu and Redis, since AIUI neither is optimal/desirable: there might be another way, inspired by what the Ruby client does (see also this PromCon 2019 talk and video). Very broadly, the idea IIRC is for each process to mmap a separate file holding its own metrics; at collection time the files are read and their metrics merged.

We recently had a conversation about this.

  • There's no clear way to translate an arbitrary string into a more Prometheus-friendly metric. Usage within MediaWiki is roughly StatsBuffer->{increment,gauge,timer}(string name, float|int val) (Example).
    • This assumes we forgo putting statsd exporter in front of the output stream and instead translate each metric name from .-separated to _-separated, which loses the richness that labels give us; a sketch follows this list.
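
A tiny illustration of that loss (the metric name and helper function are invented for the example):

```php
<?php
// Illustrative only: the metric name is invented, and naiveTranslate() is a
// stand-in for doing nothing smarter than a character swap in the pipeline.
function naiveTranslate( string $statsdKey ): string {
	// "." separated -> "_" separated, exactly as described above.
	return strtolower( str_replace( '.', '_', $statsdKey ) );
}

// Current usage packs would-be label values into one dotted key:
$key = 'MediaWiki.edit.failures.enwiki.ratelimited';

echo naiveTranslate( $key ), "\n";
// => mediawiki_edit_failures_enwiki_ratelimited
// One flat series per wiki/reason combination, with no labels left to
// aggregate or filter on.
```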

@Krinkle added:

a convention like metric_name.labels.go.here isn't enough as you wouldn't want prometheus to make them key_0, key_1 or something like that

  • Code will have to be written. Likely in the form of a formal metrics interface within MediaWiki Core.

@CDanis added:

we probably want the new API to encode some more semantic information somehow? probably, have some notion of what the 'base' metric name is, and what the labels are -- and some default way to sensibly flatten that into a statsd key. I think that makes more sense than starting from a statsd key and figuring out how to split out labels
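
A very rough sketch of the shape being described; none of these class or method names exist in MediaWiki core, they are only here to show the direction:

```php
<?php
// Hypothetical sketch only. The point: a metric is declared with a base name
// plus label names, and the flat statsd key is *derived* from that, not the
// other way around.
class LabelledCounter {
	public function __construct(
		private string $baseName,  // e.g. "edit_failures_total"
		private array $labelNames  // e.g. [ 'wiki', 'reason' ]
	) {
	}

	public function increment( array $labelValues, float $value = 1.0 ): void {
		// Default flattening into a legacy dotted statsd key:
		$statsdKey = 'MediaWiki.' . $this->baseName . '.' . implode( '.', $labelValues );

		// A Prometheus-aware backend keeps the structure instead:
		$pairs = array_map(
			static fn ( $name, $val ) => "$name=\"$val\"",
			$this->labelNames,
			$labelValues
		);
		$promSeries = 'mediawiki_' . $this->baseName . '{' . implode( ',', $pairs ) . '}';

		echo "$statsdKey:$value|c\n"; // MediaWiki.edit_failures_total.enwiki.ratelimited:1|c
		echo "$promSeries $value\n";  // mediawiki_edit_failures_total{wiki="enwiki",reason="ratelimited"} 1
	}
}

( new LabelledCounter( 'edit_failures_total', [ 'wiki', 'reason' ] ) )
	->increment( [ 'enwiki', 'ratelimited' ] );
```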

One idea proposed was to have some system discover the metrics MediaWiki would generate and dynamically create a statsd exporter config file from them.

Per @fgiunchedi's recommendation, I put together a very basic mockup of how a DirectFileStore might look in prometheus_client_php.

As I think more about the cross-request persistence problem with the client library, it's worth documenting my thoughts.

  1. Redis
    1. Pros
      1. Can use client library out of the box.
      2. No need for centralized Redis.
    2. Cons
      1. Have to maintain instance-local Redis instances.
  2. DirectFileStore
    1. Pros
      1. No external dependencies, just a writable directory.
    2. Cons
      1. Invented here (custom code we would have to maintain).
      2. The disk could fill up with many small files.
  3. APC
    1. Pros:
      1. No external dependencies.
      2. Can use client library out of the box.
    2. Cons:
      1. Creates 2-3 keys per metric tracking the various parts of a Prometheus metric.

All of these share the con of having to adopt and wire up the client library in the first place; a sketch of what that could look like follows.
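
Minimal sketch of adopting prometheus_client_php with the APCu adapter, based on my reading of the library's README; class and method names should be double-checked against whichever release we would actually deploy:

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // composer autoload assumed

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\APC;

$registry = new CollectorRegistry( new APC() );

// Each labelled series becomes a handful of APCu entries, which is where the
// "2-3 keys per metric" con above comes from.
$counter = $registry->getOrRegisterCounter(
	'mediawiki',            // namespace
	'edit_failures_total',  // name (invented for the example)
	'Failed edit attempts', // help text
	[ 'wiki', 'reason' ]    // label names
);
$counter->incBy( 1, [ 'enwiki', 'ratelimited' ] );

// Scrape endpoint: render whatever is currently accumulated in APCu.
header( 'Content-Type: ' . RenderTextFormat::MIME_TYPE );
echo ( new RenderTextFormat() )->render( $registry->getMetricFamilySamples() );
```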

As I have said repeatedly, the big issue here is that Prometheus has a pull model that really doesn't work well with PHP's request management model, which is shared-nothing.

My first requirement for any observability framework is that it does not put the availability or performance of the application at risk.

One of the good things about statsd is that it's a fire-and-forget protocol: we just emit a UDP packet and hope the receiver gets it.

Native Prometheus support poses some challenges, in particular when we emit a lot of metrics at high concurrency - which is the case for MediaWiki.

Let's get into specific concerns I have with those three models:

  1. Redis: What happens if redis is overwhelmed/down? How can we control timeouts? We've already had cases where a slow-responding redis (for log reception) got us in a huge outage. Having localized outages on single servers is not exactly an ideal situation either. Also: how does this library operate? can we concentrate sending data to redis in post-send, so that user-perceived latency is not impacted anyways? To take a simple example, say one redis server is being slow (due to resource starvation of some kind) and responds to a SET request in 1 ms. If we send 50 metrics per request, that adds 50 ms of latency - clearly unacceptable IMHO.
  2. DirectFileStore: I see a handful of problems with this approach - specifically, I don't think you'd be able to do what the Ruby client does: you can't easily mmap a file from a request and leave it available for other processes. I think it would also be very expensive to open and close a file repeatedly (as all memory and file descriptors are request-local).
  3. APCu: We make such heavy use of APCu that I'm wary of interfering with the normal operation of the wikis, as we've seen scalability problems related to both APC usage and fragmentation. We'd need to see some numbers on the expected rate of writes (basically, how many writes we can expect per MediaWiki request) before we can consider it.

Overall, I think the most promising route of the three is using APCu, but it comes with significant unknowns and risks of being rejected once we test it in production.

Now, AIUI switching from reporting metrics to statsd to reporting metrics to prometheus would require a lot of work in MediaWiki, and I'm worried that we could be doing a lot of work to get ourselves into a dead end.

As I have said before, I'd rather focus on what @Krinkle suggested in the patchset:

Metric reporting from MediaWiki more generally is an area I think that needs much improving and should be much more consistent and streamlined than it is today. Perhaps we could come up with a generic pattern that MediaWiki can control from its side, and we'd gradually migrate things towards that.

and do that in a way that allows easy, automatic translation of metrics from the statsd format to the prometheus one.

One alternative is to adopt a sidecar in the form of statsd_exporter and have it do the heavy lifting of translating MediaWiki and MW Extension metrics into a Prometheus-compatible format. I see two major pain points with this solution. The first is settling on a pattern for mapping metrics to Prometheus metrics, and the second is managing change over time.

To the first, there are a few options.

  1. Map the metrics with one-to-one rules. This has the problem of needing a rule for each of the 44,343 distinct metrics (last measured Dec. 2018). Generating these rules with code is onerous because there is no discernible pattern.
  2. Map the metrics with regex matching rules. This was attempted and resulted in a large changeset that was onerous to generate (see the example rule after this list).
  3. Create a "one-size-fits-all" pattern and migrate to that pattern. This would cost a lot for a rigid solution. The primary use of metrics is to give humans a window into what the application or tool is doing. If statsd_exporter imposes a generic and strict format that is too rigid for the problem maintainers want to solve, maintainers will be forced to interface with the second pain point.
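
For illustration, a single regex mapping rule (option 2) might look roughly like this in a statsd_exporter config; the MediaWiki metric and label names are invented for the example:

```yaml
# Illustrative only: the metric and label names are made up, but the rule
# shape follows statsd_exporter's regex mapping syntax.
mappings:
  - match: "MediaWiki\\.editResponseTime\\.(\\w+)\\.(\\w+)"
    match_type: regex
    name: "mediawiki_edit_response_time_seconds"
    labels:
      page_type: "$1"
      editor: "$2"
```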

As to the second, the configuration has to be stored somewhere and updated when appropriate. There are a few options.

  1. If the configuration remains in the domain of infrastructure, then metrics changes to MediaWiki and MW Extensions create a circular dependency on Puppet changes.
  2. If the configuration moves to the domain of MediaWiki, then it or mediawiki-config will need to manage the mapping config for itself and all installed MW Extensions. For placement in mediawiki-config, this solution assumes simultaneous deployment with MediaWiki and MW Extensions.
  3. If MediaWiki and MW Extensions are made individually responsible for their own statsd_exporter config, then maintainers are forced to do three things:
    1. Update the source code with metrics changes.
    2. Update the statsd_exporter configuration with StatsD->Prometheus mapping rules.
    3. Check to make sure these configuration changes do not conflict with any other mapping rule in MediaWiki and all possible MW Extensions.

On the statsd_exporter side, a little Bash could concatenate these config files into a single config and validate the YAML before startup.

As I think about it more, the core issue is that the StatsD wire format is wholly incompatible with the Prometheus format. To make it work, StatsD requires a lot of configuration to convert adequately, and managing that configuration will be burdensome.

If we discard StatsD as a wire format and have MediaWiki emit something more flexible, then the need for configuration can be greatly diminished.
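
To make the incompatibility concrete, here is the same hypothetical measurement in both wire formats (metric names are illustrative):

```
# statsd wire format: everything packed into one dotted key, no labels
MediaWiki.api.query.executeTiming:123|ms

# Prometheus exposition format: base name, labels, and a base-unit value
mediawiki_api_execute_duration_seconds{module="query"} 0.123
```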

In response to @Joe's concerns:

What happens if redis is overwhelmed/down? How can we control timeouts?

The library has a configurable timeout with a default of 0.1s. When Redis is down, ext-redis cannot establish a connection and throws an exception.
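
For reference, this is roughly how that timeout is set on the library's Redis adapter as I recall its options API; the values shown are illustrative:

```php
<?php
// Composer autoload assumed; 'timeout' is the connect timeout with the
// 0.1 s default mentioned above.
require __DIR__ . '/vendor/autoload.php';

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

Redis::setDefaultOptions( [
	'host' => '127.0.0.1',
	'port' => 6379,
	'timeout' => 0.1,       // connect timeout in seconds
	'read_timeout' => '10', // socket read timeout in seconds
] );

$registry = new CollectorRegistry( new Redis() );
```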

We've already had cases where a slow-responding redis (for log reception) got us in a huge outage. Having localized outages on single servers is not exactly an ideal situation either.

This is an appeal to fear. Many other things cause both local and cluster-wide outages. The risk of using any tool has to be engineered down to an acceptable level. If the risk cannot be mitigated to an appropriate level, other options must be considered.

Also: how does this library operate? can we concentrate sending data to redis in post-send, so that user-perceived latency is not impacted anyways?

This is a great question. As it stands now, no. It would have to be augmented to do this, which eliminates the "use the client library out of the box" pro. I would even go so far as to say that the lack of this feature disqualifies out-of-the-box use in favor of other options.
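
To sketch the kind of augmentation I mean (assuming PHP-FPM's fastcgi_finish_request() or an equivalent deferred-update hook; none of this is existing library code and all names are illustrative):

```php
<?php
// Buffer metric updates in memory during the request and only talk to Redis
// after the response has been flushed to the client.
$pendingMetrics = [];

// During the request: cheap, in-process only.
$pendingMetrics[] = [ 'op' => 'inc', 'name' => 'edit_failures_total', 'labels' => [ 'enwiki', 'ratelimited' ] ];

// ... build and emit the response ...

if ( function_exists( 'fastcgi_finish_request' ) ) {
	fastcgi_finish_request(); // user-perceived latency ends here
}

// From this point on, a slow Redis only ties up this worker, not the user.
foreach ( $pendingMetrics as $update ) {
	// e.g. hand $update to the client library's Redis adapter
}
```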

DirectFileStore: I see a handful of problems with this approach - specifically, I don't think you'd be able to do what the Ruby client does: you can't easily mmap a file from a request and leave it available for other processes. I think it would also be very expensive to open and close a file repeatedly (as all memory and file descriptors are request-local).

The Ruby client was only inspiration, not a direct clone of the approach. The PHP DirectFileStore implementation opens and closes one file per request. When asked to render the metrics, the per-request files are aggregated together and added to the state file. Writing the per-request metrics file can easily be deferred to the end of the request, but rendering the metrics may be a bit slow.
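
For anyone following along, here is a rough paraphrase of the idea rather than the actual mockup code; paths and the serialization format are illustrative:

```php
<?php
// Each request writes its samples to its own small file; the scrape endpoint
// later folds those files into the aggregate totals.
$dir = '/var/tmp/mw-metrics';
if ( !is_dir( $dir ) ) {
	mkdir( $dir, 0700, true );
}

// End of request: persist this request's samples to a per-request file.
$samples = [ 'edit_failures_total{wiki="enwiki"}' => 3 ];
file_put_contents( $dir . '/' . getmypid() . '.' . uniqid() . '.prom', serialize( $samples ) );

// Scrape time: merge every per-request file into the running totals, then
// remove it so the directory doesn't fill with small files.
$totals = [];
foreach ( glob( $dir . '/*.prom' ) as $file ) {
	$requestSamples = unserialize( (string)file_get_contents( $file ) ) ?: [];
	foreach ( $requestSamples as $name => $value ) {
		$totals[$name] = ( $totals[$name] ?? 0 ) + $value;
	}
	unlink( $file );
}
```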

APCu: We make such heavy use of APCu that I'm wary of interfering with the normal operation of the wikis, as we've seen scalability problems related to both APC usage and fragmentation. We'd need to see some numbers on the expected rate of writes (basically, how many writes we can expect per MediaWiki request) before we can consider it.

Others and I share your concern about using APCu: not only the rate of use, but also the way the library uses it doesn't seem to be a great fit. I do not think we should use it for this purpose.

DogStatsD shows some promise here. It's a statsd extension, supported by statsd_exporter, that enables dynamic labels. In testing, the statsd proxy doesn't support the extension, but translation is trivial if necessary.

If we proceed with DogStatsD, I could see statsd output templated as mediawiki.<extension>(.<labelValue>)+.<metric> and Prometheus metrics as mediawiki_<extension>_<name>, with labels and values in the appropriate places.
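
For example (names invented), a single DogStatsD line and the Prometheus series it could translate to, assuming statsd_exporter's default dot-to-underscore mapping and DogStatsD tag parsing:

```
# One edit-failure increment as MediaWiki might emit it with DogStatsD tags:
mediawiki.ratelimiter.edit_failures:1|c|#wiki:enwiki,reason:ratelimited

# What statsd_exporter would expose for it:
mediawiki_ratelimiter_edit_failures{wiki="enwiki",reason="ratelimited"} 1
```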

If we want to do it, we will still need to adapt all of the current usage of statsd in MediaWiki and decide on some standard labels to apply across the code.

I think what we need is a precise survey of what MediaWiki emits now, and from where.

Change 585032 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[mediawiki/core@master] Metrics: Implement and enable statsd-exporter compatible Metrics interface

https://gerrit.wikimedia.org/r/585032

lmata added subscribers: AMooney, lmata.

Hi @AMooney, I'd like to present this patch as the other of the two I was hoping to bring to your attention for the next clinic duty... Please let me know if/how to proceed. Thanks!

AMooney raised the priority of this task from Medium to High. Mar 18 2021, 6:56 PM

@lmata, this needs PET code review, correct?