
mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops
Open, Needs Triage, Public

Description

With https://gerrit.wikimedia.org/r/q/1204f0de in MediaWiki 1.42/wmf.17 we have started to push what has become quite a big metric (~2M series!): mediawiki_resourceloader_build_seconds_bucket. This is due to an explosion in cardinality from combining extension + buckets + per-host metrics, e.g.

mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="+Inf", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.005", name="user_options", site="codfw"}	1477903
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.01", name="user_options", site="codfw"}	1478057
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.025", name="user_options", site="codfw"}	1478060
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.05", name="user_options", site="codfw"}	1478062
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.1", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.25", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="0.5", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="1", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="10", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="2.5", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="30", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="5", name="user_options", site="codfw"}	1478083
mediawiki_resourceloader_build_seconds_bucket{cluster="api_appserver", instance="mw2261:9112", job="statsd_exporter", le="60", name="user_options", site="codfw"}

This has resulted in ~30k samples/s of additional load on prometheus/ops.

2024-03-08-165146_1464x1782_scrot.png (1,464×1,782 px, 76 KB)

I'm not sure right off the bat how to address the issue, though going forward we should certainly pay extra attention when dealing with histograms in mw, since those make it easy for cardinality to explode. What do you think @colewhite @herron @DAlangi_WMF? (cc @Krinkle since I saw you followed up on the change above)
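
For context on the bucket dimension: with statsd-exporter the buckets for a timing come from its mapping configuration, so trimming/coarsening them per metric is one of the available levers against the buckets × name × host multiplication. A rough sketch only; the statsd match path, label capture and bucket choice are illustrative, and the exact field names differ between statsd-exporter versions:

```yaml
# Hypothetical statsd_exporter mapping entry: map the resourceloader build
# timing to a histogram with a handful of coarse buckets instead of the
# default set. Match path and buckets are illustrative, not our actual config.
mappings:
  - match: "MediaWiki.resourceloader_build_seconds.*"
    name: "mediawiki_resourceloader_build_seconds"
    labels:
      name: "$1"                  # module name captured from the statsd path
    observer_type: histogram      # "timer_type" in older statsd_exporter releases
    histogram_options:
      buckets: [0.01, 0.05, 0.25, 1, 5]   # 5 boundaries instead of 13+
```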

Event Timeline

Can we disable the host-level instance label for MediaWiki's statsd exporter? (Or substitute it with a constant?) I believe that would save 100x, or 2 orders of magnitude. I can't imagine that ever being relevant for service/domain-specific stats from the MediaWiki application. I imagine that, of the hypothetical use cases we don't yet have today, 99% would be covered by site="codfw", if we keep that.

At the infrastructure level this precision is certainly valuable, but at some point the application is its own virtual thing, irrespective of which pod or host executed the code.

When things become diagnostic, there's usually a log message involved where more details are available. Plus, if it did become relevant or easier to approach that way at some point in the future, opting in is trivial for specific stats via any key-value label, so it wouldn't take away something that is hard to regain when needed.
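
As a minimal sketch of what the "application as its own virtual thing" view could look like on the query side (assuming the per-host series keep being ingested for now), a recording rule can sum away instance; note this only helps dashboards and queries, it does not reduce ingestion load:

```yaml
# Hypothetical Prometheus recording rule: aggregate the histogram across hosts.
# The le label is kept, so histogram_quantile() still works on the result.
groups:
  - name: mediawiki_resourceloader
    rules:
      - record: mediawiki_resourceloader_build_seconds_bucket:sum
        expr: sum without (instance) (mediawiki_resourceloader_build_seconds_bucket)
```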

Can we disable the host-level instance label for MediaWiki's statsd exporter? (Or substitute it with a constant?) I believe that would save 100x, or 2 orders of magnitude. I can't imagine that ever being relevant for service/domain-specific stats from the MediaWiki application. I imagine that, of the hypothetical use cases we don't yet have today, 99% would be covered by site="codfw", if we keep that.

The short answer is "not with the current architecture"; meaning that we have statsd-exporter on localhost both to increase the general reliability of udp (i.e. statsd clients only write to local udp sockets) and for maintainability/serviceability (the graphite host right now receives a figurative flood of udp, to the tune of ~100 megabytes/s).

At the infrastructure level this precision is certainly valuable, but at some point the application is its own virtual thing, irrespective of which pod or host executed the code.

I see what you mean, and it is true that so far we haven't had per-host or per-pod mw metrics, and as far as I'm aware that hasn't been a shortcoming (?)

When things become diagnostic, there's usually a log message involved where more details are available. Plus, if it did become relevant or easier to approach that way at some point in the future, opting in is trivial for specific stats via any key-value label, so it wouldn't take away something that is hard to regain when needed.

We (o11y) have brainstormed this issue a little at the offsite, and one partial solution would be to get a dedicated mw Prometheus instance, to at least contain the blast radius.

We'll have to brainstorm a little more, though even with moderately-sized histograms I can see per-pod statsd-exporter not being manageable once we're talking hundreds of pods, let alone with big histograms.

We (o11y) have brainstormed this issue a little at the offsite, and one partial solution would be to get a dedicated mw Prometheus instance, to at least contain the blast radius.

We'll have to brainstorm a little more, though even with moderately-sized histograms I can see per-pod statsd-exporter not being manageable once we're talking hundreds of pods, let alone with big histograms.

I had some focus time to think about this a little more, and will braindump below (i.e. none of this is a firm plan)

Since graphite/statsd timings need to be centralized for percentiles to be meaningful, we've grown accustomed to the fact that mw metrics really aren't per host, and I think that's sensible given the circumstances. Therefore keeping the metrics sort of "centralized" is fine; I'm focusing on two aspects:

mw metrics emission

i.e. how mw metrics are generated and emitted and ultimately end up in prometheus format (i.e. statsd-exporter). For this part I think it makes sense to think about a world where mw is fully on k8s already (and we're getting there fast!). We want to effectively reduce the cardinality due to having many instances of statsd-exporter in k8s (as a sidecar right now).

Thus I'm wondering if a "service" to receive mw metrics (basically a bunch of statsd-exporters) would make sense here; the service would of course consist of a few replicas and receive load-balanced udp statsd traffic from mw (a rough sketch follows the list below).
Further considerations/open questions include:

  • where should the service be deployed? within wikikube or not?
  • and if wikikube, how many "segments" of this service we want to have, e.g. one per mw "deployment" (please excuse the many quotes here! what I mean here is things like mw-api-int, mw-api-ext, etc). The advantage being that metrics will be tagged automatically by mw deployment
  • if wikikube, would the service mesh be able to route/load balance said udp/statsd traffic?
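
To make the "service" idea a bit more concrete, here is a rough sketch assuming a plain ClusterIP Service in front of the statsd-exporter replicas; every name and port below is illustrative:

```yaml
# Hypothetical Service fronting the statsd-exporter replicas: mediawiki pods
# send statsd over UDP to this Service (kube-proxy spreads it across replicas),
# while Prometheus scrapes the exporter port on the pods directly.
apiVersion: v1
kind: Service
metadata:
  name: mw-statsd              # illustrative
  namespace: mw-api-int        # illustrative; one per mw deployment if we segment
spec:
  selector:
    app: statsd-exporter
  ports:
    - name: statsd
      port: 9125
      protocol: UDP
    - name: metrics
      port: 9112
      protocol: TCP
```
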
mw metrics ingestion

This is the part where we ingest mw metrics into Prometheus, and I believe we can tackle it with a mw Prometheus instance, which in turn would only scrape the service(s) above. I'm not 100% convinced yet that we need a separate Prometheus instance, also because it would be kind of a snowflake (i.e. needing k8s access to scrape only the service above, if it runs on k8s)
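
For illustration, assuming the exporters end up running in wikikube, the dedicated mw Prometheus instance would mostly boil down to a scrape job along these lines (namespaces and labels are made up):

```yaml
# Hypothetical scrape job for a dedicated "mw" Prometheus instance: discover
# the statsd-exporter pods via the Kubernetes API and keep only their metrics port.
scrape_configs:
  - job_name: mediawiki-statsd-exporter
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [mw-api-int, mw-api-ext]     # illustrative
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: statsd-exporter
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: metrics
        action: keep
```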

[...]

mw metrics emission

i.e. how mw metrics are generated and emitted and ultimately end up in prometheus format (i.e. statsd-exporter). For this part I think it makes sense to think about a world where mw is fully on k8s already (and we're getting there fast!). We want to effectively reduce the cardinality due to having many instances of statsd-exporter in k8s (as a sidecar right now).

Thus I'm wondering if a "service" to receive mw metrics (basically a bunch of statsd-exporters) would make sense here; the service would of course consist of a few replicas and receive load-balanced udp statsd traffic from mw.
Further considerations/open questions include:

  • where should the service be deployed? within wikikube or not?

Since it's UDP, I think it should be deployed within each mediawiki namespace, especially given the cardinality benefits of doing so.

  • and if wikikube, how many "segments" of this service we want to have, e.g. one per mw "deployment" (please excuse the many quotes here! what I mean here is things like mw-api-int, mw-api-ext, etc). The advantage being that metrics will be tagged automatically by mw deployment

I think the right way to do this is, in the mediawiki chart (rough sketch after this list):

  • Define a second deployment named $namespace.$environment.$release-statsd
    • With at least 2 replicas
    • Reusing base.statsd.container and base.statsd.volume for convenience. This already includes the external port for prometheus to scrape when statsd is enabled in values.yaml.
    • Adding a UDP port reachable from the mediawiki pod's local envoy
  • Add configuration to mediawiki's local envoy to listen on UDP localhost:9125 (matching $wgStatsTarget = 'udp://localhost:9125') and upstream to $namespace.$environment.$release-statsd on the previously defined port.
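
To give a feel for what that second deployment boils down to, here is a very rough sketch that ignores the chart's actual helpers and values schema; every name below is illustrative:

```yaml
# Hypothetical rendered output of the proposed $namespace.$environment.$release-statsd
# deployment (not the real chart templates).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mw-api-int-main-statsd        # i.e. $namespace.$environment.$release-statsd
spec:
  replicas: 2                         # at least 2
  selector:
    matchLabels:
      app: statsd-exporter
  template:
    metadata:
      labels:
        app: statsd-exporter
    spec:
      containers:
        - name: statsd-exporter
          image: docker-registry.wikimedia.org/prometheus-statsd-exporter  # illustrative image ref
          ports:
            - name: statsd
              containerPort: 9125     # UDP from mediawiki's local envoy
              protocol: UDP
            - name: metrics
              containerPort: 9112     # scraped by prometheus
              protocol: TCP
```
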
  • if wikikube, would the service mesh be able to route/load balance said udp/statsd traffic?

Yes, although the load balancing would be done by kube-proxy.

Change #1032795 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add external statsd-exporter deployment

https://gerrit.wikimedia.org/r/1032795