
Create a per-release deployment of statsd-exporter for mw-on-k8s
Open, In Progress, High, Public

Description

Define a second deployment named $namespace.$environment.$release-statsd

  • With at least 2 replicas
  • Reusing base.statsd.container, base.statsd.volume and base.statsd.configmap for convenience. This already includes the external port for prometheus to scrape when statsd is enabled in values.yaml.
  • Adding a UDP port reachable from the mediawiki pod's local envoy

Add configuration to mediawiki's local envoy to listen on UDP localhost:9125 (matching $wgStatsTarget = 'udp://localhost:9125') and upstream to $namespace.$environment.$release-statsd on the previously defined port.
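The envoy side of this could look roughly like the following UDP proxy listener. This is a hedged sketch, not the actual deployment-charts template: the listener/cluster names, the upstream address, and the exact field layout (which varies between Envoy versions) are all illustrative.

```yaml
# Illustrative only: an Envoy UDP proxy listener on localhost:9125 forwarding
# to a statsd cluster. Names and addresses are placeholders, not real chart values.
static_resources:
  listeners:
    - name: statsd_udp
      address:
        socket_address:
          protocol: UDP
          address: 127.0.0.1
          port_value: 9125
      listener_filters:
        - name: envoy.filters.udp_listener.udp_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.udp.udp_proxy.v3.UdpProxyConfig
            stat_prefix: statsd
            cluster: statsd_exporter   # field layout differs across Envoy versions
  clusters:
    - name: statsd_exporter
      type: STRICT_DNS
      connect_timeout: 0.25s
      load_assignment:
        cluster_name: statsd_exporter
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: mediawiki.eqiad.main-statsd  # i.e. $namespace.$environment.$release-statsd
                      port_value: 9125
```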

Event Timeline

Clement_Goubert changed the task status from Open to In Progress. May 17 2024, 3:19 PM
Clement_Goubert triaged this task as High priority.
Clement_Goubert created this task.
Clement_Goubert moved this task from Incoming đŸ« to Doing 😎 on the serviceops board.

Change #1032795 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add external statsd-exporter deployment

https://gerrit.wikimedia.org/r/1032795

I might be missing something obvious here, but I've two questions:

  • Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yamls?
  • Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?

I can't authoritatively answer the questions, though IIRC from my chat with @Clement_Goubert on using mesh/envoy vs. not, it was for symmetry with the rest of mw. To be clear: I don't feel strongly either way; whichever is best practice in this case works for me (ditto for the first question FWIW)

I might be missing something obvious here, but I've two questions:

  • Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yamls?

Because I threw that proposal and associated WIP patch together in an afternoon, and didn't think about this solution. It would probably be cleaner to do it this way, yes.

  • Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?

As @fgiunchedi said, this seemed quicker as it removed the need to change mediawiki-config as well as the image and php-fpm config. You're right, we can re-use the pattern used for the memcached daemonset: passing the statsd address through an environment variable to php-fpm and catching it in mediawiki-config.

I'd like a little more input on this before proceeding, though. @akosiaris, what do you think?

Thanks for your help!

As we discussed on IRC, namespace isn't the right term. This came from a fundamental misunderstanding about how MW versions are handled. Given that we live in a MWMultiVersion world, we need to have MediaWiki direct the metrics traffic to the appropriate statsd-exporter instance based on group.

The core problem is that when a metric signature changes due to a code change, statsd-exporter will drop metrics for a metric name if a mismatched signature is already resident in memory (metric signature = name + label keys).

In order to support multiple versions in production and not drop metrics due to a race condition, we need statsd-exporter instances per group.

In addition, the deployment tool must restart the group's statsd-exporter instance to clear any previous metric instances resident in memory. T359497

Hmm, this complicates things.

We have 2 hypotheses in next year's APP to see if we can start replacing multiversion mediawiki with something differently implemented that achieves a similar end goal (Wikipedias being hit with new technical changes on Thursdays). It's something that has grown organically to match our older infrastructure, and it no longer makes that much sense in our newer infrastructure.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node. For the pods to connect to it, we'll just have to pass down the node IP as an env variable to php-fpm, using the Downward API to pass status.hostIP.
  2. This has the added advantage of reproducing the load pattern we have on bare metal, of course
  3. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI
    2. Institute development policies that will prevent the problem from happening regularly
  4. Is there any good reason why we shouldn't set a short-ish TTL on metrics in statsd-exporter? I don't think restarting it with every deployment is really a great idea; I'd rather spend a bit more CPU and ensure we don't lose metrics for more than a minute during deployments at worst.
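For reference, the Downward API part of point 1 is a small pod-spec fragment. This is a hedged sketch; the container and variable names are illustrative placeholders, not the actual chart template:

```yaml
# Illustrative pod-spec fragment: expose the node's IP to php-fpm via the
# Kubernetes Downward API. The env var name STATSD_HOST is a placeholder.
containers:
  - name: mediawiki-app
    env:
      - name: STATSD_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
```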

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

For the pods to connect to it, we'll just have to pass down the node ip as an env variable to php-fpm, using the downward api to pass status.hostIP.

This will require the pods to use hostNetwork: true, which isn't a good idea. The only cases where hostNetwork: true makes sense are CNIs and generally workloads that need to mess with the host's networking configuration; e.g. the only ones I see in deployment-charts are calico stuff. There is support in a few other charts (e.g. cert-manager, jaeger, etc.) but it is disabled in our infra. Furthermore, our PSP (and I assume our newest PSS) won't allow that anywhere else than kube-system (and that's a very good idea).

On a more general note, if one ever finds themselves turning to the Downward API for communication reasons, they are reaching for the wrong tool in their toolbelt.

There is an alternative, and it's the one we went for with the mcrouter daemonset: a Service with internalTrafficPolicy: Local. We can even hardcode the IP address to avoid DNS requests.
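As a sketch of that alternative (names and ports are illustrative, mirroring the mcrouter daemonset pattern rather than copying its actual manifest):

```yaml
# Illustrative Service that keeps UDP traffic on the local node via
# internalTrafficPolicy: Local (Kubernetes 1.21+). Names/ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: statsd-exporter
spec:
  selector:
    app: statsd-exporter
  internalTrafficPolicy: Local
  ports:
    - name: statsd-udp
      protocol: UDP
      port: 9125
      targetPort: 9125
```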

  1. This has the added advantage of reproducing the load pattern we have on bare metal, of course

Indeed, but I'll point out that exploring in Grafana the metrics currently emitted via statsd-exporter on our current bare metal is proving to be difficult. And we are talking about a 1:6 ratio (legacy vs wikikube hardware nodes) currently? I am unsure this will actually manage to scale.

  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI
    2. Institute development policies that will prevent the problem from happening regularly

Agreed on this one. Correct me if I am wrong, but we don't currently (statsd.eqiad.wmnet) differentiate by group; I'd rather we didn't start to differentiate because of some behavior of statsd-exporter.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

I am still unclear on when metrics are sent directly to statsd.eqiad.wmnet or to the prometheus-statsd-exporter through envoy. Is this a transitional state towards using prometheus to collect metrics and both are required for now, or are some metrics sent through one path and others through the second? I think the scope of this task is to solve the second path only (MediaWiki -> statsd-exporter -> prometheus).

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

I am still unclear on when metrics are sent directly to statsd.eqiad.wmnet or to the prometheus-statsd-exporter through envoy. Is this a transitional state towards using prometheus to collect metrics and both are required for now, or are some metrics sent through one path and others through the second? I think the scope of this task is to solve the second path only (MediaWiki -> statsd-exporter -> prometheus).

MW metrics are being sent to both right now, using different code paths (and configuration) of course. There are probably discrepancies between the 2 sets of metrics because we are using different protocols; e.g. in prometheus we have histograms, while in graphite (statsd.eqiad.wmnet) we have timings, with different functionalities.

The task is indeed about the second path only, but it's always useful to remember what it is we are migrating away from. I think it will help us to avoid over engineering and solving problems that might not be there or matter much.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

Just to go on record, we did some basic napkin math with @fgiunchedi and we can't really support this number of instances given the sheer cardinality of the mediawiki metrics. So we will need to create a deployment, as proposed previously.

Indeed, but I'll point out that exploring in Grafana the metrics currently emitted via statsd-exporter on our current bare metal is proving to be difficult. And we are talking about a 1:6 ratio (legacy vs wikikube hardware nodes) currently? I am unsure this will actually manage to scale.

See above, indeed it won't :)

Agreed on this one. Correct me if I am wrong, but we don't currently (statsd.eqiad.wmnet) differentiate by group; I'd rather we didn't start to differentiate because of some behavior of statsd-exporter.

We have quite a few problems around this, I'll write more extensively about it below.

After a lot of back and forth between solutions all having downsides, I think a reasonable solution, as long as we're only interested in aggregates of metrics and know single-instance values are meaningless, is:

  1. Create a simple statsd-exporter chart that deploys statsd exporter with a ClusterIP service
  2. Add a deployment of this chart to each mediawiki namespace. I would probably go with a release in the main helmfile, but maybe a separate helmfile is better
  3. Make mediawiki connect to the statsd-exporter service via the ClusterIP
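A hedged sketch of what step 2 could look like as a release in the main helmfile; the chart repository, release name, namespace, and values file below are all illustrative placeholders:

```yaml
# Illustrative helmfile fragment: a statsd-exporter release deployed alongside
# the main mediawiki release in a deployment's helmfile. All names are placeholders.
releases:
  - name: statsd
    namespace: mw-web
    chart: wmf-stable/statsd-exporter
    values:
      - statsd-values.yaml
```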

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node. For the pods to connect to it, we'll just have to pass down the node IP as an env variable to php-fpm, using the Downward API to pass status.hostIP.
  2. This has the added advantage of reproducing the load pattern we have on bare metal, of course
  3. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

I've looked into this, and indeed Prometheus will ingest metrics with inconsistent sets of labels, though that's quite discouraged and I basically had to serve a plaintext file. AFAICT what happens instead with the golang prometheus client (e.g. node-exporter) is that missing labels will get an empty value; e.g. this in a .prom file:

test_foo{bar="baz"} 1
test_foo{bar="baz", meh="baz"} 1

Will result in this being served by node-exporter:

test_foo{bar="baz",meh=""} 1
test_foo{bar="baz", meh="baz"} 1
    2. Institute development policies that will prevent the problem from happening regularly
  4. Is there any good reason why we shouldn't set a short-ish TTL on metrics in statsd-exporter? I don't think restarting it with every deployment is really a great idea; I'd rather spend a bit more CPU and ensure we don't lose metrics for more than a minute during deployments at worst.

To be clear I was under the impression that "restart" in this context meant k8s' idea of a restart, i.e. progressively recycling pods so there's always capacity available.

After a lot of back and forth between solutions all having downsides, I think a reasonable solution, as long as we're only interested in aggregates of metrics and know single-instance values are meaningless, is:

  1. Create a simple statsd-exporter chart that deploys statsd exporter with a ClusterIP service
  2. Add a deployment of this chart to each mediawiki namespace. I would probably go with a release in the main helmfile, but maybe a separate helmfile is better
  3. Make mediawiki connect to the statsd-exporter service via the ClusterIP

Given the solution space we've explored, this sounds like a viable option to me! To the point of "single instance values are meaningless": this is already true with statsd, since we're aggregating centrally; in other words, we're already used to/expecting mw stats to be meaningless on a per-host basis

  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

A per-group statsd-exporter is necessary only if we continue to use statsd-exporter v0.9.0. Upstream enabled the ability to have inconsistent label sets in v0.10.2. I suggest we upgrade statsd-exporter and eliminate the need for per-group exporters: T302373: Upgrade prometheus-statsd-exporter

Please express your support for a TTL or other solution on T359497.
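For context, statsd_exporter supports expiring stale metric instances via a `ttl` in its mapping configuration, either globally under `defaults` or per mapping. A minimal sketch, with an illustrative TTL value and a hypothetical match rule:

```yaml
# Illustrative statsd_exporter mapping config: a global TTL so metric instances
# with stale signatures expire instead of requiring an exporter restart.
defaults:
  ttl: 5m                    # placeholder value, tune to deployment cadence
mappings:
  - match: "MediaWiki.*"
    name: "mediawiki_${1}"   # hypothetical mapping rule
```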

  1. Institute development policies that will prevent the problem from happening regularly

I feel that generating development policies around this should be avoided if possible.

Change #1039171 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] Add new chart statsd-exporter

https://gerrit.wikimedia.org/r/1039171

  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

A per-group statsd-exporter is necessary only if we continue to use statsd-exporter v0.9.0. Upstream enabled the ability to have inconsistent label sets in v0.10.2. I suggest we upgrade statsd-exporter and eliminate the need for per-group exporters: T302373: Upgrade prometheus-statsd-exporter

This is great! I'll be resuming the work on the upgrade

Change #1039233 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] statsd: add deployment to mw-debug (codfw only)

https://gerrit.wikimedia.org/r/1039233

Change #1039234 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-debug: add statsd service everywhere

https://gerrit.wikimedia.org/r/1039234

Change #1032795 abandoned by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Add external statsd-exporter deployment

Reason:

Abandoned in favor of I7f589fa1e3ace14cce4b8df74d5adae3c3ccb68b

https://gerrit.wikimedia.org/r/1032795

Change #1039779 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-debug: start using php.envvars, expose statsd-exporter

https://gerrit.wikimedia.org/r/1039779

Change #1039171 merged by jenkins-bot:

[operations/deployment-charts@master] Add new chart statsd-exporter

https://gerrit.wikimedia.org/r/1039171

Change #1039233 merged by jenkins-bot:

[operations/deployment-charts@master] statsd: add deployment to mw-debug (codfw only)

https://gerrit.wikimedia.org/r/1039233

Change #1039234 merged by Giuseppe Lavagetto:

[operations/deployment-charts@master] mw-debug: add statsd service everywhere

https://gerrit.wikimedia.org/r/1039234

Change #1039779 merged by Giuseppe Lavagetto:

[operations/deployment-charts@master] mw-debug: start using php.envvars, expose statsd-exporter

https://gerrit.wikimedia.org/r/1039779

Change #1041656 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] Use the statsd-exporter service where available

https://gerrit.wikimedia.org/r/1041656

Change #1041656 merged by jenkins-bot:

[operations/mediawiki-config@master] Use the statsd-exporter service where available

https://gerrit.wikimedia.org/r/1041656

Mentioned in SAL (#wikimedia-operations) [2024-06-12T14:34:40Z] <oblivian@deploy1002> Started scap: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-12T14:37:12Z] <oblivian@deploy1002> oblivian: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T14:46:45Z] <oblivian@deploy1002> Finished scap: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] (duration: 12m 05s)

Change #1043602 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-parsoid: enable statsd service for mw-parsoid

https://gerrit.wikimedia.org/r/1043602

Change #1043603 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-parsoid: send statsd stats to the statsd services

https://gerrit.wikimedia.org/r/1043603

Change #1043602 merged by jenkins-bot:

[operations/deployment-charts@master] mw-parsoid: enable statsd service for mw-parsoid

https://gerrit.wikimedia.org/r/1043602

Change #1043704 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Deploy statsd exporter

https://gerrit.wikimedia.org/r/1043704

Change #1043705 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-jobrunner: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043705

Change #1043706 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-ext: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043706

Change #1043707 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043707

Change #1043708 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043708

Change #1043603 merged by jenkins-bot:

[operations/deployment-charts@master] mw-parsoid: send statsd stats to the statsd services

https://gerrit.wikimedia.org/r/1043603

Change #1043704 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Deploy statsd exporter

https://gerrit.wikimedia.org/r/1043704

Mentioned in SAL (#wikimedia-operations) [2024-06-18T10:27:29Z] <cgoubert@deploy1002> Started scap: Deploy statsd exporter - T365265

Mentioned in SAL (#wikimedia-operations) [2024-06-18T10:30:39Z] <cgoubert@deploy1002> Finished scap: Deploy statsd exporter - T365265 (duration: 03m 39s)

statsd-exporter is now deployed on all MW-on-K8s deployments, but mediawiki is not yet configured to send data through it except for mw-parsoid and mw-debug.

image.png (863×2 px, 166 KB)

I think we may want to enable it deployment per deployment so SRE Observability can monitor the load on prometheus. @colewhite or @herron can we coordinate on this?

I think we may want to enable it deployment per deployment so SRE Observability can monitor the load on prometheus. @colewhite or @herron can we coordinate on this?

Sounds good to me, and yes, happy to coordinate and keep an eye on prometheus post-deployment. Essentially any time that overlaps with Eastern TZ working hours works for me; maybe it's too late for today, but how about early next week?

Sure, we can start next Monday around 14:00 UTC if that works for you?

Mentioned in SAL (#wikimedia-operations) [2024-06-24T15:11:29Z] <claime> Enabling statsd-exporter on mw-jobrunner - T365265

Change #1043705 merged by jenkins-bot:

[operations/deployment-charts@master] mw-jobrunner: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043705