
Create a per-release deployment of statsd-exporter for mw-on-k8s
Open, In Progress, High, Public

Description

Define a second deployment named $namespace.$environment.$release-statsd

  • With at least 2 replicas
  • Reusing base.statsd.container, base.statsd.volume and base.statsd.configmap for convenience. This already includes the external port for prometheus to scrape when statsd is enabled in values.yaml.
  • Adding a UDP port reachable from the mediawiki pod's local envoy

Add configuration to mediawiki's local envoy to listen on UDP localhost:9125 (matching $wgStatsTarget = 'udp://localhost:9125') and upstream to $namespace.$environment.$release-statsd on the previously defined port.
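The envoy side of this could look roughly like the following UDP proxy listener. This is a hedged sketch, not the actual deployment-charts template: the listener/cluster names, the upstream address, and the exact field layout (which varies between Envoy versions) are all illustrative.

```yaml
# Illustrative only: an Envoy UDP proxy listener on localhost:9125 forwarding
# to a statsd cluster. Names and addresses are placeholders, not real chart values.
static_resources:
  listeners:
    - name: statsd_udp
      address:
        socket_address:
          protocol: UDP
          address: 127.0.0.1
          port_value: 9125
      listener_filters:
        - name: envoy.filters.udp_listener.udp_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.udp.udp_proxy.v3.UdpProxyConfig
            stat_prefix: statsd
            cluster: statsd_exporter   # field layout differs across Envoy versions
  clusters:
    - name: statsd_exporter
      type: STRICT_DNS
      connect_timeout: 0.25s
      load_assignment:
        cluster_name: statsd_exporter
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: mediawiki.eqiad.main-statsd  # i.e. $namespace.$environment.$release-statsd
                      port_value: 9125
```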

Event Timeline

Clement_Goubert changed the task status from Open to In Progress. May 17 2024, 3:19 PM
Clement_Goubert triaged this task as High priority.
Clement_Goubert created this task.
Clement_Goubert moved this task from Incoming đŸ« to Doing 😎 on the serviceops board.

Change #1032795 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add external statsd-exporter deployment

https://gerrit.wikimedia.org/r/1032795

I might be missing something obvious here, but I've two questions:

  • Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yamls?
  • Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?

I can't authoritatively answer the questions, though IIRC from my chat with @Clement_Goubert on using mesh/envoy vs. not, it was for symmetry with the rest of mw. To be clear: I don't feel strongly either way; whichever is best practice in this case works for me (ditto for the first question FWIW)

I might be missing something obvious here, but I've two questions:

  • Why add the statsd deployment to the mediawiki chart instead of using a statsd chart and adding a statsd release to the mediawiki helmfile.yamls?

Because I threw that proposal and associated WIP patch together in an afternoon, and didn't think about this solution. It would probably be cleaner to do it this way, yes.

  • Why do we need to tunnel statsd through the local envoy? Can't mediawiki use $namespace.$environment.$release-statsd directly?

As @fgiunchedi said, this seemed quicker as it removed the need to change mediawiki-config as well as the image and php-fpm config. You're right, we can re-use the pattern used for the memcached daemonset: passing the statsd address through an environment variable to php-fpm and catching it in mediawiki-config.

I'd like a little more input on this before proceeding, though. @akosiaris, what do you think?

Thanks for your help!

As we discussed on IRC, namespace isn't the right term. This came from a fundamental misunderstanding about how MW versions are handled. Given that we live in a MWMultiVersion world, we need to have MediaWiki direct the metrics traffic to the appropriate statsd-exporter instance based on group.

The core problem is that when a metric signature changes due to a code change, statsd-exporter will drop metrics for a metric name if a mismatched signature is already resident in memory (metric signature = name + label keys).

In order to support multiple versions in production and not drop metrics due to a race condition, we need statsd-exporter instances per group.

In addition, the deployment tool must restart the group's statsd-exporter instance to clear any previous metric instances resident in memory. T359497

Hmm, this complicates things.

We have 2 hypotheses in next year's APP to see if we can start replacing multiversion mediawiki with something differently implemented that achieves a similar end goal (Wikipedias being hit with new technical changes on Thursdays). It's something that has grown organically to match our older infrastructure, and it no longer makes that much sense in our newer infrastructure.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node. For the pods to connect to it, we'll just have to pass down the node IP as an env variable to php-fpm, using the Downward API to pass status.hostIP.
  2. This has the added advantage of reproducing the load pattern we have on bare metal, of course
  3. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI
    2. Institute development policies that will prevent the problem from happening regularly
  4. Is there any good reason why we shouldn't set a short-ish TTL on metrics in statsd-exporter? I don't think restarting it with every deployment is really a great idea; I'd rather spend a bit more CPU and ensure we don't lose metrics for more than a minute during deployments at worst.
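For reference, the Downward API part of point 1 is a small pod-spec fragment. This is a hedged sketch; the container and variable names are illustrative placeholders, not the actual chart template:

```yaml
# Illustrative pod-spec fragment: expose the node's IP to php-fpm via the
# Kubernetes Downward API. The env var name STATSD_HOST is a placeholder.
containers:
  - name: mediawiki-app
    env:
      - name: STATSD_HOST
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
```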

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

For the pods to connect to it, we'll just have to pass down the node ip as an env variable to php-fpm, using the downward api to pass status.hostIP.

This will require the pods to use hostNetwork: true, which isn't a good idea. The only cases where hostNetwork: true makes sense are CNIs and generally workloads that need to mess with the host's networking configuration; e.g. the only ones I see in deployment-charts are calico stuff. There is support in a few other charts (e.g. cert-manager, jaeger, etc.) but it is disabled in our infra. Furthermore, our PSP (and I assume our newest PSS) won't allow that anywhere else than kube-system (and that's a very good idea).

On a more general note, if one ever finds themselves turning to the Downward API for communication reasons, they are reaching for the wrong tool in their toolbelt.

There is an alternative, and it's the one we went for with the mcrouter daemonset: a Service with internalTrafficPolicy: Local. We can even hardcode the IP address to avoid DNS requests.
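As a sketch of that alternative (names and ports are illustrative, mirroring the mcrouter daemonset pattern rather than copying its actual manifest):

```yaml
# Illustrative Service that keeps UDP traffic on the local node via
# internalTrafficPolicy: Local (Kubernetes 1.21+). Names/ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: statsd-exporter
spec:
  selector:
    app: statsd-exporter
  internalTrafficPolicy: Local
  ports:
    - name: statsd-udp
      protocol: UDP
      port: 9125
      targetPort: 9125
```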

  1. This has the added advantage of reproducing the load pattern we have on bare metal, of course

Indeed, but I'll point out that exploring in Grafana the metrics currently emitted via statsd-exporter on our current bare metal is proving to be difficult. And we are talking about a 1:6 ratio (legacy vs wikikube hardware nodes) currently? I am unsure this will actually manage to scale.

  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI
    2. Institute development policies that will prevent the problem from happening regularly

Agreed on this one. Correct me if I am wrong, but we don't currently (statsd.eqiad.wmnet) differentiate by group; I'd rather we didn't start to differentiate because of some behavior of statsd-exporter.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

I am still unclear on when metrics are sent directly to statsd.eqiad.wmnet or to the prometheus-statsd-exporter through envoy. Is this a transitional state towards using prometheus to collect metrics and both are required for now, or are some metrics sent through one path and others through the second? I think the scope of this task is to solve the second path only (MediaWiki -> statsd-exporter -> prometheus).

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

I am still unclear on when metrics are sent directly to statsd.eqiad.wmnet or to the prometheus-statsd-exporter through envoy. Is this a transitional state towards using prometheus to collect metrics and both are required for now, or are some metrics sent through one path and others through the second? I think the scope of this task is to solve the second path only (MediaWiki -> statsd-exporter -> prometheus).

MW metrics are being sent to both right now, using different code paths (and configuration) of course. There are probably discrepancies between the 2 sets of metrics because we are using different protocols; e.g. in prometheus we have histograms, while in graphite (statsd.eqiad.wmnet) we have timings, with different functionalities.

The task is indeed about the second path only, but it's always useful to remember what it is we are migrating away from. I think it will help us to avoid over engineering and solving problems that might not be there or matter much.

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node.

Just to note that we don't need this if we want to keep the current status quo. We are in a fire and forget situation right now. Everything UDP fires-and-forgets to statsd.eqiad.wmnet (with hardcoded IP to avoid DNS requests). If we want to increase the reliability of that pathway, we should at least have a conversation as to why that is needed (e.g. are we experiencing issues with missing metrics currently?)

Just to go on record, we did some basic napkin math with @fgiunchedi and we can't really support this number of instances given the sheer cardinality of the mediawiki metrics. So we will need to create a deployment, as proposed previously.

Indeed, but I'll point out that exploring in Grafana the metrics currently emitted via statsd-exporter on our current bare metal is proving to be difficult. And we are talking about a 1:6 ratio (legacy vs wikikube hardware nodes) currently? I am unsure this will actually manage to scale.

See above, indeed it won't :)

Agreed on this one. Correct me if I am wrong, but we don't currently (statsd.eqiad.wmnet) differentiate by group; I'd rather we didn't start to differentiate because of some behavior of statsd-exporter.

We have quite a few problems around this, I'll write more extensively about it below.

After a lot of back and forth between solutions all having downsides, I think a reasonable solution, as long as we're only interested in aggregates of metrics and know single-instance values are meaningless, is:

  1. Create a simple statsd-exporter chart that deploys statsd exporter with a ClusterIP service
  2. Add a deployment of this chart to each mediawiki namespace. I would probably go with a release in the main helmfile, but maybe a separate helmfile is better
  3. Make mediawiki connect to the statsd-exporter service via the ClusterIP
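A hedged sketch of what step 2 could look like as a release in the main helmfile; the chart repository, release name, namespace, and values file below are all illustrative placeholders:

```yaml
# Illustrative helmfile fragment: a statsd-exporter release deployed alongside
# the main mediawiki release in a deployment's helmfile. All names are placeholders.
releases:
  - name: statsd
    namespace: mw-web
    chart: wmf-stable/statsd-exporter
    values:
      - statsd-values.yaml
```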

A few thoughts on this:

  1. I think using daemonsets is a better option than a deployment. It will ensure UDP fire-and-forget communication (which is in itself scary in k8s with all the layers of indirection) happens locally to the physical node. For the pods to connect to it, we'll just have to pass down the node IP as an env variable to php-fpm, using the Downward API to pass status.hostIP.
  2. This has the added advantage of reproducing the load pattern we have on bare metal, of course
  3. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

I've looked into this, and indeed Prometheus will ingest metrics with inconsistent sets of labels, though that's quite discouraged and I basically had to serve a plaintext file. AFAICT what happens instead with the golang prometheus client (e.g. node-exporter) is that missing labels will get an empty value; e.g. this in a .prom file:

test_foo{bar="baz"} 1
test_foo{bar="baz", meh="baz"} 1

Will result in this being served by node-exporter:

test_foo{bar="baz",meh=""} 1
test_foo{bar="baz", meh="baz"} 1
    2. Institute development policies that will prevent the problem from happening regularly
  4. Is there any good reason why we shouldn't set a short-ish TTL on metrics in statsd-exporter? I don't think restarting it with every deployment is really a great idea; I'd rather spend a bit more CPU and ensure we don't lose metrics for more than a minute during deployments at worst.

To be clear I was under the impression that "restart" in this context meant k8s' idea of a restart, i.e. progressively recycling pods so there's always capacity available.

After a lot of back and forth between solutions all having downsides, I think a reasonable solution, as long as we're only interested in aggregates of metrics and know single-instance values are meaningless, is:

  1. Create a simple statsd-exporter chart that deploys statsd exporter with a ClusterIP service
  2. Add a deployment of this chart to each mediawiki namespace. I would probably go with a release in the main helmfile, but maybe a separate helmfile is better
  3. Make mediawiki connect to the statsd-exporter service via the ClusterIP

Given the solution space we've explored, this sounds like a viable option to me! To the point of "single instance values are meaningless": this is already true with statsd, since we're aggregating centrally; in other words, we're already used to/expecting mw stats to be meaningless on a per-host basis

  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

A per-group statsd-exporter is necessary only if we continue to use statsd-exporter v0.9.0. Upstream enabled the ability to have inconsistent label sets in v0.10.2. I suggest we upgrade statsd-exporter and eliminate the need for per-group exporters: T302373: Upgrade prometheus-statsd-exporter

Please express your support for a TTL or other solution on T359497.
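For context, statsd_exporter supports expiring stale metric instances via a `ttl` in its mapping configuration, either globally under `defaults` or per mapping. A minimal sketch, with an illustrative TTL value and a hypothetical match rule:

```yaml
# Illustrative statsd_exporter mapping config: a global TTL so metric instances
# with stale signatures expire instead of requiring an exporter restart.
defaults:
  ttl: 5m                    # placeholder value, tune to deployment cadence
mappings:
  - match: "MediaWiki.*"
    name: "mediawiki_${1}"   # hypothetical mapping rule
```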

  1. Institute development policies that will prevent the problem from happening regularly

I feel that generating development policies around this should be avoided if possible.

Change #1039171 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] Add new chart statsd-exporter

https://gerrit.wikimedia.org/r/1039171

  1. As for one release per MediaWiki version, it seems to me like it could be pretty wasteful: almost 90% of the metrics traffic is from group2, for instance. We'll need to run some maths on the amount of resources we'd need, but I think we should explore alternatives, specifically:
    1. Look at contributing to upstream statsd-exporter to add a switch allowing to not discard metrics with new signatures, given prometheus won't complain about it AIUI

A per-group statsd-exporter is necessary only if we continue to use statsd-exporter v0.9.0. Upstream enabled the ability to have inconsistent label sets in v0.10.2. I suggest we upgrade statsd-exporter and eliminate the need for per-group exporters: T302373: Upgrade prometheus-statsd-exporter

This is great! I'll be resuming the work on the upgrade

Change #1039233 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] statsd: add deployment to mw-debug (codfw only)

https://gerrit.wikimedia.org/r/1039233

Change #1039234 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-debug: add statsd service everywhere

https://gerrit.wikimedia.org/r/1039234

Change #1032795 abandoned by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Add external statsd-exporter deployment

Reason:

Abandoned in favor of I7f589fa1e3ace14cce4b8df74d5adae3c3ccb68b

https://gerrit.wikimedia.org/r/1032795

Change #1039779 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-debug: start using php.envvars, expose statsd-exporter

https://gerrit.wikimedia.org/r/1039779

Change #1039171 merged by jenkins-bot:

[operations/deployment-charts@master] Add new chart statsd-exporter

https://gerrit.wikimedia.org/r/1039171

Change #1039233 merged by jenkins-bot:

[operations/deployment-charts@master] statsd: add deployment to mw-debug (codfw only)

https://gerrit.wikimedia.org/r/1039233

Change #1039234 merged by Giuseppe Lavagetto:

[operations/deployment-charts@master] mw-debug: add statsd service everywhere

https://gerrit.wikimedia.org/r/1039234

Change #1039779 merged by Giuseppe Lavagetto:

[operations/deployment-charts@master] mw-debug: start using php.envvars, expose statsd-exporter

https://gerrit.wikimedia.org/r/1039779

Change #1041656 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] Use the statsd-exporter service where available

https://gerrit.wikimedia.org/r/1041656

Change #1041656 merged by jenkins-bot:

[operations/mediawiki-config@master] Use the statsd-exporter service where available

https://gerrit.wikimedia.org/r/1041656

Mentioned in SAL (#wikimedia-operations) [2024-06-12T14:34:40Z] <oblivian@deploy1002> Started scap: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-12T14:37:12Z] <oblivian@deploy1002> oblivian: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T14:46:45Z] <oblivian@deploy1002> Finished scap: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] (duration: 12m 05s)

Change #1043602 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-parsoid: enable statsd service for mw-parsoid

https://gerrit.wikimedia.org/r/1043602

Change #1043603 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-parsoid: send statsd stats to the statsd services

https://gerrit.wikimedia.org/r/1043603

Change #1043602 merged by jenkins-bot:

[operations/deployment-charts@master] mw-parsoid: enable statsd service for mw-parsoid

https://gerrit.wikimedia.org/r/1043602

Change #1043704 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Deploy statsd exporter

https://gerrit.wikimedia.org/r/1043704

Change #1043705 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-jobrunner: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043705

Change #1043706 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-ext: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043706

Change #1043707 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043707

Change #1043708 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043708

Change #1043603 merged by jenkins-bot:

[operations/deployment-charts@master] mw-parsoid: send statsd stats to the statsd services

https://gerrit.wikimedia.org/r/1043603

Change #1043704 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Deploy statsd exporter

https://gerrit.wikimedia.org/r/1043704

Mentioned in SAL (#wikimedia-operations) [2024-06-18T10:27:29Z] <cgoubert@deploy1002> Started scap: Deploy statsd exporter - T365265

Mentioned in SAL (#wikimedia-operations) [2024-06-18T10:30:39Z] <cgoubert@deploy1002> Finished scap: Deploy statsd exporter - T365265 (duration: 03m 39s)

statsd-exporter is now deployed on all MW-on-K8s deployments, but mediawiki is not yet configured to send data through it except for mw-parsoid and mw-debug.

image.png (863×2 px, 166 KB)

I think we may want to enable it deployment per deployment so SRE Observability can monitor the load on prometheus. @colewhite or @herron can we coordinate on this?

I think we may want to enable it deployment per deployment so SRE Observability can monitor the load on prometheus. @colewhite or @herron can we coordinate on this?

Sounds good to me, and yes, happy to coordinate and keep an eye on prometheus post-deployment. Essentially any time that overlaps with Eastern TZ working hours works for me; maybe it's too late for today, but how about early next week?

Sure, we can start next Monday around 14:00 UTC if that works for you?

Mentioned in SAL (#wikimedia-operations) [2024-06-24T15:11:29Z] <claime> Enabling statsd-exporter on mw-jobrunner - T365265

Change #1043705 merged by jenkins-bot:

[operations/deployment-charts@master] mw-jobrunner: send statsd data to the exporter

https://gerrit.wikimedia.org/r/1043705