
StatsD Exporter: gracefully handle metric signature changes
Closed, Resolved, Public

Description

A metric signature is the combination of a metric name and its labels. When StatsD Exporter receives a metric, it registers it under a computed signature and reuses that registration until restart.

When a conflicting signature is encountered, StatsD Exporter will drop the metric. This can happen when a metric name stays the same but a label is renamed or added.

StatsD Exporter does not indicate which metric is in conflict, but does increment statsd_exporter_events_conflict_total.
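To make the failure mode concrete, here is a toy model of the behavior described above. It is a sketch only, not the exporter's actual code: the Registry class, its observe method, and the metric name mediawiki_example_total are all made up for illustration. It assumes the registry pins each metric name to the first label set it sees and counts every later mismatch.

```python
from typing import Dict, FrozenSet


class Registry:
    """Toy registry (hypothetical) that pins each metric name to one label set."""

    def __init__(self) -> None:
        # metric name -> label names seen at first registration
        self.label_names: Dict[str, FrozenSet[str]] = {}
        # stands in for statsd_exporter_events_conflict_total
        self.events_conflict_total = 0

    def observe(self, name: str, labels: Dict[str, str]) -> bool:
        """Return True if the sample is accepted, False if it is dropped."""
        keys = frozenset(labels)
        # The first sample for a name registers its label set; later ones must match.
        registered = self.label_names.setdefault(name, keys)
        if registered != keys:
            self.events_conflict_total += 1
            return False
        return True


registry = Registry()
# The first sample registers the signature (name plus the label "wiki").
first = registry.observe("mediawiki_example_total", {"wiki": "enwiki"})
# Same name, but the label was renamed to "site": the sample is dropped.
renamed = registry.observe("mediawiki_example_total", {"site": "enwiki"})
```

Note that the only trace of the dropped sample is the counter, which matches the observation that the conflicting metric itself is never named.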

For MediaWiki, we cannot guarantee that we will not add, rename, or remove labels on any given metric once it's in use. This has happened once already.

I see a couple things we could do:

  1. Link the StatsD Exporter lifecycle to the MediaWiki deployment lifecycle (i.e., restart StatsD Exporter when scap does a deploy)
  2. Configure StatsD Exporter's ttl to something other than 0 (never expire)
  3. <your idea here>

Event Timeline

fgiunchedi subscribed.

Good point re: statsd_exporter_events_conflict_total. Looking at a mw-on-k8s world, I think linking the statsd-exporter lifecycle to mw seems the easiest? Which also raises the question: maybe this already happens during mw deployments as pods are cycled?

We should also be alerting when the metric above increases. Looking at the stats led me to open T360433: Thumbor statsd-exporter metrics conflicts

Currently, mw pods ship their own statsd-exporter as a sidecar. This links the statsd-exporter lifecycle with the deployments.

In T359640, we're discussing the possibility of replacing the sidecar with an external service, which would make this an issue on mw-on-k8s too.

I dug in some more and found that statsd-exporter as of v0.10.2 allows inconsistent label sets. The conflicting signature issue is present in our environment because we're on a very old version.

From talking with others, the TTL-based lifecycle seems like a good first candidate to try. Perhaps start with a 30d expiration and tune from there.
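A rough sketch of why a TTL helps, again as a toy model and not the exporter's implementation. TtlRegistry, observe, and the metric name are hypothetical, and the clock is passed in explicitly so the example does not depend on real time. It assumes a registration expires once it has gone ttl_seconds without an accepted sample, and that a TTL of 0 means never expire.

```python
from typing import Dict, FrozenSet, Optional, Tuple

THIRTY_DAYS = 30 * 24 * 60 * 60  # the proposed 30d TTL, in seconds


class TtlRegistry:
    """Toy registry (hypothetical) whose registrations expire after a TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds  # 0 means never expire
        # metric name -> (label names, time of last accepted sample)
        self.entries: Dict[str, Tuple[FrozenSet[str], float]] = {}
        self.events_conflict_total = 0

    def observe(self, name: str, labels: Dict[str, str], now: float) -> bool:
        """Return True if the sample is accepted, False if it is dropped."""
        keys = frozenset(labels)
        entry: Optional[Tuple[FrozenSet[str], float]] = self.entries.get(name)
        # Treat a registration as gone once it has been idle for the full TTL.
        if entry is not None and self.ttl_seconds > 0:
            if now - entry[1] >= self.ttl_seconds:
                entry = None
        if entry is not None and entry[0] != keys:
            self.events_conflict_total += 1
            return False
        # Register (or refresh) the signature with the current timestamp.
        self.entries[name] = (keys, now)
        return True


reg = TtlRegistry(THIRTY_DAYS)
accepted = reg.observe("mediawiki_example_total", {"wiki": "enwiki"}, now=0)
# One minute later the renamed label still conflicts with the live registration.
early = reg.observe("mediawiki_example_total", {"site": "enwiki"}, now=60)
# After the old signature has been idle for 30 days, the new one can register.
late = reg.observe("mediawiki_example_total", {"site": "enwiki"}, now=THIRTY_DAYS + 60)
```

In this model the new signature is only accepted once the old one has stopped receiving samples for the whole TTL, so a shorter TTL shortens the window in which a changed metric is silently dropped.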

colewhite triaged this task as Medium priority. Jul 1 2024, 9:59 PM

Change #1105971 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: add ttl option to statsd-exporter, set to 30d

https://gerrit.wikimedia.org/r/1105971

Change #1105972 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/deployment-charts@master] statsd-exporter: set ttl to 30d

https://gerrit.wikimedia.org/r/1105972

Change #1105971 merged by Cwhite:

[operations/puppet@production] prometheus: add ttl option to statsd-exporter, set to 30d

https://gerrit.wikimedia.org/r/1105971

Change #1105972 merged by jenkins-bot:

[operations/deployment-charts@master] statsd-exporter: set ttl to 30d

https://gerrit.wikimedia.org/r/1105972

Change #1117638 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/deployment-charts@master] move statsd config to statsd-global, bump statsd chart version

https://gerrit.wikimedia.org/r/1117638

Change #1128471 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] statsd_exporter: bugfix set ttl to associated variable

https://gerrit.wikimedia.org/r/1128471

Change #1117638 merged by jenkins-bot:

[operations/deployment-charts@master] move statsd config to statsd-global, bump statsd chart version

https://gerrit.wikimedia.org/r/1117638

Change #1128471 merged by Cwhite:

[operations/puppet@production] statsd_exporter: bugfix set ttl to associated variable

https://gerrit.wikimedia.org/r/1128471

colewhite claimed this task.

Config deployed!