
StatsD Exporter: gracefully handle metric signature changes
Open, MediumPublic

Description

A metric signature is the combination of a metric's name and its labels. When StatsD Exporter first receives a metric, it registers the metric under a computed signature and reuses that registration until it is restarted.

When a conflicting signature is encountered, StatsD Exporter will drop the metric. This can happen when a metric name stays the same but a label is renamed or added.

StatsD Exporter does not indicate which metric is in conflict, but does increment statsd_exporter_events_conflict_total.
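To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not StatsD Exporter's actual implementation: the `SignatureRegistry` class, the metric name `mediawiki_requests_total`, and the label names are all hypothetical, and `conflicts_total` only stands in for statsd_exporter_events_conflict_total.

```python
from typing import Dict, FrozenSet


class SignatureRegistry:
    """Toy model: the first label set seen for a name wins until restart."""

    def __init__(self) -> None:
        # metric name -> label names registered on first sight
        self._label_names: Dict[str, FrozenSet[str]] = {}
        # stand-in for statsd_exporter_events_conflict_total
        self.conflicts_total = 0

    def observe(self, name: str, labels: Dict[str, str]) -> bool:
        """Return True if the event is accepted, False if it is dropped."""
        seen = frozenset(labels)  # only label names matter, not values
        registered = self._label_names.setdefault(name, seen)
        if registered != seen:
            # Conflict: the event is dropped and only a counter moves,
            # so nothing records which metric was affected.
            self.conflicts_total += 1
            return False
        return True


registry = SignatureRegistry()
first = registry.observe("mediawiki_requests_total", {"wiki": "enwiki"})
same = registry.observe("mediawiki_requests_total", {"wiki": "dewiki"})
# A renamed label and an added label both conflict with the registration.
renamed = registry.observe("mediawiki_requests_total", {"site": "enwiki"})
added = registry.observe(
    "mediawiki_requests_total", {"wiki": "enwiki", "dc": "eqiad"}
)
```

In this model a new label value is fine, while a renamed or added label name is dropped until the registry is recreated, which corresponds to restarting the exporter.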

For MediaWiki, we cannot guarantee that we will not add, rename, or remove labels on any given metric once it's in use. This has happened once already.

I see a couple of things we could do:

  1. Link the StatsD Exporter lifecycle to the MediaWiki deployment lifecycle (i.e., restart StatsD Exporter when scap does a deploy)
  2. Configure StatsD Exporter's ttl to something other than 0 (never expire)
  3. <your idea here>
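As a rough illustration of why option 2 could help, the Python sketch below models a registry whose entries expire after a TTL. This is an assumption about the general mechanism, not StatsD Exporter's code: `TtlSignatureRegistry`, the injected `clock`, and the metric and label names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Tuple


class TtlSignatureRegistry:
    """Toy model: a registration is forgotten once it has been idle for ttl."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float]) -> None:
        self._ttl = ttl_seconds  # 0 means never expire
        self._clock = clock      # injected so the example needs no sleeping
        # metric name -> (registered label names, time last accepted)
        self._entries: Dict[str, Tuple[FrozenSet[str], float]] = {}

    def observe(self, name: str, labels: Dict[str, str]) -> bool:
        """Return True if the event is accepted, False if it is dropped."""
        now = self._clock()
        seen = frozenset(labels)
        entry = self._entries.get(name)
        if entry is not None:
            registered, last_seen = entry
            expired = self._ttl > 0 and now - last_seen >= self._ttl
            if not expired and registered != seen:
                return False  # conflict; the old entry is not refreshed
        # New, matching, or expired entry: (re)register with this label set.
        self._entries[name] = (seen, now)
        return True


now = [0.0]  # fake clock, advanced by hand
registry = TtlSignatureRegistry(ttl_seconds=60, clock=lambda: now[0])
old_ok = registry.observe("mediawiki_requests_total", {"wiki": "enwiki"})
now[0] = 10.0
new_dropped = registry.observe("mediawiki_requests_total", {"site": "enwiki"})
now[0] = 70.0  # the old signature has been idle longer than the TTL
new_ok = registry.observe("mediawiki_requests_total", {"site": "enwiki"})
```

The sketch only recovers if events with the old signature stop arriving, and events with the new signature are still dropped until the TTL elapses, so the TTL value bounds how long a changed metric stays dark.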

Event Timeline

fgiunchedi subscribed.

Good point re: statsd_exporter_events_conflict_total. Looking at a mw-on-k8s world, I think linking the statsd-exporter lifecycle to mw seems the easiest? Which also raises the question: maybe that already happens during mw deployments as pods are cycled?

We should also be alerting when the metric above increases. Looking at the stats led me to open T360433: Thumbor statsd-exporter metrics conflicts
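One way to picture the check behind such an alert is the Python sketch below, which compares the counter across two scrapes of the exporter's text-format metrics output. It is only an illustration: the helper `conflict_total`, the sample scrape contents, and the `type` label values are made up, and it assumes sample lines carry no trailing timestamp and that the counter did not reset between scrapes.

```python
def conflict_total(
    exposition: str,
    metric: str = "statsd_exporter_events_conflict_total",
) -> float:
    """Sum every sample of `metric` in a Prometheus text-format scrape."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric:
            total += float(value)
    return total


# Two made-up scrapes taken some time apart.
before = """# TYPE statsd_exporter_events_conflict_total counter
statsd_exporter_events_conflict_total{type="counter"} 3
statsd_exporter_events_conflict_total{type="timer"} 1
"""
after = """# TYPE statsd_exporter_events_conflict_total counter
statsd_exporter_events_conflict_total{type="counter"} 5
statsd_exporter_events_conflict_total{type="timer"} 1
"""

increase = conflict_total(after) - conflict_total(before)
should_alert = increase > 0  # any growth means events were dropped
```

In practice this comparison would live in the monitoring system's alerting rules rather than in a script; the sketch only shows the condition being tested.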

Currently, mw pods ship their own statsd-exporter as a sidecar. This links the statsd-exporter lifecycle with the deployments.

In T359640, we're discussing the possibility of replacing the sidecar with an external service, which would make this an issue on mw-on-k8s too.

I dug in some more and found that statsd-exporter, as of v0.10.2, allows inconsistent label sets. The conflicting signature issue is present in our environment because we're on a very old version.

From talking with others, the TTL-based lifecycle seems like a good first candidate to try. Perhaps a 30d expiration and tune from there.

colewhite triaged this task as Medium priority. Mon, Jul 1, 9:59 PM