prometheus-statsd-exporter failure to start due to invalid yaml config
Closed, Resolved (Public)

Description

While investigating puppet failures in T299468 I noticed statsd-exporter couldn't parse its config anymore, and indeed it looks like certain yaml fields went from float to string:

Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +---
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content)  defaults:
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -  quantiles:
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -  - error: 0.001
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -    quantile: 0.99
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -  - error: 0.001
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -    quantile: 0.95
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -  - error: 0.001
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -    quantile: 0.75
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -  - error: 0.005
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) -    quantile: 0.5
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content)    timer_type: summary
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +  quantiles:
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +  - quantile: '0.99'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +    error: '0.001'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +  - quantile: '0.95'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +    error: '0.001'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +  - quantile: '0.75'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +    error: '0.001'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +  - quantile: '0.50'
Feb 17 20:03:32 ms-be2050 p...5]: (/...e[/etc/prometheus/statsd_exporter.conf]/content) +    error: '0.005'

The fields are strings in puppet, so it makes sense that the values are written quoted in the file. This change is due to the move from ordered_yaml to to_yaml; I'll send a patch to fix the values in puppet. cc @jhathaway, as there might be other yaml configs lurking around that (previously) worked by chance.
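To illustrate the type difference, here is a minimal sketch in plain Ruby using the standard library's YAML module. It is not the puppet to_yaml function itself, and the hash contents are hypothetical values copied from the diff above; it only shows that a string that looks like a number is emitted quoted and reloads as a string, while a float is emitted bare and reloads as a float.

```ruby
require 'yaml'

# Hypothetical data mirroring one quantile entry from the diff above.
# In the first hash the values are strings, in the second they are floats.
as_strings = { 'quantile' => '0.99', 'error' => '0.001' }
as_floats  = { 'quantile' => 0.99,  'error' => 0.001 }

strings_yaml = as_strings.to_yaml
floats_yaml  = as_floats.to_yaml

# Number-like strings are quoted on output so that they round-trip as strings.
puts strings_yaml
# Floats are written as bare scalars.
puts floats_yaml

# Reloading shows the type a YAML consumer would see in each case.
reloaded_strings = YAML.load(strings_yaml)
reloaded_floats  = YAML.load(floats_yaml)

puts reloaded_strings['quantile'].class
puts reloaded_floats['quantile'].class
```

A consumer that expects a float for these fields, as statsd-exporter does here, would accept the second form and reject the first.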

This problem is in part an unfortunate consequence of running an old version of statsd-exporter that doesn't support checking its config (support that would allow us to use validate_cmd in puppet).
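As a rough illustration of the kind of check that was missing, the sketch below parses a config and reports quantile fields that are not numeric. It is an assumption-laden stand-in, not what the exporter or the eventual patch does: the helper name non_numeric_quantiles is hypothetical, and the defaults/quantiles layout is taken only from the diff above.

```ruby
require 'yaml'

# Hypothetical helper: returns a description of every quantile field
# under defaults.quantiles whose value is not a number.
def non_numeric_quantiles(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  quantiles = config.dig('defaults', 'quantiles') || []
  problems = []
  quantiles.each_with_index do |entry, index|
    %w[quantile error].each do |key|
      value = entry[key]
      next if value.is_a?(Numeric)
      problems << "defaults.quantiles[#{index}].#{key}=#{value.inspect}"
    end
  end
  problems
end

# Config shaped like the broken output in the diff (quoted values).
broken = <<~CONFIG
  defaults:
    quantiles:
    - quantile: '0.99'
      error: '0.001'
    timer_type: summary
CONFIG

# Config shaped like the previous, working output (bare floats).
fixed = <<~CONFIG
  defaults:
    quantiles:
    - quantile: 0.99
      error: 0.001
    timer_type: summary
CONFIG

broken_problems = non_numeric_quantiles(broken)
fixed_problems  = non_numeric_quantiles(fixed)

puts broken_problems
puts "fixed config ok" if fixed_problems.empty?
```

A check like this only covers the one field type that broke here; a validation mode in the exporter itself would be the more complete option.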

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 765203 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: fix quantile config value type

https://gerrit.wikimedia.org/r/765203

Change 765203 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: fix quantile config value type

https://gerrit.wikimedia.org/r/765203

Mentioned in SAL (#wikimedia-operations) [2022-02-23T09:02:44Z] <godog> bounce prometheus-statsd-exporter on C:prometheus::statsd_exporter - T302372

@fgiunchedi very sorry about the breakage, I wish I had caught that in the review.

No worries @jhathaway! A combination of factors meant the deployment would fail silently too :( i.e. there were no puppet failures, or we would have noticed.