
Alert on missing Prometheus metrics
Closed, ResolvedPublic

Description

Today, with T325533, we started missing metrics in Prometheus, and I did not notice because I was travelling and working on other things. Let's create a simple alert that checks that we have metrics in Prometheus and fires if they are missing.

Event Timeline

See also T225739: Create Icinga check for navtiming.py service health, which added meta metrics to our Prometheus exporter in navtiming.py; SRE o11y also helped us with alerts at the time. These alerts did not fire, so I guess something went wrong there, unless those metrics genuinely kept working while only some of the others stopped?

Looking at the meta metric webperf_latest_handled_time_seconds, I see that it does indeed have a multi-hour gap (https://w.wiki/68Dc):

Screenshot 2022-12-19 at 17.06.46.png

This plots the difference between (1) the timestamp we store during the regular running of navtiming.py (set to the current clock time on the webperf server every time we process a message from Kafka) and (2) the clock time of the Prometheus server when it scraped that metric from navtiming.py. In other words, it measures the time our Python process takes to iterate and send the metrics (only a few milliseconds) plus the time between that and the next scrape. This is normally around 45-60 seconds, since Prometheus in production typically scrapes slightly more often than once a minute.
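As a rough illustration of the quantity being plotted, here is a minimal Python sketch. The function name and the sample timestamps are hypothetical and are not taken from navtiming.py or from the actual query.

```python
def handled_lag_seconds(latest_handled_time: float, scrape_time: float) -> float:
    """Seconds between the last handled Kafka message and the Prometheus scrape.

    latest_handled_time: value of webperf_latest_handled_time_seconds (epoch seconds).
    scrape_time: clock time of the Prometheus server at scrape (epoch seconds).
    """
    return scrape_time - latest_handled_time


# Hypothetical sample: a message was handled, sending took about 5 ms,
# and the next scrape happened 50 seconds after that.
latest_handled = 1_671_465_600.0
scrape = latest_handled + 0.005 + 50.0
lag = handled_lag_seconds(latest_handled, scrape)
print(f"lag: {lag:.3f} s")
```

A value that grows well beyond the scrape interval would suggest a backlog, which is the case the existing alert seems designed for.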

The alert query for this is in https://gerrit.wikimedia.org/g/operations/alerts/+/refs/heads/master/team-perf/. I suspect that when the metric isn't reported, the query either yields null or keeps yielding the diff as it was in the last minute before reporting stopped. This would correctly detect when the service is working but backlogged, for example due to slow processing, but would not detect actual downtime or a collection problem, as was the case here.
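The first suspected failure mode (no sample at all) can be sketched in Python as follows. This is a simplified model, not the real alert rule: the function names and the 300-second threshold are hypothetical. It relies on the Prometheus behaviour that an alert expression returning an empty result does not fire.

```python
from typing import Optional


def lag_alert_fires(lag_seconds: Optional[float], threshold: float) -> bool:
    """Threshold-style alert: compares the lag against a limit.

    With no sample (None), the comparison has nothing to evaluate,
    so the alert stays silent, which is the suspected gap.
    """
    if lag_seconds is None:
        return False
    return lag_seconds > threshold


def absence_alert_fires(lag_seconds: Optional[float]) -> bool:
    """Absence-style alert: fires only when there is no sample at all."""
    return lag_seconds is None


THRESHOLD = 300.0  # hypothetical limit in seconds

for label, sample in [("healthy", 50.0), ("backlogged", 900.0), ("missing", None)]:
    print(label, lag_alert_fires(sample, THRESHOLD), absence_alert_fires(sample))
```

In PromQL, the absence case is what the `absent()` function is meant for, so pairing the threshold rule with an absence rule would presumably cover both backlog and outright downtime.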

See also: T323749: Add documentation for rollback/if something fails when deploying navtiming.py.

For now I have added one alert query for the CPU benchmark metrics (it also fired when they stopped working yesterday). Let's also look into why the other alert didn't fire.

I changed the alerting so that Grafana will alert if we are missing First Contentful Paint. That should be enough for now.