Notifications when prometheus daemons are wedged
Open, High, Public

Description

We have some stats collection that only happens on master nodes of the elasticsearch clusters. While reviewing some dashboards today I noticed that one instance had stopped reporting data. After logging into the instance and testing the ports, I found that the local prometheus daemon was accepting connections but never responding. Restarting the daemon (prometheus-wmf-elasticsearch-exporter) restored stats collection.

It seems we should expect that daemons may get stuck at some point or another. The main issue here is that we had no way to know this daemon had stopped working correctly other than looking at the stats. Monitors such as the systemd service check are satisfied as long as the process is still running.

AC: Daemons are automatically restarted, and repeated failures give some sort of notice
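
As an illustration of the failure mode described above, here is a minimal sketch of a probe that tells a wedged daemon (accepts the connection, never answers) apart from one that is not listening at all. The function name `probe_exporter`, the `/metrics` path, and the 5 second timeout are assumptions for the example, not taken from the actual exporter configuration.

```python
import http.client
import socket


def probe_exporter(host: str, port: int, path: str = "/metrics",
                   timeout: float = 5.0) -> str:
    """Classify an exporter endpoint as 'ok', 'unresponsive', or 'down'.

    Hypothetical helper: the path and timeout are illustrative defaults.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()  # blocks until headers arrive or the timeout expires
        resp.read()
        return "ok" if resp.status == 200 else "down"
    except (socket.timeout, TimeoutError):
        # Timed out while connecting or while waiting for a reply;
        # a process that accepts connections but never responds ends up here.
        return "unresponsive"
    except (OSError, http.client.HTTPException):
        # Connection refused, reset, or a malformed HTTP response.
        return "down"
    finally:
        conn.close()
```

A check along these lines could run periodically and restart the service after several consecutive `unresponsive` results, which a process-liveness check alone would not catch.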

Event Timeline

MPhamWMF triaged this task as High priority. Mar 8 2021, 4:32 PM
MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.

We have a generic alert that's supposed to cover similar cases (but more widespread issues), namely alerting on "job availability", i.e. alert if a certain percentage of targets in a job are not scrapable. The related dashboard is https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets. I suspect it didn't fire in this case because the availability is still quite high even with a few targets down. I'm mentioning this as inspiration for an alert if, e.g., you need to know when even a single target is down. Hope that helps!
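
To make the difference between the two alerting styles concrete, here is a rough sketch that compares a job-level availability threshold with a per-target check. The function name `evaluate_job`, the 0.8 threshold, and the target names are all hypothetical; the input mimics per-target values of the Prometheus `up` metric (1 = scrapable, 0 = not).

```python
from typing import Dict, List, Union


def evaluate_job(up_by_target: Dict[str, int],
                 min_availability: float = 0.8) -> Dict[str, Union[float, bool, List[str]]]:
    """Compare job-level availability alerting with per-target alerting.

    Hypothetical helper: the threshold is an illustrative value only.
    """
    total = len(up_by_target)
    down = sorted(t for t, up in up_by_target.items() if up == 0)
    # Fraction of targets in the job that are currently scrapable.
    availability = (total - len(down)) / total if total else 0.0
    return {
        "availability": availability,
        # Fires only when a large share of the job is down.
        "job_alert": availability < min_availability,
        # Fires as soon as any single target is down.
        "target_alert": bool(down),
        "down_targets": down,
    }


# Ten hypothetical targets, one of which is wedged.
targets = {f"es-master-{i}": 1 for i in range(10)}
targets["es-master-3"] = 0
result = evaluate_job(targets)
```

With one target out of ten down, availability stays at 0.9, so the job-level alert stays silent while the per-target check flags `es-master-3`.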

Hello @EBernhardson, moving to radar for now. Please let us know how you'd like to proceed and if you need assistance. Thanks!