Description
When discussing solutions to the parent task (the flood of alerts on IRC during incidents), it became apparent that we still want to be aware of individual hosts being down from pybal's perspective. Since overall service health is already monitored via network probes, we can be a little more lax in how we alert in this case.
Details
| Status | Assigned | Task |
|---|---|---|
| Resolved | fgiunchedi | T314118 Reduce IRC flood/spam during incidents |
| Resolved | fgiunchedi | T320627 Alert on individual pybal backend hosts being down for a long time |
| Resolved | fgiunchedi | T321191 Cleanup pybal Prometheus metrics on monitor stop() |
Event Timeline
Change 841905 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/alerts@master] sre: test warning on pybal backends being down for long
Change 841905 merged by Filippo Giunchedi:
[operations/alerts@master] sre: test warning on pybal backends being down for long
Change 844429 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/alerts@master] sre: remove 'host' label from PybalBackendDown
Change 844429 merged by Filippo Giunchedi:
[operations/alerts@master] sre: remove 'host' label from PybalBackendDown
Change 902690 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/alerts@master] sre: check pybal_monitor_down_results_total for PybalBackendDown
Change 902690 merged by jenkins-bot:
[operations/alerts@master] sre: check pybal_monitor_down_results_total for PybalBackendDown
The alert works well AFAICS. It is currently at warning level, though it can easily be bumped to critical at any time.
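For readers unfamiliar with this kind of alert, the "down for a long time" condition can be sketched roughly as follows. This is an illustration only, not the rule that was merged in operations/alerts: the threshold value, the `Alert` type, the `long_down_backends` function and the sample layout are all hypothetical, and the real alert is evaluated by Prometheus rather than by Python code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical threshold: the task does not state the real duration.
DOWN_FOR_SECONDS = 2 * 60 * 60


@dataclass
class Alert:
    backend: str
    severity: str
    down_seconds: float


def long_down_backends(
    samples: Dict[str, List[Tuple[float, bool]]],
    now: float,
    threshold: float = DOWN_FOR_SECONDS,
    severity: str = "warning",
) -> List[Alert]:
    """Return alerts for backends continuously down for at least `threshold` seconds.

    `samples` maps a backend name to time-ordered (timestamp, is_down) pairs,
    as seen from the load balancer's point of view (assumed input shape).
    """
    alerts: List[Alert] = []
    for backend, series in sorted(samples.items()):
        down_since: Optional[float] = None
        for ts, is_down in series:
            if is_down:
                # Remember only the start of the current down streak.
                if down_since is None:
                    down_since = ts
            else:
                # Any "up" sample resets the streak.
                down_since = None
        if down_since is not None and now - down_since >= threshold:
            alerts.append(Alert(backend, severity, now - down_since))
    return alerts
```

Brief flaps never reach the threshold because any "up" sample resets the streak, which is the lax behaviour the description asks for. Bumping the alert to critical would correspond to passing `severity="critical"` in this sketch.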