Alert on individual pybal backend hosts being down for a long time
Closed, Resolved (Public)

Description

When discussing solutions to the parent task (i.e. the flood of alerts on IRC during incidents), it became apparent that we still want to be aware of individual hosts being down (from pybal's perspective). Since overall service health is already monitored via network probes, we can be a little more lax in how we alert in this case.
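
To illustrate what "down for a long time" could mean in practice, here is a minimal Python sketch. It is not the rule that lives in operations/alerts; the function name, the backend names, the sample format and the two-hour threshold are all assumptions made for the example. The idea is only that a backend is reported when its current unbroken run of "down" observations is at least as long as the threshold, so short flaps during an incident stay quiet.

```python
from datetime import datetime, timedelta


def backends_down_for(samples, threshold, now):
    """Return backends whose current unbroken 'down' run spans at least `threshold`.

    samples: dict of backend name -> list of (timestamp, is_up) tuples,
             sorted by timestamp (hypothetical format, for illustration only).
    """
    long_down = []
    for backend, history in samples.items():
        down_since = None
        for ts, is_up in history:
            if is_up:
                # Any 'up' observation resets the run.
                down_since = None
            elif down_since is None:
                # First 'down' observation of a new run.
                down_since = ts
        if down_since is not None and now - down_since >= threshold:
            long_down.append(backend)
    return sorted(long_down)


# Hypothetical data: only backend-a has been down continuously for 2h or more.
now = datetime(2022, 10, 12, 12, 0)
samples = {
    "backend-a": [(now - timedelta(hours=3), False), (now - timedelta(hours=1), False)],
    "backend-b": [(now - timedelta(hours=3), False), (now - timedelta(minutes=5), True)],
    "backend-c": [(now - timedelta(minutes=10), False)],
}
print(backends_down_for(samples, timedelta(hours=2), now))
```

With a long threshold like this, a host that recovers or only recently went down does not trigger anything, which matches the "more lax" intent since service-level health is covered elsewhere.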

Event Timeline

Change 841905 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: test warning on pybal backends being down for long

https://gerrit.wikimedia.org/r/841905

Change 841905 merged by Filippo Giunchedi:

[operations/alerts@master] sre: test warning on pybal backends being down for long

https://gerrit.wikimedia.org/r/841905

Change 844429 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: remove 'host' label from PybalBackendDown

https://gerrit.wikimedia.org/r/844429

Change 844429 merged by Filippo Giunchedi:

[operations/alerts@master] sre: remove 'host' label from PybalBackendDown

https://gerrit.wikimedia.org/r/844429

Change 902690 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: check pybal_monitor_down_results_total for PybalBackendDown

https://gerrit.wikimedia.org/r/902690

Change 902690 merged by jenkins-bot:

[operations/alerts@master] sre: check pybal_monitor_down_results_total for PybalBackendDown

https://gerrit.wikimedia.org/r/902690

The alert works well AFAICS. It is now at warning level, though it can easily be bumped to critical at any time.