Note that we currently have such an alert (or at least something close to it).
The code itself in pybal (icinga is already configured to alert on it; it's the check_pybal_backends alert) is at https://github.com/wikimedia/PyBal/blob/942a31290326a635294a905e19d8cef2e7cc6181/pybal/instrumentation.py#L66. We can adapt it to do what we want. However, from a cursory reading of the code, we seem to already have what we want? Or am I missing something?
What I would have liked to see is some alert that over half our cache capacity in eqiad was configured-depooled, as that was the case for about 24h. That shouldn't be a normal situation.
The alert you reference did fire... but only around Feb 11th 21:02 -- 21:07, when cp1089 was the sole remaining pooled cp-text and then started failing under load.
✔️ cdanis@icinga1001.wikimedia.org /var/log 🕘☕ zgrep -F 'PyBal backends health' syslog* | grep -F lvs10 | grep cp
syslog.2.gz:Feb 11 21:02:51 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:03:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:04:47 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:05:27 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1013;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1016;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
From IRC
akosiaris: ok, I think I am understanding now what you want to see as an alert. pybal /alerts alerts on the "operational" side of it, you want an alert based on the "configuration" side of it
cdanis: yes
So the pybal /alerts won't cover this one; we need to code something.
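As a starting point for discussion, here is a minimal sketch of what a "configuration side" check could look like. It is not based on existing PyBal or conftool code. It assumes we can obtain, for one service, a mapping of backend hostname to its configured pooled state, with "yes" meaning pooled and anything else ("no", "inactive") meaning configured-depooled. The function name, the 0.5 threshold, and the input shape are all hypothetical; the return value follows the usual Nagios/Icinga exit code convention (0 OK, 2 CRITICAL).

```python
from typing import Mapping, Tuple

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_configured_pooled(
    pooled_state: Mapping[str, str],
    max_depooled_fraction: float = 0.5,  # hypothetical threshold, tune per service
) -> Tuple[int, str]:
    """Return (exit_code, message) for one service.

    pooled_state maps hostname -> configured state; this shape is an
    assumption for illustration, not an existing API. Any state other
    than "yes" is counted as configured-depooled.
    """
    total = len(pooled_state)
    if total == 0:
        return CRITICAL, "CRITICAL - no backends configured"

    depooled = sorted(host for host, state in pooled_state.items() if state != "yes")
    fraction = len(depooled) / total

    if fraction > max_depooled_fraction:
        return (
            CRITICAL,
            f"CRITICAL - {len(depooled)}/{total} backends configured-depooled: "
            + ", ".join(depooled),
        )
    return OK, f"OK - {len(depooled)}/{total} backends configured-depooled"


if __name__ == "__main__":
    # Hypothetical hostnames, only to exercise the function.
    example = {"cp-a": "yes", "cp-b": "no", "cp-c": "inactive"}
    code, message = check_configured_pooled(example)
    print(code, message)
```

The comparison is strict, so exactly half depooled still reports OK; whether that boundary (and the threshold itself) is right is something we would need to agree on per cluster.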