Create an automated alert for 'too many nodes depooled from a service'
Open, Medium, Public

Event Timeline

akosiaris triaged this task as Medium priority. Feb 13 2020, 11:07 AM
akosiaris added a project: serviceops-radar.
akosiaris subscribed.

Note that we currently have such an alert (or at least something close to it).

The code itself lives in pybal at https://github.com/wikimedia/PyBal/blob/942a31290326a635294a905e19d8cef2e7cc6181/pybal/instrumentation.py#L66 (icinga is already configured to alert on it; it's the check_pybal_backends alert). We can adapt it to do what we want. However, from a cursory reading of the code, we seem to already have what we want? Or am I missing something?

What I would have liked to see is some alert that over half our cache capacity in eqiad was configured-depooled, as that was the case for about 24h. That shouldn't be a normal situation.
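
A minimal sketch of that kind of check, assuming the configured state of each server is available as a mapping from hostname to a pooled value of `yes`, `no` or `inactive` (how that mapping is fetched is left out). The function names, the 0.5 threshold and the `cp-a`/`cp-b`/`cp-c` hostnames are hypothetical, not existing code:

```python
from typing import Mapping


def depooled_fraction(pooled_state: Mapping[str, str]) -> float:
    """Return the share of servers whose configured state is not 'yes'."""
    if not pooled_state:
        raise ValueError("no servers configured for this service")
    depooled = sum(1 for state in pooled_state.values() if state != "yes")
    return depooled / len(pooled_state)


def too_many_depooled(pooled_state: Mapping[str, str], threshold: float = 0.5) -> bool:
    """True when more than `threshold` of the servers are configured-depooled."""
    return depooled_fraction(pooled_state) > threshold


# Hypothetical snapshot: one server pooled, three depooled by configuration.
snapshot = {"cp1089": "yes", "cp-a": "no", "cp-b": "no", "cp-c": "inactive"}
print(too_many_depooled(snapshot))
```

The comparison is strict, so a service with exactly half of its servers depooled would not alert under this sketch; whether that is the right boundary is a judgment call.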

The alert you reference did fire... but only around Feb 11th 21:02 -- 21:07, when cp1089 was the sole remaining pooled cp-text and then started failing under load.

✔️ cdanis@icinga1001.wikimedia.org /var/log 🕘☕ zgrep -F 'PyBal backends health' syslog* | grep -F lvs10 | grep cp
syslog.2.gz:Feb 11 21:02:51 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:03:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:04:47 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:05:27 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1013;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1016;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
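
For anyone wanting to check the time window from the log rather than by eye, here is a rough sketch of a parser for the `SERVICE ALERT` lines above. The regular expression is inferred from the lines shown here, and the `Alert` type and `parse_alert` name are made up for illustration; the sample line is shortened from the output above:

```python
import re
from typing import NamedTuple, Optional

ALERT_RE = re.compile(
    r"^(?:[^:\s]+:)?"                        # optional "syslog.2.gz:" prefix added by zgrep
    r"(?P<when>\w{3} +\d+ \d\d:\d\d:\d\d) "  # syslog timestamp (no year)
    r"\S+ icinga: SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


class Alert(NamedTuple):
    when: str
    host: str
    state: str
    state_type: str  # SOFT while retrying, HARD once the check has settled
    attempt: int


def parse_alert(line: str) -> Optional[Alert]:
    """Parse one SERVICE ALERT line; return None for anything else."""
    m = ALERT_RE.match(line)
    if m is None:
        return None
    return Alert(m["when"], m["host"], m["state"], m["state_type"], int(m["attempt"]))


line = (
    "syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE ALERT: "
    "lvs1013;PyBal backends health check;CRITICAL;HARD;3;"
    "PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp1089.eqiad.wmnet "
    "are marked down but pooled"
)
print(parse_alert(line))
```

`SERVICE NOTIFICATION` lines are deliberately skipped, since they repeat the state of the alert line that triggered them.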

From IRC

akosiaris: ok, I think I am understanding now what you want to see as an alert. pybal /alerts alerts on the "operational" side of it, you want an alert based on the "configuration" side of it
cdanis: yes

So, pybal /alerts won't cover this one; we need to code something.
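
To make the operational vs. configuration distinction concrete, here is a sketch that assumes each server carries two independent flags: `pooled` (what the configuration says) and `up` (what the health checks report). The `Server` type, both function names and the `cp-a`/`cp-b` hostnames are hypothetical and are not taken from pybal:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    name: str
    pooled: bool  # configuration side: should this server receive traffic?
    up: bool      # operational side: did the last health check pass?


def down_but_pooled(servers: List[Server]) -> List[str]:
    """Operational signal, similar in spirit to the existing alert."""
    return [s.name for s in servers if s.pooled and not s.up]


def configured_depooled(servers: List[Server]) -> List[str]:
    """Configuration signal: servers taken out of rotation on purpose."""
    return [s.name for s in servers if not s.pooled]


# Hypothetical state: most servers depooled by configuration, all of them healthy.
servers = [
    Server("cp1089", pooled=True, up=True),
    Server("cp-a", pooled=False, up=True),
    Server("cp-b", pooled=False, up=True),
]
print(down_but_pooled(servers))      # nothing for the operational alert to report
print(configured_depooled(servers))  # yet two of three servers are out of rotation
```

In this state the operational signal stays empty until the last pooled server starts failing, which matches what happened with cp1089.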

Removing SRE; this has already been triaged to a more specific SRE subteam.