Create an automated alert for 'too many nodes depooled from a service'
Open, Medium, Public

Event Timeline

akosiaris triaged this task as Medium priority. Feb 13 2020, 11:07 AM
akosiaris added a project: serviceops-radar.
akosiaris subscribed.

Note that we currently have such an alert (or at least something close to it).

The code itself lives in pybal, at https://github.com/wikimedia/PyBal/blob/942a31290326a635294a905e19d8cef2e7cc6181/pybal/instrumentation.py#L66 (icinga is already configured to alert on it; it's the check_pybal_backends alert). We can adapt it to do what we want. However, from a cursory reading of the code, we seem to already have what we want? Or am I missing something?

What I would have liked to see is an alert saying that over half of our cache capacity in eqiad was configured-depooled, which was the case for about 24h. That shouldn't be a normal situation.

The alert you reference did fire... but only around Feb 11th 21:02 -- 21:07, when cp1089 was the sole remaining pooled cp-text and then started failing under load.

✔️ cdanis@icinga1001.wikimedia.org /var/log 🕘☕ zgrep -F 'PyBal backends health' syslog* | grep -F lvs10 | grep cp
syslog.2.gz:Feb 11 21:02:51 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:03:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:04:47 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:05:27 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1013;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1016;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled

From IRC

akosiaris: ok, I think I am understanding now what you want to see as an alert. pybal /alerts alerts on the "operational" side of it, you want an alert based on the "configuration" side of it
cdanis: yes

So the pybal /alerts won't cover this one; we need to code something.
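As a starting point for discussion, here is a minimal sketch of what a "configuration side" check could look like. It is not taken from PyBal or conftool: the function name, the input shape (a mapping of host name to its configured pooled value), the assumption that "yes" is the only value that counts as pooled, and the 50% threshold are all assumptions made for illustration. Fetching the configured state from the real config store is deliberately left out.

```python
from typing import Mapping, Tuple

# Assumption: a host counts as configured-pooled only when its value is "yes";
# anything else is treated as configured-depooled.
POOLED = "yes"


def check_configured_depooled(
    pool_state: Mapping[str, str], threshold: float = 0.5
) -> Tuple[str, str]:
    """Return (status, message) for one service.

    pool_state maps host name -> configured pooled value (hypothetical shape).
    The status is CRITICAL when more than `threshold` of the hosts are
    configured-depooled, regardless of what the health checks say.
    """
    if not pool_state:
        return "UNKNOWN", "no hosts configured for this service"

    depooled = sorted(h for h, state in pool_state.items() if state != POOLED)
    total = len(pool_state)
    fraction = len(depooled) / total

    if fraction > threshold:
        return (
            "CRITICAL",
            f"{len(depooled)}/{total} hosts configured-depooled: "
            + ", ".join(depooled),
        )
    return "OK", f"{len(depooled)}/{total} hosts configured-depooled"


if __name__ == "__main__":
    # Hypothetical state: only cp1089 is pooled; the other host names are
    # placeholders, not real inventory.
    example = {
        "cp1089": "yes",
        "host-b": "no",
        "host-c": "no",
        "host-d": "inactive",
    }
    status, message = check_configured_depooled(example)
    print(f"{status} - {message}")
```

The point of the sketch is that it only looks at the configured state, so it would have fired during the ~24h window described above, long before the last pooled host started failing its health checks.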

Removing SRE, has already been triaged to a more specific SRE subteam

Aklapper added a subscriber: Joe.

@Joe: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!