Create an automated alert for 'too many nodes depooled from a service'
Open, Medium, Public

Assigned To

Authored By

	CDanis
	Feb 12 2020, 9:19 PM

Description

Action item (AI) from https://wikitech.wikimedia.org/wiki/Incident_documentation/20200211-caching-proxies

Related Objects

Mentioned In: T245059: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled

Event Timeline

CDanis created this task. Feb 12 2020, 9:19 PM

Restricted Application added a subscriber: Aklapper. Feb 12 2020, 9:19 PM

Note that we currently have such an alert (or at least something close to it).

The code in pybal is at https://github.com/wikimedia/PyBal/blob/942a31290326a635294a905e19d8cef2e7cc6181/pybal/instrumentation.py#L66, and icinga is already configured to alert on it (it's the check_pybal_backends alert). We can adapt it to do what we want. However, from a cursory reading of the code, we already seem to have what we want. Or am I missing something?

In T245058#5880313, @akosiaris wrote:

We can adapt it to do what we want. However, from a cursory reading of the code, we already seem to have what we want. Or am I missing something?

What I would have liked to see is some alert that over half our cache capacity in eqiad was configured-depooled, as that was the case for about 24h. That shouldn't be a normal situation.

The alert you reference did fire... but only around Feb 11th 21:02 -- 21:07, when cp1089 was the sole remaining pooled cp-text and then started failing under load.

✔️ cdanis@icinga1001.wikimedia.org /var/log 🕘☕ zgrep -F 'PyBal backends health' syslog* | grep -F lvs10 | grep cp
syslog.2.gz:Feb 11 21:02:51 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:03:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;1;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:04:47 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:05:27 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;SOFT;2;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE ALERT: lvs1013;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:06:45 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1013;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE ALERT: lvs1016;PyBal backends health check;CRITICAL;HARD;3;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled
syslog.2.gz:Feb 11 21:07:31 icinga1001 icinga: SERVICE NOTIFICATION: irc;lvs1016;PyBal backends health check;CRITICAL;notify-service-by-irc;PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1089.eqiad.wmnet are marked down but pooled

From IRC

akosiaris: ok, I think I am understanding now what you want to see as an alert. pybal /alerts alerts on the "operational" side of it, you want an alert based on the "configuration" side of it
cdanis: yes

So the pybal /alerts won't cover this one; we need to code something.
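As a starting point for the "configuration side" check discussed above, here is a minimal sketch in Python. It is not existing pybal or conftool code. It assumes the configured pool state of each node has already been fetched from somewhere (for example conftool) and that the state takes the values "yes", "no" or "inactive". The `Node` class, the `check_configured_depooled` function, the 0.5 threshold and the choice to leave "inactive" nodes out of the denominator are all hypothetical and only illustrate the idea of alerting when over half of a service is configured-depooled.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Nagios/Icinga plugin return codes.
OK, CRITICAL = 0, 2


@dataclass(frozen=True)
class Node:
    """Configured state of one backend (hypothetical structure)."""
    name: str
    pooled: str  # assumed values: "yes", "no", "inactive"


def check_configured_depooled(
    service: str, nodes: Iterable[Node], threshold: float = 0.5
) -> Tuple[int, str]:
    """Return (status, message) based on the configured pool state only.

    This looks at what operators configured, not at health-check results,
    so it can fire while every remaining pooled node is still healthy.
    """
    # Assumption: "inactive" nodes are not meant to serve traffic at all,
    # so they do not count as capacity.
    active = [n for n in nodes if n.pooled != "inactive"]
    if not active:
        return CRITICAL, f"{service}: no active nodes configured"

    depooled = sorted(n.name for n in active if n.pooled == "no")
    fraction = len(depooled) / len(active)
    summary = f"{service}: {len(depooled)}/{len(active)} nodes configured-depooled"

    if fraction > threshold:
        return CRITICAL, f"{summary}: {', '.join(depooled)}"
    return OK, summary


if __name__ == "__main__":
    # Made-up node names, apart from cp1089 which appears in the logs above.
    example = [
        Node("cp1089", "yes"),
        Node("cp-example-1", "no"),
        Node("cp-example-2", "no"),
        Node("cp-example-3", "inactive"),
    ]
    status, message = check_configured_depooled("cp-text eqiad", example)
    print(status, message)
```

With a check along these lines, the situation described above (a single pooled cp-text node for about 24h) would have gone CRITICAL as soon as the depools were configured, rather than only once the last node started failing its health checks.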

Krinkle moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Apr 20 2020, 12:53 AM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM