Icinga downtimes not working
Closed, Resolved (Public)

Description

During maintenance on db1181, "profile::monitoring::notifications_enabled: false" was set, but the host still paged. @RKemper reported a similar problem on IRC last night:

<ryankemper> Anyone been having trouble downtiming hosts in Icinga? For example I just tried manually downtiming `elastic2059` and never saw it actually take effect in Icinga
<ryankemper> Feels like Icinga is getting the request fine but then failing to actually process it in the backend
<ryankemper> ^ Following up I now see those downtimes in place so it does seem like the icinga backend is just really backlogged

Followups

  • (critical) Alert on "max check latency" shooting up
  • Depending on how frequently the above fires, consider auto-remediation (i.e. an Icinga restart)
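For illustration only, the first followup could be expressed as a "sustained breach" condition: fire only when max check latency stays above a threshold for several consecutive samples, so that a single slow check cycle does not page. This is a minimal sketch, not the rule that was deployed. The function name, the 300s threshold and the 3-sample window are all assumptions; the task itself only says that latency is normally around 60s and spiked to 30+ min during the incident.

```python
from typing import Sequence

# Assumed values, not taken from the task: the thread only states that
# latency is usually ~60s and reached 30+ minutes during the incident.
THRESHOLD_SECONDS = 300.0
MIN_CONSECUTIVE_SAMPLES = 3


def latency_alert_firing(
    samples: Sequence[float],
    threshold: float = THRESHOLD_SECONDS,
    min_consecutive: int = MIN_CONSECUTIVE_SAMPLES,
) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`.

    `samples` holds max check latency readings in seconds, oldest first.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough history yet to call the breach sustained.
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


if __name__ == "__main__":
    # Sustained backlog, similar in shape to the incident (30+ min latency).
    print(latency_alert_firing([60, 65, 1800, 1900, 2000]))
    # One elevated reading that recovers on its own.
    print(latency_alert_firing([60, 420, 60]))
```

Requiring consecutive breaches trades a few minutes of detection delay for fewer false alarms on spikes that recover on their own.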

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-08-02T06:46:20Z] <godog> bounce icinga on alert1001 - T314353

Indeed check max latency spiked up to 30+ min (!) around that time

Is this just a duplicate/continuation of T196336 ?

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:22:36Z] <godog> bounce icinga on alert2001 - T314353

> Is this just a duplicate/continuation of T196336 ?

It is definitely possible, I don't know ATM

A restart of Icinga on alert1001 brought things back AFAICT. The "max check latency" (so far the only signal I have found of something being wrong) went back down within ~30 min of the restart and is now around 60s as usual.

> Indeed check max latency spiked up to 30+ min (!) around that time

I feel like T196336 is/was also caused by general latency.

Change 820072 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: alert on Icinga max check latency

https://gerrit.wikimedia.org/r/820072

Change 820072 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: alert on Icinga max check latency

https://gerrit.wikimedia.org/r/820072

fgiunchedi changed the task status from Open to Stalled. Aug 4 2022, 8:01 AM
fgiunchedi updated the task description.

We are now alerting on elevated max check latency. I'm going to stall the task and re-evaluate in a couple of months whether we need to deploy auto-remediation as well.
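If auto-remediation is ever deployed, it would probably need a guard against restart loops. The sketch below is one hypothetical way to decide whether a restart is warranted. It only makes the decision and does not restart anything. `RemediationPolicy`, `should_restart` and both default durations are assumptions made for illustration; the only figure taken from this task is that latency took ~30 min to come back down after a restart, so the cooldown is set well above that.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RemediationPolicy:
    # Assumed: how long the latency alert must have been firing before acting.
    min_firing_seconds: float = 15 * 60.0
    # Assumed: minimum gap between restarts. Kept well above the ~30 min that
    # latency needed to recover after the manual restart in this task.
    cooldown_seconds: float = 6 * 3600.0


def should_restart(
    firing_for_seconds: float,
    seconds_since_last_restart: Optional[float],
    policy: RemediationPolicy = RemediationPolicy(),
) -> bool:
    """Decide whether an automated Icinga restart would be warranted.

    `seconds_since_last_restart` is None when no earlier restart is known.
    """
    if firing_for_seconds < policy.min_firing_seconds:
        # Short spikes are left to self-recover.
        return False
    if (
        seconds_since_last_restart is not None
        and seconds_since_last_restart < policy.cooldown_seconds
    ):
        # A recent restart may still be taking effect, so avoid a loop.
        return False
    return True


if __name__ == "__main__":
    print(should_restart(firing_for_seconds=20 * 60, seconds_since_last_restart=None))
    print(should_restart(firing_for_seconds=20 * 60, seconds_since_last_restart=10 * 60))
```

Keeping the decision in a pure function makes the policy easy to test before wiring it to anything that actually touches the service.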

lmata triaged this task as Medium priority. Sep 6 2022, 6:00 PM

Max check latency did shoot up over the last weekend (though only to 7 min) and self-recovered as far as I can tell:

[Screenshot: 2022-10-10-100445_1176x327_scrot.png]

fgiunchedi claimed this task.

Except for that spike, it looks like check latency is under control (and going down, as we progressively remove more and more checks from Icinga). I'm optimistically resolving the task.