Icinga downtimes not working
Open, Stalled, Needs Triage, Public

Description

During maintenance on db1181, "profile::monitoring::notifications_enabled: false" was set, but the host still paged. @RKemper reported something similar on IRC last night:

<ryankemper> Anyone been having trouble downtiming hosts in Icinga? For example I just tried manually downtiming `elastic2059` and never saw it actually take effect in Icinga
<ryankemper> Feels like Icinga is getting the request fine but then failing to actually process it in the backend
<ryankemper> ^ Following up I now see those downtimes in place so it does seem like the icinga backend is just really backlogged

Followups

  • (critical) Alert on "max check latency" shooting up
  • Depending on how frequently the above fires, consider auto-remediation (i.e. an Icinga restart)
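
A minimal sketch of the two followups above, to make the intended logic concrete. It is not the alert that was actually deployed: the threshold, the number of consecutive samples, and the function names are all hypothetical, and it assumes latency samples (in seconds) are already collected at a regular interval.

```python
from typing import Iterable, List, Tuple

# Hypothetical values, not taken from this task. The thread only says that
# ~60s is usual and that 30+ min was seen during the incident.
THRESHOLD_SECONDS = 300.0
SUSTAINED_SAMPLES = 3


def breach_episodes(
    samples: Iterable[float],
    threshold: float = THRESHOLD_SECONDS,
    sustained: int = SUSTAINED_SAMPLES,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indexes of every run in which latency stays
    above `threshold` for at least `sustained` consecutive samples."""
    episodes: List[Tuple[int, int]] = []
    run_start = None  # index of the first sample of the current run, if any
    run_length = 0

    for index, value in enumerate(samples):
        if value > threshold:
            if run_start is None:
                run_start = index
            run_length += 1
            continue
        # Latency is back under the threshold: close the current run.
        if run_start is not None and run_length >= sustained:
            episodes.append((run_start, run_start + run_length - 1))
        run_start = None
        run_length = 0

    # A run that is still ongoing at the end of the data also counts.
    if run_start is not None and run_length >= sustained:
        episodes.append((run_start, run_start + run_length - 1))
    return episodes


def worth_auto_remediating(episode_count: int, tolerated: int = 2) -> bool:
    """Hypothetical policy: consider auto-remediation (e.g. an Icinga restart)
    only if the alert fired more often than `tolerated` times in the period
    under review."""
    return episode_count > tolerated


if __name__ == "__main__":
    # Made-up samples: normal (~60s), a sustained spike, then recovery.
    latencies = [60, 58, 400, 1200, 1850, 1900, 90, 61]
    episodes = breach_episodes(latencies)
    print(episodes)
    print(worth_auto_remediating(len(episodes)))
```

Requiring several consecutive samples above the threshold keeps a single slow check from paging, which matters if a restart is later wired to the same signal.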

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-08-02T06:46:20Z] <godog> bounce icinga on alert1001 - T314353

Indeed check max latency spiked up to 30+ min (!) around that time

Is this just a duplicate/continuation of T196336 ?

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:22:36Z] <godog> bounce icinga on alert2001 - T314353

> Is this just a duplicate/continuation of T196336 ?

It is definitely possible, I don't know ATM

A restart of Icinga on alert1001 brought things back, AFAICT. The "max check latency" (so far the only signal I have found of something being wrong) went back down within ~30 min of the restart and is now around 60s, as usual.

> Indeed check max latency spiked up to 30+ min (!) around that time

I feel like T196336 is/was also caused by general latency.

Change 820072 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: alert on Icinga max check latency

https://gerrit.wikimedia.org/r/820072

Change 820072 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: alert on Icinga max check latency

https://gerrit.wikimedia.org/r/820072

fgiunchedi changed the task status from Open to Stalled. Thu, Aug 4, 8:01 AM
fgiunchedi updated the task description.

We are now alerting on elevated max check latency. I'm going to stall the task and re-evaluate in a couple of months whether we need to deploy auto-remediation as well.