
Icinga paged for a host that should have been downtimed
Open, Medium, Public

Description

During an automated schema change, db1112 was downtimed but then a replica lag alert fired anyway.

Relevant SAL entries, all 2022-05-28:

18:41 ladsgroup@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
18:41 ladsgroup@cumin1001: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance
18:41 ladsgroup@cumin1001: dbctl commit (dc=all): 'Depooling db1112 (T309311)', diff saved to https://phabricator.wikimedia.org/P28851 and previous config saved to /var/cache/conftool/dbconfig/20220528-184125-ladsgroup.json

So the six-hour downtime should have lasted until 00:41.

Then, it paged:

19:05:40 <+icinga-wm> PROBLEM - MariaDB Replica Lag: s3 #page on db1112 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1391.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

When I opened https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=db1112&scroll=187 at 19:08, I saw the expected downtime:

image.png (295×1 px, 72 KB)

When @Marostegui looked at the same time, it appeared to him that the host was not downtimed; then he refreshed and it was. Neither of us changed anything at the time.

Architecturally I can't think of any reason Icinga downtimes would be eventually-consistent or anything like that, but maybe I'm missing something -- is there any reason a page would still fire 24 minutes after the downtime began? Anything we can do to prevent it?

Event Timeline

Here is the log from Icinga's perspective on alert1001:

# grep db1112 /srv/icinga-logs/icinga-05-29-2022-00.log | perl -pe 's/\[(\d+)\]/localtime($1)/e'
...
Sat May 28 18:41:17 2022 EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;db1112;1653763275;1653784875;1;0;21600;ladsgroup@cumin1001;Maintenance
Sat May 28 18:41:17 2022 EXTERNAL COMMAND: SCHEDULE_HOST_SVC_DOWNTIME;db1112;1653763275;1653784875;1;0;21600;ladsgroup@cumin1001;Maintenance
Sat May 28 18:42:57 2022 SERVICE ALERT: db1112;MariaDB Replica IO: s3 #page;WARNING;SOFT;1;WARNING slave_io_state Slave_IO_Running: No
Sat May 28 18:43:39 2022 SERVICE ALERT: db1112;MariaDB Replica SQL: s3 #page;WARNING;SOFT;1;WARNING slave_sql_state Slave_SQL_Running: No
Sat May 28 18:45:11 2022 SERVICE ALERT: db1112;MariaDB Replica IO: s3 #page;WARNING;SOFT;2;WARNING slave_io_state Slave_IO_Running: No
Sat May 28 18:45:29 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;WARNING;SOFT;1;WARNING slave_sql_lag Replication lag: 181.68 seconds
Sat May 28 18:45:51 2022 SERVICE ALERT: db1112;MariaDB Replica SQL: s3 #page;WARNING;SOFT;2;WARNING slave_sql_state Slave_SQL_Running: No
Sat May 28 18:47:23 2022 SERVICE ALERT: db1112;MariaDB Replica IO: s3 #page;WARNING;HARD;3;WARNING slave_io_state Slave_IO_Running: No
Sat May 28 18:47:41 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;2;CRITICAL slave_sql_lag Replication lag: 314.31 seconds
Sat May 28 18:48:05 2022 SERVICE ALERT: db1112;MariaDB Replica SQL: s3 #page;WARNING;HARD;3;WARNING slave_sql_state Slave_SQL_Running: No
Sat May 28 18:49:39 2022 SERVICE ALERT: db1112;MariaDB sustained replica lag on s3;UNKNOWN;SOFT;1;NaN
Sat May 28 18:49:57 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;3;CRITICAL slave_sql_lag Replication lag: 449.29 seconds
Sat May 28 18:51:49 2022 SERVICE ALERT: db1112;MariaDB sustained replica lag on s3;UNKNOWN;SOFT;2;NaN
Sat May 28 18:52:07 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;4;CRITICAL slave_sql_lag Replication lag: 579.32 seconds
Sat May 28 18:54:03 2022 SERVICE ALERT: db1112;MariaDB sustained replica lag on s3;UNKNOWN;SOFT;3;NaN
Sat May 28 18:54:19 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;5;CRITICAL slave_sql_lag Replication lag: 712.44 seconds
Sat May 28 18:56:17 2022 SERVICE ALERT: db1112;MariaDB sustained replica lag on s3;UNKNOWN;SOFT;4;NaN
Sat May 28 18:56:35 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;6;CRITICAL slave_sql_lag Replication lag: 847.61 seconds
Sat May 28 18:58:33 2022 SERVICE ALERT: db1112;MariaDB sustained replica lag on s3;UNKNOWN;HARD;5;NaN
Sat May 28 18:58:51 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;7;CRITICAL slave_sql_lag Replication lag: 983.48 seconds
Sat May 28 19:01:07 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;8;CRITICAL slave_sql_lag Replication lag: 1118.47 seconds
Sat May 28 19:03:23 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;SOFT;9;CRITICAL slave_sql_lag Replication lag: 1254.48 seconds
Sat May 28 19:05:39 2022 SERVICE ALERT: db1112;MariaDB Replica Lag: s3 #page;CRITICAL;HARD;10;CRITICAL slave_sql_lag Replication lag: 1391.14 seconds
Sat May 28 19:05:39 2022 SERVICE NOTIFICATION: victorops;db1112;MariaDB Replica Lag: s3 #page;CRITICAL;vo-notify-by-email;CRITICAL slave_sql_lag Replication lag: 1391.14 seconds
...etc. (list of contacts elided)
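The perl one-liner above simply rewrites bracketed epoch timestamps (e.g. `[1653763277]`) as local-time strings before printing. The same transformation can be sketched in Python; `humanize_epochs` is a hypothetical helper name, not part of any existing tooling:

```python
import re
import time


def humanize_epochs(line: str) -> str:
    """Rewrite bracketed epoch timestamps, e.g. "[1653763277] SERVICE ALERT: ...",
    as local-time strings, mirroring the perl substitution used above."""
    return re.sub(
        r"\[(\d+)\]",
        lambda m: time.strftime("%a %b %d %H:%M:%S %Y", time.localtime(int(m.group(1)))),
        line,
    )
```

Piping each log line through this function reproduces the human-readable timestamps shown in the excerpt.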

Off the top of my head I can't think of any obvious reason why Icinga decided to send out the notification; for now I'm tempted to chalk it up to an Icinga bug/race condition

One random note: I think if a host is depooled (https://noc.wikimedia.org/dbconfig/eqiad.json), it shouldn't page under any condition. We page for user impact, and something that isn't pooled can't have user impact. I know it's easier said than done, though. Just a thought.
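A minimal sketch of that "don't page if depooled" idea, as a pre-notification check against the dbconfig data. The `sectionLoads` shape below is a simplified assumption for illustration; the real JSON served at noc.wikimedia.org/dbconfig/eqiad.json has more structure, and the sample hosts are made up:

```python
import json


def is_pooled(dbconfig: dict, host: str) -> bool:
    """Return True if `host` appears in any section's load groups.

    Assumes a simplified dbconfig shape like:
    {"sectionLoads": {"s3": [{"hostA": 0}, {"hostB": 100}]}}
    """
    for groups in dbconfig.get("sectionLoads", {}).values():
        for group in groups:
            if host in group:
                return True
    return False


# Hypothetical snapshot taken after db1112 was depooled: it no longer
# appears in s3, so a pooled-check would suppress the page.
sample = json.loads('{"sectionLoads": {"s3": [{"db1123": 0}, {"db1175": 100}]}}')
```

A paging pipeline could consult such a check and downgrade alerts for unpooled hosts to non-paging notifications.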

> Off the top of my head I can't think of any obvious reason why Icinga decided to send out the notification; for now I'm tempted to chalk it up to an Icinga bug/race condition

The downtime command was never executed by Icinga (I have no idea why) because it didn't log the usual line:

icinga: HOST DOWNTIME ALERT: $HOSTNAME;STARTED; Host has entered a period of scheduled downtime

What we can do is add polling in the cookbook: get the Icinga status and return only once the host is reported as downtimed.
We could do the same for all the services too, if deemed necessary (to check that the SCHEDULE_HOST_SVC_DOWNTIME command was also applied), although that would make the patch slightly more complex.
Let me know if you want me to make this patch.
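The polling idea above can be sketched as a small wait loop. `get_downtimed_hosts` here stands in for whatever status query the cookbook already has access to (e.g. something that parses Icinga's status output); it is a hypothetical callable, not an existing Spicerack API:

```python
import time


def wait_for_downtimed(get_downtimed_hosts, host, timeout=60.0, interval=2.0):
    """Poll until `host` shows up in the set of downtimed hosts, or time out.

    get_downtimed_hosts: hypothetical zero-argument callable returning the
    set of host names Icinga currently reports as downtimed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if host in get_downtimed_hosts():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{host} was not reported as downtimed within {timeout}s")
        time.sleep(interval)
```

With a check like this, the cookbook would only proceed with the maintenance once Icinga had actually applied the downtime, rather than trusting that the external command was processed.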

> Off the top of my head I can't think of any obvious reason why Icinga decided to send out the notification; for now I'm tempted to chalk it up to an Icinga bug/race condition

> The downtime command was never executed by Icinga (I have no idea why) because it didn't log the usual line:
>
> icinga: HOST DOWNTIME ALERT: $HOSTNAME;STARTED; Host has entered a period of scheduled downtime

Indeed, thank you for pointing that out. It's even more puzzling that folks did see the downtime in the UI (for the host, it seems; not sure about the services)

> What we can do is add polling in the cookbook: get the Icinga status and return only once the host is reported as downtimed.
> We could do the same for all the services too, if deemed necessary (to check that the SCHEDULE_HOST_SVC_DOWNTIME command was also applied), although that would make the patch slightly more complex.
> Let me know if you want me to make this patch.

Since this is hopefully rare, personally I think we should focus on moving pages off Icinga altogether (i.e. T305847). Of course, if we run into this problem again soon we'll need to reevaluate.

> Off the top of my head I can't think of any obvious reason why Icinga decided to send out the notification; for now I'm tempted to chalk it up to an Icinga bug/race condition

> The downtime command was never executed by Icinga (I have no idea why) because it didn't log the usual line:
>
> icinga: HOST DOWNTIME ALERT: $HOSTNAME;STARTED; Host has entered a period of scheduled downtime

> Indeed, thank you for pointing that out. It's even more puzzling that folks did see the downtime in the UI (for the host, it seems; not sure about the services)

Originally I was the only one who didn't see the downtime in the UI. But after a refresh, it showed up :-/

> Since this is hopefully rare, personally I think we should focus on moving pages off Icinga altogether (i.e. T305847). Of course, if we run into this problem again soon we'll need to reevaluate.

As you want. Just to be clear: the check for the host being downtimed is a couple of lines of code; all the abstraction is already in Spicerack. For the services it would be a few more lines, but probably under 20 ;)

> Since this is hopefully rare, personally I think we should focus on moving pages off Icinga altogether (i.e. T305847). Of course, if we run into this problem again soon we'll need to reevaluate.

> As you want. Just to be clear: the check for the host being downtimed is a couple of lines of code; all the abstraction is already in Spicerack. For the services it would be a few more lines, but probably under 20 ;)

Thank you! If you have the time/bandwidth then why not; I can review if needed!

MoritzMuehlenhoff subscribed.

Severity is unclear to me from just reading the task, but since we dislike unnecessary pages, I'm tentatively setting priority to "High". But if this is, e.g., a corner case we are unlikely to hit again any time soon, please lower as needed.

fgiunchedi lowered the priority of this task from High to Medium.Jun 6 2022, 8:20 AM

> Severity is unclear to me from just reading the task, but since we dislike unnecessary pages, I'm tentatively setting priority to "High". But if this is, e.g., a corner case we are unlikely to hit again any time soon, please lower as needed.

Thank you Moritz, lowering to normal since we haven't experienced this very often AFAIK

Instead of adding a quick check in the downtime cookbook only, I preferred to add the feature to Spicerack directly, so that all places using the downtime capabilities automatically get the same benefit.

Change 803317 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] icinga: ensure that the downtime was applied

https://gerrit.wikimedia.org/r/803317

Change 803317 merged by Volans:

[operations/software/spicerack@master] icinga: ensure that the downtime was applied

https://gerrit.wikimedia.org/r/803317

I was planning to close it when the new Spicerack is released with the patch... it is not yet deployed to prod. But up to you.