PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL
Open, High, Public

Description

The alert "PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL" is known to have fired a couple of times a month in the past. I believe this is meta-monitoring of Icinga being up at both primary WMF sites, run from a third location (wikitech-static, hosted by a third-party provider). Because cross-site monitoring is more susceptible to network issues, and because these alerts fired so rarely, I never gave them much thought.

However, yesterday the alert went off 8 times in the space of 24 hours. I am creating this ticket for awareness, for 2 reasons:

  • Evaluate whether the alerts are legitimate: did we lose redundancy of our current main alerting system, or is this a consequence of failing reliability of the wikitech-static server, the third-party service provider, or the network link? If there is an ongoing cause, notify the right team (DC ops, netops, observability, cloud, etc.) to try to increase the reliability of the service.
  • If the upstream issue cannot be resolved (e.g. it is outside of our control), consider increasing the number of failed probes required before sending a notification (e.g. increasing the number of failed soft states, coordinating with a 4th location before alerting, etc.)

This is, I think, a low-priority issue, as it is not currently causing any problem other than alert spam, but it is worth tracking on a task because spam fatigue can, in the long term, cause people to ignore the alerts. Feel free to close if this is a known issue that e.g. will be handled in a different way.

Event Timeline

Volans subscribed.

The alert has fired 22 times over the weekend. AFAICT all of them occurred around the :34 minute mark, which seems suspiciously close to the start time of the timer for sync-icinga-state.service, which fires at the :33 minute mark.
I think it could be caused by a slower sync, or by Icinga taking longer to get restarted after the sync.
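One way to confirm the correlation would be to line up the timer schedule and the service's runs against the alert timestamps. A sketch (the unit names come from the comment above; the exact journal output will vary):

alert2001:~# systemctl list-timers sync-icinga-state.timer
alert2001:~# journalctl -u sync-icinga-state.service --since "2 days ago" -o short-iso | grep -E 'Starting|Finished'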

Thanks @Volans, to me that would indicate a problem in the way the alert is set up/coordinated, and less so a problem of the infrastructure itself (even if latency is a factor too). Hopefully someone from the observability team can take a deeper look.

Volans triaged this task as High priority.Jul 5 2022, 12:20 PM

FYI this is still ongoing. We got 9 emails in the last 24 hours.

Looks like recoveries are consistently arriving 2m after the alert. I've increased the threshold from 3 checks with a 10-second pause to 5 checks with a 60-second pause; let's see if this addresses the flapping.

wikitech-static:~# crontab -l | grep alert2001
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org
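
For reference, the arithmetic behind the change (the per-try timings come from the flags above; the >1m sync duration is from a later comment): the old setting gave up after roughly 3 tries x 10s, about 30s, which is shorter than the window during which Icinga is stopped for the hourly sync, so every sync could trip the alert. The new setting tolerates roughly 5 tries x 60s, about 5 minutes of downtime, before notifying.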

@herron with the new setting we could have up to 3 checks in flight at the same time, because the crontab runs every 2 minutes while each run can now take up to ~5 minutes (5 tries with a 60s sleep). That's also a factor-of-10 increase of the alert threshold.
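
If overlapping runs are a problem in themselves, one common way to serialize a cron job (a sketch, not what was deployed) is to wrap it in flock(1), which skips a run while the previous one still holds the lock; the lock file path here is illustrative:

*/2 * * * * /usr/bin/flock -n /var/lock/check_icinga_alert2001.lock /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org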

From a quick check of my inbox, the alert had fired:

  • 1 time in the Oct-Dec 2021 period
  • 3 times in the Jan-Mar 2022 period
  • 9 times in the Apr-Jun 2022 period
  • 32 times since Jul 4th early morning UTC

So it seems that there was a slight upward trend in alerts, and then it exploded on the 4th. IMHO it would be useful to know what changed and why.

In addition to the dates above I see occasional flapping from this check in my inbox going all the way back to 2019.

The sync job (which stops Icinga for the duration) kicks off each hour at :33 and currently takes >1m to complete. I think it's safe to say the 30s threshold used to check the active host is too short for the standby.
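
To quantify that window, one could pull Icinga's stop/start pairs around :33 from the journal on the standby host. A sketch (assuming the unit is named icinga.service; the exact message text may differ):

alert2001:~# journalctl -u icinga.service --since "7 days ago" -o short-iso | grep -E 'Stopped|Started'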

Happy to look deeper if the flapping persists or the service remains in a down state.

Updated to /5 from /2 to avoid multiple checks in flight.

I see that now the crontab entries are:

*/5 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org

This means that if we fail over from alert1001 to alert2001, we'll probably start getting the alerts for alert1001, and it also means that we'd allow the active Icinga service (alert2001 at that point) to be potentially down for up to 10 minutes before alerting.
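
For context, here is how I get to ~10 minutes (my reading of the crontab above, not a measured figure): a failure that starts just after one run of the */5 job begins is first probed by the next run up to 5 minutes later, and that run only notifies after its 5 tries with 60s sleeps, roughly 4-5 more minutes, so about 10 minutes end to end.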

In general I think that we have 2 distinct problems here:

1) Apparently something changed in the sync process starting on July 4th so that it triggers the alert much more often.
2) The check is prone to generate false positive alerts during the sync process.

As for the first one, it's not immediately clear from the logs. I've grepped for some of the usual Icinga messages on restart and didn't notice any difference before/after July 4th. See the output of commands like:

zgrep 'Icinga 1.14.2 starting' /var/log/syslog* | cut -d ":" -f2- | sort | egrep ' [0-9]{2}:3[0-9]:'
zgrep 'Finished daemonizing' /var/log/syslog* | cut -d ":" -f2- | sort | egrep ' [0-9]{2}:3[0-9]:'
zgrep 'Event loop started' /var/log/syslog* | cut -d ":" -f2- | sort | egrep ' [0-9]{2}:3[0-9]:'
As for the second one, we can solve it in many ways as we control both systems. For example:

  • Reduce the Icinga downtime by changing the script from stop -> rsync -> start to rsync to a temporary location -> stop -> mv -> start
  • Have the sync script drop a static file in the Apache document root with content like MAINTENANCE STARTED ${TIMESTAMP}; the check_icinga script on wikitech-static could then check it and skip the check entirely while the maintenance is ON and the timestamp is not too old. The sync script would remove the file only at the end of the maintenance (or change its content to a different text). A rough sketch follows this list.
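
To make the second option concrete, a rough sketch (the flag path, URL, and surrounding scripts are all hypothetical; the real sync and check scripts would need adapting):

# On the alert host, wrapped around the existing sync steps:
FLAG=/var/www/html/maintenance.txt    # served by the local Apache
echo "MAINTENANCE STARTED $(date -u +%s)" > "$FLAG"
systemctl stop icinga
# ... existing rsync of the Icinga state ...
systemctl start icinga
rm -f "$FLAG"

# On wikitech-static, before invoking check_icinga:
now=$(date -u +%s)
ts=$(curl -sf https://alert2001.wikimedia.org/maintenance.txt | awk '{print $3}')
if [ -n "$ts" ] && [ $((now - ts)) -lt 600 ]; then
    exit 0    # fresh maintenance flag (<10 min old): skip this check run
fi
/usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org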

My 2 cents