PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL
Open, High, Public

Description

The alert "PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL" is known to have fired a couple of times a month in the past. I believe this is meta-monitoring of Icinga being up at both primary WMF sites, run from a third location (wikitech-static, hosted by a third-party provider). Because cross-site monitoring is more susceptible to network issues, and because these alerts fired so rarely, I never gave them much thought.

However, yesterday the alert went off 8 times in the space of 24 hours. I am creating this ticket for awareness, for 2 reasons:

  • Evaluate whether the alerts are legitimate: did we lose redundancy of our current main alerting system, or is this a consequence of failing reliability of the wikitech-static server, the third-party service provider, or the network link? If there is an ongoing cause, notify the right team (DC ops, netops, observability, cloud, etc.) to try to increase the reliability of the service.
  • If the upstream issue cannot be resolved (e.g. it is outside of our control), consider increasing the number of failed probes required before sending a notification (e.g. increasing the number of failed soft states, coordinating with a 4th location before alerting, etc.)

This is, I think, a low-priority issue, as it is not currently causing any problem other than alert spam, but it is worth tracking on a task because spam fatigue can, in the long term, cause people to ignore the alerts. Feel free to close if this is a known issue that e.g. will be handled in a different way.

Event Timeline

Volans subscribed.

The alert has fired 22 times over the weekend. AFAICT all of them occurred around the :34 minute mark, which seems suspiciously close to the start time of the timer for sync-icinga-state.service, which fires at the :33 minute mark.
I think it could be caused by a slower sync, or by Icinga taking longer to get restarted after the sync.
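One way to confirm the correlation would be to line up the timer schedule and the service's runs against the alert timestamps. A sketch (the unit names come from the comment above; the exact journal output will vary):

alert2001:~# systemctl list-timers sync-icinga-state.timer
alert2001:~# journalctl -u sync-icinga-state.service --since "2 days ago" -o short-iso | grep -E 'Starting|Finished'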

Thanks @Volans, to me that would indicate a problem in the way the alert is set up/coordinated, and less so a problem of the infrastructure itself (even if latency is a factor too). Hopefully someone from the observability team can take a deeper look.

Volans triaged this task as High priority.Jul 5 2022, 12:20 PM

FYI this is still ongoing. We got 9 emails in the last 24 hours.

Looks like recoveries are consistently arriving 2m after the alert. I've increased the threshold from 3 checks with a 10-second pause to 5 checks with a 60-second pause; let's see if this addresses the flapping.

wikitech-static:~# crontab -l | grep alert2001
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org
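
For reference, the arithmetic behind the change (the per-try timings come from the flags above; the >1m sync duration is from a later comment): the old setting gave up after roughly 3 tries x 10s, about 30s, which is shorter than the window during which Icinga is stopped for the hourly sync, so every sync could trip the alert. The new setting tolerates roughly 5 tries x 60s, about 5 minutes of downtime, before notifying.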

@herron with the new setting we could have up to 3 checks in flight at the same time, because the crontab runs every 2 minutes while each run can now take up to ~5 minutes (5 tries with a 60s sleep). That's also a factor-of-10 increase of the alert threshold.
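
If overlapping runs are a problem in themselves, one common way to serialize a cron job (a sketch, not what was deployed) is to wrap it in flock(1), which skips a run while the previous one still holds the lock; the lock file path here is illustrative:

*/2 * * * * /usr/bin/flock -n /var/lock/check_icinga_alert2001.lock /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org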

From a quick check of my inbox, the alert had fired:

  • 1 time in the Oct-Dec 2021 period
  • 3 times in the Jan-Mar 2022 period
  • 9 times in the Apr-Jun 2022 period
  • 32 times since Jul 4th early morning UTC

So it seems that there was a slight upward trend in alerts, and then it exploded on the 4th. IMHO it would be useful to know what changed and why.

In addition to the dates above I see occasional flapping from this check in my inbox going all the way back to 2019.

The sync job (which stops Icinga for the duration) kicks off each hour at :33 and currently takes >1m to complete. I think it's safe to say the 30s threshold used to check the active host is too short for the standby.
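
To quantify that window, one could pull Icinga's stop/start pairs around :33 from the journal on the standby host. A sketch (assuming the unit is named icinga.service; the exact message text may differ):

alert2001:~# journalctl -u icinga.service --since "7 days ago" -o short-iso | grep -E 'Stopped|Started'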

Happy to look deeper if the flapping persists or the service remains in a down state.

Updated to /5 from /2 to avoid multiple checks in flight.

I see that now the crontab entries are:

*/5 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org

This means that if we fail over from alert1001 to alert2001, we'll probably start getting the alerts for alert1001, and it also means that we'd allow the active Icinga service (alert2001 at that point) to be potentially down for up to 10 minutes before alerting.
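
For context, here is how I get to ~10 minutes (my reading of the crontab above, not a measured figure): a failure that starts just after one run of the */5 job begins is first probed by the next run up to 5 minutes later, and that run only notifies after its 5 tries with 60s sleeps, roughly 4-5 more minutes, so about 10 minutes end to end.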

In general I think that we have 2 distinct problems here:

1) Apparently something changed in the sync process starting on July 4th so that it triggers the alert much more often.
2) The check is prone to generate false positive alerts during the sync process.

As for the first one, it's not immediately clear from the logs. I've grepped for some of the usual Icinga messages on restart and didn't notice any difference before/after July 4th. See the output of commands like:

zgrep 'Icinga 1.14.2 starting' /var/log/syslog* | cut -d ":" -f2- | sort | egrep ' [0-9]{2}:3[0-9]:'
zgrep 'Finished daemonizing' /var/log/syslog* | cut -d ":" -f2- | sort | egrep ' [0-9]{2}:3[0-9]:'
zgrep 'Event loop started' /var/log/syslog* | cut -d ":" -f2- | sort | egrep ' [0-9]{2}:3[0-9]:'
As for the second one, we can solve it in many ways as we control both systems. For example:

  • Reduce the Icinga downtime by changing the script from stop -> rsync -> start to rsync to a temporary location -> stop -> mv -> start
  • Have the sync script drop a static file in the Apache document root with content like MAINTENANCE STARTED ${TIMESTAMP}; the check_icinga script on wikitech-static could then check it and skip the check entirely while the maintenance is ON and the timestamp is not too old. The sync script would remove the file only at the end of the maintenance (or change its content to a different text). A rough sketch follows this list.
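
To make the second option concrete, a rough sketch (the flag path, URL, and surrounding scripts are all hypothetical; the real sync and check scripts would need adapting):

# On the alert host, wrapped around the existing sync steps:
FLAG=/var/www/html/maintenance.txt    # served by the local Apache
echo "MAINTENANCE STARTED $(date -u +%s)" > "$FLAG"
systemctl stop icinga
# ... existing rsync of the Icinga state ...
systemctl start icinga
rm -f "$FLAG"

# On wikitech-static, before invoking check_icinga:
now=$(date -u +%s)
ts=$(curl -sf https://alert2001.wikimedia.org/maintenance.txt | awk '{print $3}')
if [ -n "$ts" ] && [ $((now - ts)) -lt 600 ]; then
    exit 0    # fresh maintenance flag (<10 min old): skip this check run
fi
/usr/local/bin/check_icinga --tries 5 --sleep 60 alert2001.wikimedia.org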

My 2 cents