The alert "PROBLEM: Icinga on alert2001.wikimedia.org is CRITICAL" is known to have fired a couple of times a month in the past. I believe this is meta-monitoring that checks Icinga is up on both primary WMF sites from a third location (the wikitech 3rd-party provider). Because cross-site monitoring is more susceptible to network issues, and because these alerts were rare, I never gave them much thought.
However, yesterday the alert went off 8 times in the space of 24 hours. I am creating this ticket for awareness, for 2 reasons:
- Evaluate whether the alerts are legitimate: did we lose redundancy of our current main alerting system, or is this a consequence of declining reliability of the wikitech server, the 3rd-party service provider, or the network link? If there is an ongoing cause, notify the right team (dc ops, netops, observability, cloud, etc.) to try to increase the reliability of the service.
- If the upstream issue cannot be resolved (e.g. it is outside of our control), consider increasing the number of failed probes required before sending a notification (e.g. increasing the number of failed soft states, coordinating with a fourth location before alerting, etc.)
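
To illustrate the second point, here is a rough sketch of how requiring more consecutive failed probes (soft states) before notifying would cut down on alerts from short network blips. It is only a simplified model, not the actual Icinga logic or our configuration: the function name `notifications` and the probe history are made up for illustration, and `max_check_attempts` is used here only as a parameter name modelled on the Icinga setting of the same name.

```python
from typing import Iterable, List


def notifications(results: Iterable[bool], max_check_attempts: int) -> List[int]:
    """Return the indices of the checks at which a PROBLEM notification is sent.

    Simplified model (an assumption, not Icinga's real state machine):
    a notification is sent once when the number of consecutive failed
    checks reaches max_check_attempts, and not again until a recovery.
    """
    sent = []
    consecutive_failures = 0
    in_hard_problem = False
    for i, ok in enumerate(results):
        if ok:
            # Recovery: reset the soft-state counter and the problem flag.
            consecutive_failures = 0
            in_hard_problem = False
            continue
        consecutive_failures += 1
        if consecutive_failures >= max_check_attempts and not in_hard_problem:
            in_hard_problem = True
            sent.append(i)
    return sent


# Hypothetical probe history: True = OK, False = failed probe.
history = [True, False, True, False, False, True,
           False, False, False, False, True]

print(notifications(history, 1))  # [1, 3, 6]: every blip notifies
print(notifications(history, 3))  # [8]: only the sustained failure notifies
```

The trade-off is that a real outage is also reported later, by roughly the check interval multiplied by the extra attempts.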
This is, I think, a low-priority issue, as it is not currently causing any problem other than alert spam, but I am tracking it in a task because alert fatigue can, in the long term, cause people to ignore these alerts. Feel free to close this if it is a known issue and, e.g., will be handled in a different way.