
Alert that should have paged via VictorOps was delayed because of partial networking outage
Open, Medium, Public

Description

During https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-22_eqiad_return_path_timeouts a paging alert fired, but did not actually make it to VictorOps in time because of the same partial networking outage that triggered the alert in the first place. The alert did make it to IRC, which was thankfully enough to get SRE eyes on it.

One idea that @CDanis suggested is to use the OOB network as a backup for sending these alerts to VO.

There was also some weirdness with the "BGP status on cr2-eqiad" alert in Icinga, which issued a recovery but never recorded the alert being down / in a problem state. The tentative theory is that it was affected by the same networking issues.

Event Timeline

Restricted Application added a subscriber: Aklapper.

AFAIK we're still alerting by just sending emails to VO instead of using their API. As such we don't have any confirmation that the page was actually delivered, and no indication of whether we should retry.
I think we should first migrate the alerting system to use their API, so that we have real-time confirmation that a page has been delivered to the VO infrastructure. We could then add to that mechanism retry logic for transient failures, as well as whatever fallbacks we see fit in case of repeated failures, such as using a different path to reach the VO API.
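
For illustration, here's a minimal sketch of what an API-based paging path with delivery confirmation and retries could look like. The endpoint shape, API key, and routing key below are placeholders/assumptions, not our actual integration:

```python
#!/usr/bin/env python3
"""Minimal sketch: page via the VictorOps REST endpoint with confirmation and retries.

Assumptions (not part of this task): the endpoint URL shape, API key and routing key
are placeholders; real values would come from configuration. Uses `requests`.
"""
import time

import requests

# Hypothetical placeholders -- substitute real values from configuration.
VO_API_KEY = "REDACTED"
VO_ROUTING_KEY = "sre-batphone"
VO_URL = f"https://alert.victorops.com/integrations/generic/20131114/alert/{VO_API_KEY}/{VO_ROUTING_KEY}"


def send_page(entity_id: str, message: str, retries: int = 5, backoff: float = 10.0) -> bool:
    """Send a CRITICAL page and return True only on confirmed delivery."""
    payload = {
        "message_type": "CRITICAL",
        "entity_id": entity_id,          # stable ID so VO can deduplicate repeats
        "entity_display_name": entity_id,
        "state_message": message,
    }
    for attempt in range(1, retries + 1):
        try:
            resp = requests.post(VO_URL, json=payload, timeout=10)
            if resp.ok:
                return True              # VO confirmed receipt of the page
        except requests.RequestException:
            pass                         # transient failure (DNS, timeout, ...)
        time.sleep(backoff * attempt)    # back off before retrying
    return False                         # caller can fall back (e.g. OOB path, email)
```

The important property is the return value: the caller knows whether VO confirmed receipt and can decide to retry or fall back (for example via the OOB network) instead of relying on fire-and-forget email.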

Icinga itself should periodically check that it can reach the VO API and expose this check to the external monitoring, so that the external monitoring also pages when Icinga is reachable but can't reach the VO API.

In this scenario, if VO has issues with their API, we should get an email from the external monitoring and an IRC alert from Icinga itself. That is why we should check VO API reachability periodically and not only when we send a page.
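
A sketch of such a check in the shape of a standard Icinga/NRPE plugin. Assumptions: any HTTP response from the VO alert host (even a 4xx) counts as reachable, and only DNS/connect/TLS failures are CRITICAL; a dedicated health endpoint, if VO provides one, would be preferable:

```python
#!/usr/bin/env python3
"""Sketch of an Icinga/NRPE-style check for VO API reachability."""
import sys

import requests

VO_HOST = "https://alert.victorops.com/"

OK, CRITICAL = 0, 2  # standard Nagios/Icinga plugin exit codes


def main() -> int:
    try:
        resp = requests.get(VO_HOST, timeout=10)
    except requests.RequestException as exc:
        print(f"CRITICAL: cannot reach VO API: {exc}")
        return CRITICAL
    print(f"OK: VO API reachable (HTTP {resp.status_code})")
    return OK


if __name__ == "__main__":
    sys.exit(main())
```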

To clarify, the alert did make it to VO after a delay (https://portal.victorops.com/ui/wikimedia/incident/1595/details). The alert message sat in the deferred queue due to a DNS lookup failure until the retry time was reached. Here's the related MX log:

2021-10-22 20:15:53 1me0xF-00BtlD-9v <= root@wikimedia.org H=alert1001.wikimedia.org [2620:0:861:3:208:80:154:88]:52394 I=[2620:0:861:3:208:80:154:76]:25 P=esmtp K S=1352 id=E1me0xF-0004ju-9A@alert1001.wikimedia.org
2021-10-22 20:15:56 1me0xF-00BtlD-9v == redacted+icinga@alert.victorops.com R=dnslookup defer (-1) DT=0s: host lookup did not complete
2021-10-22 20:26:32 1me0xF-00BtlD-9v == redacted+icinga@alert.victorops.com routing defer (-51) DT=0s: retry time not reached
2021-10-22 20:34:32 1me0xF-00BtlD-9v == redacted+icinga@alert.victorops.com routing defer (-51) DT=0s: retry time not reached
2021-10-22 20:50:00 1me0xF-00BtlD-9v == redacted+icinga@alert.victorops.com routing defer (-51) DT=0s: retry time not reached
2021-10-22 21:07:29 1me0xF-00BtlD-9v H=mxa.mailgun.org [52.38.190.177]: Remote host closed connection in response to initial connection
2021-10-22 21:07:30 1me0xF-00BtlD-9v => redacted+icinga@alert.victorops.com R=dnslookup T=remote_smtp_signed S=2056 H=mxa.mailgun.org [35.165.139.76] I=[208.80.154.76] X=TLS1.2:ECDHE_SECP256R1__RSA_SHA256__AES_256_GCM:256 CV=yes DN="C=US,ST=California,L=San Francisco,O=MAILGUN TECHNOLOGIES\, INC,OU=MAILGUN TECHNOLOGIES\, INC,CN=*.mailgun.org" C="250 Great success" DT=1s
2021-10-22 21:07:30 1me0xF-00BtlD-9v Completed

A few additional ideas to help ensure alerts are delivered in conditions like this:

  • Aggressively retry alert delivery to VO on both alert hosts and MXes as a "low-hanging fruit" improvement to the current setup
  • More robust (and Icinga-independent) external monitoring
  • Active/active alerting with deduplication performed by VO (multiple sites fire identical alerts, VO filters/combines them; see the sketch below)
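
To make the deduplication idea concrete, a small sketch of how multiple sites could page with the same identifier so VO collapses them into one incident. This assumes VO deduplicates on the entity_id field (as in the REST sketch above); the alert/host names are illustrative:

```python
"""Sketch of the active/active idea: every site pages with the same entity_id."""
import hashlib


def dedup_entity_id(alert_name: str, affected_host: str) -> str:
    """Derive a stable ID from alert attributes only -- no per-site component."""
    key = f"{alert_name}:{affected_host}".encode()
    return hashlib.sha256(key).hexdigest()[:16]


# Both alerting sites compute the same ID for the same underlying problem,
# so VO can filter/combine the duplicate pages into a single incident:
assert dedup_entity_id("BGP status", "cr2-eqiad") == dedup_entity_id("BGP status", "cr2-eqiad")
```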

Change 734391 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] exim: aggressively retry messages to alert.victorops.com addresses

https://gerrit.wikimedia.org/r/734391

Change 734391 merged by Herron:

[operations/puppet@production] exim: aggressively retry messages to alert.victorops.com addresses

https://gerrit.wikimedia.org/r/734391

Change 735039 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] default_mail_relay: add VO retry config to production host MTAs

https://gerrit.wikimedia.org/r/735039

Change 735039 merged by Herron:

[operations/puppet@production] default_mail_relay: add VO retry config to production host MTAs

https://gerrit.wikimedia.org/r/735039

colewhite triaged this task as Medium priority. Nov 8 2021, 10:38 PM
herron renamed this task from "Alert that should have paged did not reach VictorOps because of partial networking outage" to "Alert that should have paged via VictorOps was delayed because of partial networking outage". Dec 7 2021, 2:37 PM
fgiunchedi added a subscriber: fgiunchedi.

Removing from the immediate o11y backlog/workboard as this has been mitigated; it will be completely resolved once T305847: Migrate SRE paging alerts off Icinga and to Alertmanager is completed.