
Investigate cr2-eqord's disconnection from the rest of the network
Closed, Resolved · Public · 0 Estimated Story Points

Description

cr2-eqord disappeared from the network earlier today (2019-05-29) for about 70 minutes (04:00-05:09). As far as I know this was not user-impacting, aside from perhaps increased latency to some North American destinations that we may reach only through eqord.

The root cause was a combination of:

  • Unplanned outage of one of its transports, which remained unhandled for ~4 days;
  • Unplanned outage of the remaining two transports.

…resulting in a triple failure.

The timeline seems to be:

  • 2019-05-24 06:41: cr2-eqord<->cr1-eqiad transport stops carrying traffic, but interfaces remain up. Root cause is unknown, likely a vendor issue. OSPF & BFD checks notice and alert. Traffic reroutes (possibly adding latency for some routes?).
  • 2019-05-29 04:00: cr2-eqord<->cr3-ulsfo & cr2-eqord<->cr2-codfw interfaces go down. All transports are now inoperable, and cr2-eqord disappears from the network. No user impact.
  • 2019-05-29 04:23: Vendor issues "disturbance information" without us reaching out, acknowledging the issue is on their end and suspecting a "cable fault issue".
  • 2019-05-29 05:09: Service on cr2-eqord<->cr3-ulsfo & cr2-eqord<->cr2-codfw is restored. cr2-eqord becomes reachable again.
  • 2019-05-29 05:22: Faidon notices the alerts and begins investigating.
  • 2019-05-29 05:48: Vendor informs us that service on those two wavelengths has been restored.
  • 2019-05-29 05:52: Faidon reaches out to the vendor about the third wavelength (the cr2-eqord<->cr1-eqiad transport).
  • 2019-05-29 06:13: Vendor acknowledges and begins investigating.
  • 2019-05-29 07:13: Vendor confirms the issue has been present since May 24th and "bounces the interface", which restores service and with it cr2-eqord's full network redundancy.

Event Timeline

faidon triaged this task as High priority. May 29 2019, 5:36 AM
faidon created this task.
Restricted Application added a subscriber: Aklapper.

So, for the two that went down, there was no planned maintenance, but we did get an email from the vendor ("00985243 Disturbance") suggesting that this was an unplanned event.

So, for the transport to eqiad, our calendar shows a maintenance going on right now (PWIC97234), but this is actually incorrect, as the maintenance text suggests it is scheduled for 2019-May-31 (I looked up the email too). Someone on our team predicted this better than the vendor could? :) The vendor doesn't seem to have noticed that one; I'll drop them an email.

OK, so the vendor "bounced the interface" and the eqiad<->eqord traffic has been restored. What they noticed (and I confirmed) is that this interface had not been carrying traffic since May 24th.

This was actually noticed by our monitoring, which alerted on May 24th:

06:41 PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
06:41 PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status

(I'll update the task description with a timeline)

As far as I can tell, these alerts remained outstanding until today, when they finally recovered. So the alerts were outstanding for ~4 days without us noticing, as far as I know.
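
As a rough idea for catching this kind of thing earlier: a minimal sketch, assuming an Icinga 1-style status.dat file (the path, age threshold, and field selection below are illustrative assumptions, not our production setup), that flags services stuck in an unacknowledged hard CRITICAL state for more than a couple of days.

#!/usr/bin/env python3
# Sketch: flag Icinga services stuck in a hard CRITICAL state for too long.
# The status.dat path, the age threshold and the exact fields used here are
# assumptions for illustration, not our actual production configuration.
import time

STATUS_DAT = "/var/icinga/status.dat"  # assumed path
MAX_AGE_DAYS = 2                       # assumed threshold

def service_blocks(path):
    """Yield each 'servicestatus { ... }' block as a dict of key=value pairs."""
    block, in_block = {}, False
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("servicestatus {"):
                block, in_block = {}, True
            elif line == "}" and in_block:
                yield block
                in_block = False
            elif in_block and "=" in line:
                key, _, value = line.partition("=")
                block[key] = value

def stale_criticals(path=STATUS_DAT, max_age_days=MAX_AGE_DAYS):
    """Services in an unacknowledged hard CRITICAL state older than the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    for svc in service_blocks(path):
        if (svc.get("current_state") == "2"               # CRITICAL
                and svc.get("state_type") == "1"          # hard state
                and svc.get("problem_has_been_acknowledged") == "0"
                and int(svc.get("last_hard_state_change", "0")) < cutoff):
            yield svc.get("host_name"), svc.get("service_description")

if __name__ == "__main__":
    for host, service in stale_criticals():
        print(f"Stale CRITICAL: {host} / {service}")

Something along these lines could run periodically and mail a report; the exact mechanism is deliberately left open here.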

Also, I think the cr2-eqord host DOWN alert did not page; given how a) infrequent and b) usually serious "router down" alerts are, I wonder if this is something we should change?

Let's think about improvements and how we can avoid a situation like this in the future -- I'll reassign to @ayounsi, FYI & for his recommendations. Thanks!

1/ Outstanding alert: I think this is due to the alert being triggered right before a 3-day weekend and people not paying enough attention to active Icinga alerts. This should be tackled at an SRE-wide scope.

2/ Router down pages: I could see arguments both for and against, but given that a router going down has at least some user-facing impact (failovers, routing protocol re-convergence), could fail in a non-clean way, and given Faidon's points above, I'd say yes to pages.

3/ For that specific case, we could add a last-resort GRE tunnel between Chicago and Ashburn, as sketched below.
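
To illustrate the reasoning behind 3/: in a deliberately simplified model of the backbone (only the sites mentioned in this task, with assumed eqiad<->codfw and codfw<->ulsfo links; not the real topology), losing all three eqord transports isolates the site, while a last-resort eqord<->eqiad tunnel keeps it reachable. A rough sketch:

# Rough connectivity check on a simplified topology: only the sites in this
# task are modelled, and the eqiad<->codfw / codfw<->ulsfo links are assumed
# for illustration; this is not the real backbone graph.
from collections import deque

LINKS = {
    ("eqord", "eqiad"),   # cr2-eqord<->cr1-eqiad transport
    ("eqord", "ulsfo"),   # cr2-eqord<->cr3-ulsfo transport
    ("eqord", "codfw"),   # cr2-eqord<->cr2-codfw transport
    ("eqiad", "codfw"),   # assumed
    ("codfw", "ulsfo"),   # assumed
}

def reachable(start, links):
    """Return the set of sites reachable from `start` over `links` (BFS)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# All three eqord transports down, as in this incident: eqord is isolated.
remaining = {link for link in LINKS if "eqord" not in link}
print("eqord" in reachable("eqiad", remaining))                         # False

# Same failure, but with a last-resort GRE tunnel eqord<->eqiad added.
print("eqord" in reachable("eqiad", remaining | {("eqord", "eqiad")}))  # True

This only helps, of course, if the tunnel takes a path that does not share fate with the three wavelengths, which is presumably the point of making it a last-resort option.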

Change 514332 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Network monitoring, make core router down page

https://gerrit.wikimedia.org/r/514332

Mentioned in SAL (#wikimedia-operations) [2019-06-18T12:37:49Z] <XioNoX> merge puppet change to make all router down alerts paging - T224535

Change 514332 merged by Ayounsi:
[operations/puppet@production] Network monitoring, make core router down page

https://gerrit.wikimedia.org/r/514332

Opened T226158 for the tunnel. Everything else here is done.