cr2-eqord disappeared from the network earlier today (2019-05-29) for about 70mins (04:00-05:09). As far as I know this was not user-impacting, aside from perhaps increased latency for some North America destinations that we may reach only through eqord.
Root cause are a combination of:
- Unplanned outage of one its transports, that remained unhandled for ~4 days;
- Unplanned outage of the remaining two transports.
…resulting into a triple failure.
The timeline seems to be:
- 2019-05-24 06:41: cr2-eqord<->cr1-eqiad transport stops carrying traffic, but interfaces remain up. Root cause is unknown, likely a vendor issue. OSPF & BFD checks notice and alert. Traffic reroutes (possibly adding latency for some routes?).
- 2019-05-29 04:00 cr2-eqord<->cr3-ulsfo & cr2-eqord <->cr2-codfw interfaces go down. All transports are now inoperable, and cr2-eqord disappears from the network. No user impact.
- 2019-05-29 04:23 Vendor issues "disturbance information" without us reaching out, acknowledging it's on their end, and suspect a "cable fault issue".
- 2019-05-29 05:09 Service on cr2-eqord<->cr3-ulsfo & cr2-eqord <->cr2-codfw is restored. cr2-eqord becomes reachable again.
- 2019-05-29 05:22 Faidon notices the alerts and begins investigating.
- 2019-05-29 05:48 Vendor informs us of the service on those two wavelengths having been restored.
- 2019-05-29 05:52 Faidon reaches out to vendor about the third wavelength (cr2-eqord<->cr1-eqiad transport)
- 2019-05-29 06:13 Vendor acknowledges, begins investigating
- 2019-05-29 07:13 Vendor confirms the issue since May 24th, "bounces the interface", which restores the service, restoring cr2-eqord's full network redundancy.