Today we had a brief interruption in connectivity from production networks to direct management interfaces connected via mr1-eqiad / msw1-eqiad.

NOTE: Description updated after closer inspection of logs indicated the issue was a device failure rather than a link failure.

Juniper EX4300-48T device asw2-a8-eqiad failed unexpectedly on Friday Aug 13th. This caused a total disconnection for all hosts connected to this device. As that device also connects mr1-eqiad, via the GigE copper link from asw2-a8-eqiad ge-8/0/10 to mr1-eqiad ge-0/0/1 (https://netbox.wikimedia.org/dcim/cables/1621/), and thus production networks to our direct management subnet in eqiad, the outage also brought down all communications to our management subnet at this site. This link is the main connection from production subnets into management.
Unfortunately the Juniper devices on both sides rolled over their logs before they could be retrieved on the command line. Additionally, because syslogs sent by the devices traverse this link, the logs are not available in logstash/Kibana either. The last logs there for the affected devices stop at 14:29:40.
Both devices do show that the flap happened on this link, however, recovering at 14:36:58:
```
root@mr1-eqiad> show interfaces ge-0/0/1 | match "^Phy|flapped"
Physical interface: ge-0/0/1, Enabled, Physical link is Up
Last flapped : 2021-08-13 14:36:58 UTC (00:50:26 ago)
```
```
cmooney@asw2-a-eqiad> show interfaces ge-8/0/10 | match "^Physical|flapped"
Physical interface: ge-8/0/10, Enabled, Physical link is Up
Last flapped : 2021-08-13 14:36:58 UTC (00:50:46 ago)
```
Duration appears to have been from ~14:29 to ~14:37.
DC Ops have confirmed nobody was on site at this time. They also noted that Daniel Zahn was installing new mw servers (mw145[3-6]), which are connected to asw2-a8-eqiad, at the same time. I fail to see how that could have made the other link fail, but recording it just in case.
Opening this ticket to record the incident (it is a worrying thing to just happen), and also to discuss the setup in general.
A few things strike me thinking about what happened:
- Although management isn't mission-critical, the blast radius (in terms of alerts firing) is quite large when it is unavailable.
- It would, perhaps, be better to have two routed links from the mr routers to the CRs, to keep connectivity to the management network up if one link drops.
- Given how quickly the local JunOS logs roll over, the availability of remote syslogs becomes more critical during/after incidents.
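On the last point, two mitigations worth discussing are keeping larger local log archives on the devices and sending syslogs to an additional collector that does not depend on this single link. A rough sketch of the relevant JunOS configuration follows; the file sizes, counts, and collector address below are illustrative placeholders, not our actual values:

```
# Keep larger/more local log archives so they survive longer before rolling over
set system syslog file messages any info
set system syslog file messages archive size 10m files 20

# Also send syslogs to a second collector, ideally reachable via a different path
# (10.64.0.10 is a placeholder address, not a real host)
set system syslog host 10.64.0.10 any notice
```

This would not have prevented the outage, but it would have preserved the device-side logs needed to confirm whether the failure was the device or the link.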