Today we had a brief interruption in connectivity from production networks to direct management interfaces connected via mr1-eqiad / msw1-eqiad.
Root cause was a failure on the GigE copper link from asw2-a8-eqiad ge-8/0/10 to mr1-eqiad ge-0/0/1 (https://netbox.wikimedia.org/dcim/cables/1621/). This link is the main connection from production subnets into management.
Unfortunately, the Juniper devices on both sides rolled over their local logs before they could be retrieved on the command line. Additionally, because the syslogs sent by the devices traverse this link, the logs are not available in logstash/Kibana. The last logs there for the affected devices stop at 14:29:40.
Both devices do show that the flap happened on this link, however, with the link recovering at 14:36:58:
root@mr1-eqiad> show interfaces ge-0/0/1 | match "^Phy|flapped"
Physical interface: ge-0/0/1, Enabled, Physical link is Up
Last flapped : 2021-08-13 14:36:58 UTC (00:50:26 ago)
cmooney@asw2-a-eqiad> show interfaces ge-8/0/10 | match "^Physical|flapped"
Physical interface: ge-8/0/10, Enabled, Physical link is Up
Last flapped : 2021-08-13 14:36:58 UTC (00:50:46 ago)
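Since the device logs are gone, the only timing evidence is the "Last flapped" timestamps above and the last syslog seen in logstash. A rough Python sketch of how those combine is below. The function and variable names are made up for illustration, and it assumes the last received syslog (14:29:40) is the earliest the link could have gone down, so the result is an upper bound on the outage rather than an exact duration.

```python
import re
from datetime import datetime, timezone

# Matches the "Last flapped" line as printed by "show interfaces" above.
FLAP_RE = re.compile(
    r"Last flapped\s*:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC"
)


def parse_last_flapped(cli_output: str) -> datetime:
    """Return the 'Last flapped' timestamp from CLI output as an aware UTC datetime."""
    match = FLAP_RE.search(cli_output)
    if match is None:
        raise ValueError("no 'Last flapped' line found")
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Output copied from the two devices in this ticket.
mr1_output = """Physical interface: ge-0/0/1, Enabled, Physical link is Up
Last flapped : 2021-08-13 14:36:58 UTC (00:50:26 ago)"""
asw2_output = """Physical interface: ge-8/0/10, Enabled, Physical link is Up
Last flapped : 2021-08-13 14:36:58 UTC (00:50:46 ago)"""

# Last syslog received in logstash/Kibana from the affected devices.
# Assumption: the link went down at or after this moment.
last_syslog = datetime(2021, 8, 13, 14, 29, 40, tzinfo=timezone.utc)

mr1_flap = parse_last_flapped(mr1_output)
asw2_flap = parse_last_flapped(asw2_output)

# Both ends of the cable should report the same recovery time.
both_agree = mr1_flap == asw2_flap

# Upper bound on the outage: last syslog seen -> link recovery.
outage_upper_bound = mr1_flap - last_syslog

print(both_agree, outage_upper_bound)  # True 0:07:18
```

With the values in this ticket, both ends agree on the recovery time and the outage lasted at most 7 minutes 18 seconds.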
DC Ops have confirmed nobody was on site at this time. It was also noted that Daniel Zahn was installing new mw servers (mw145[3-6]), which are connected to asw2-a8-eqiad, at the same time. I fail to see how that could have made the other link fail, but I am recording it just in case.
I am opening this ticket to record the incident (it is a worrying thing to just happen), and also to discuss the setup in general.
A few things strike me thinking about what happened:
- Although management isn't mission-critical, the blast radius (in terms of alerts firing) is quite large when it is unavailable.
- It would, perhaps, be better to have two routed links from mr routers to CR, to keep connectivity to the management network up if one link drops.
- Given how quickly the local JunOS logs roll over, the availability of remote syslogs becomes more critical during and after incidents.