NOTE: Description updated after closer inspection of the logs indicated the issue was a device failure rather than a link failure.
Juniper EX4300-48T device ASW2-A8-EQIAD failed unexpectedly on Friday Aug 13th. This caused a total disconnection for all hosts connected to this device. Because that device also connects mr1-eqiad, and thus the production networks, to our direct management subnet in eqiad, the outage also cut off all communication with the management subnet at this site.
The outage appears to have lasted from ~14:29 to ~14:37 (roughly 8 minutes).
DC Ops have confirmed nobody was on site at this time. It was also noted that Daniel Zahn was installing new mw servers (mw145[3-6]), which are connected to asw2-a8-eqiad, at the same time. I fail to see how that work could have made the other link fail, but I am recording it just in case.
Opening this ticket to record the incident (it is a worrying thing to just happen) and also to discuss the setup in general.
A few things strike me when thinking about what happened:
- Although management isn't mission-critical, the blast radius (in terms of alerts firing) is quite large when it is unavailable.
- It would, perhaps, be better to have two routed links from mr routers to CR, to keep connectivity to the management network up if one link drops.
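
To illustrate the second point, here is a minimal sketch of the reasoning. It is not a model of the real eqiad topology: the node names ("cr", "asw", "mr") and both link lists are simplified assumptions made up for illustration. It only checks whether the management router stays reachable from the core router when the switch fails.

```python
from collections import deque


def reachable(links, src, dst, failed=None):
    """Return True if dst can be reached from src over undirected links,
    treating the node named in `failed` (if any) as down."""
    if failed in (src, dst):
        return False
    # Build an adjacency map, dropping every link that touches the failed node.
    adj = {}
    for a, b in links:
        if failed in (a, b):
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # Plain breadth-first search from src.
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topologies (assumptions, not taken from real configs).
# Single path: the management router is reached only through the switch.
current = [("cr", "asw"), ("asw", "mr")]
# Same, plus a second routed link that does not pass through the switch.
proposed = current + [("cr", "mr")]

print(reachable(current, "cr", "mr", failed="asw"))   # False
print(reachable(proposed, "cr", "mr", failed="asw"))  # True
```

With a single path the switch is a single point of failure for management connectivity. With a second link that avoids the switch, losing the switch no longer isolates the management router.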