The switch cloudsw1-c8-eqiad seems to be misbehaving.
Specifically, starting at 02:18 UTC on Sep 4th we observed a period of significant instability: BFD sessions were flapping, which in turn destabilized BGP and affected traffic routing between racks. This appeared to stop at about 06:10, after which BGP and BFD have been stable:
Sep 4 02:18:13 cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.3 (External AS 4264710004) changed state from Established to Idle (event RecvNotify)
Sep 4 06:09:35 cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.1.5 (External AS 64605) changed state from OpenConfirm to Established (event RecvKeepAlive)
BFD timeout logs were present throughout and may be triggering the BGP sessions to go down:
Sep 4 02:18:14 cloudsw1-c8-eqiad bfdd[2323]: BFDD_STATE_UP_TO_DOWN: BFD Session 172.31.255.1 (IFL 634) state Up -> Down LD/RD(36/27) Up time:6w5d 05:28 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
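To see how the flapping is distributed over the 02:18 to 06:10 window, the two message types quoted above could be bucketed per hour. The following is a minimal sketch, assuming the syslog lines keep the exact layout shown in this task; the function name `count_events_per_hour` and the `SAMPLE` list are made up for illustration and are not part of any existing tooling.

```python
import re
from collections import Counter

# Matches only the two Junos syslog tags quoted in this task.
# Assumption: lines start with "Mon D HH:MM:SS host proc[pid]: TAG:".
EVENT_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+(?P<proc>\w+)\[\d+\]:\s+"
    r"(?P<tag>RPD_BGP_NEIGHBOR_STATE_CHANGED|BFDD_STATE_UP_TO_DOWN):"
)


def count_events_per_hour(lines):
    """Count BGP state-change and BFD up-to-down events per (month, day, hour, tag)."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            key = (m["month"], int(m["day"]), int(m["hour"]), m["tag"])
            counts[key] += 1
    return counts


# Hypothetical input: the log lines from this task, shortened after the tag.
SAMPLE = [
    "Sep 4 02:18:13 cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.3 ...",
    "Sep 4 02:18:14 cloudsw1-c8-eqiad bfdd[2323]: BFDD_STATE_UP_TO_DOWN: BFD Session 172.31.255.1 ...",
    "Sep 4 06:09:35 cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.1.5 ...",
]

if __name__ == "__main__":
    for key, n in sorted(count_events_per_hour(SAMPLE).items()):
        print(key, n)
```

If the BFD down events consistently precede the BGP state changes within each hour, that would support the theory that BFD is what is taking the BGP sessions down.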
Looking at the CPU usage of the device, we can see that the CPU was spiking during most of this period. This may just be increased load from BGP sessions constantly re-establishing (caused by BFD failing at a lower level).
Overall this has different symptoms, but it is not unlike the incident caused by cloudsw1-d5-codfw on August 6th (T371879), where BFD sessions suddenly started failing. Unlike that incident, the situation here appears to have stabilized without intervention; however, we can't rule out that they are the same general type of problem. So I think, as with T371879, we should probably plan a switch outage here to allow us to power cycle and upgrade it.
TCP Timeouts
There are also strange logs showing constantly on this switch, which I don't see on, for instance, cloudsw1-d5-eqiad (although that one is on a more recent JunOS):
Sep 2 06:30:06 cloudsw1-c8-eqiad /kernel: tcp_timer_keep: Dropping socket connection due to keepalive timer expiration, idle/intvl/cnt: 1000/1000/5
Those go as far back as the logs do, however, so I am not convinced they are related. From what I can tell they relate to internal connections the switch is trying to make to itself, which never get a response, but I can't find which service is making the connections. Could be a red herring.
cmooney@cloudsw1-c8-eqiad> show system connections | match 6997
tcp4 0 0 128.0.0.16.54563 128.0.0.1.6997 SYN_SENT
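To check whether other internal destinations show the same stuck connections, the full `show system connections` output could be summarized by destination. This is a rough sketch, assuming every TCP row has the six-column layout of the single row quoted above (protocol, receive queue, send queue, local address.port, foreign address.port, state); the names `Conn`, `parse_connection` and `stuck_syn_targets` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Conn(NamedTuple):
    proto: str
    local_addr: str
    local_port: str
    remote_addr: str
    remote_port: str
    state: str


def parse_connection(line: str) -> Optional[Conn]:
    """Parse one TCP row; return None for headers, prompts and other lines."""
    fields = line.split()
    # Assumption: TCP rows have exactly six whitespace-separated columns.
    if len(fields) != 6 or not fields[0].startswith("tcp"):
        return None
    proto, _recvq, _sendq, local, remote, state = fields
    # The port is appended to the address with a dot, so split on the last dot.
    local_addr, _, local_port = local.rpartition(".")
    remote_addr, _, remote_port = remote.rpartition(".")
    return Conn(proto, local_addr, local_port, remote_addr, remote_port, state)


def stuck_syn_targets(lines):
    """Count connections sitting in SYN_SENT per (remote address, remote port)."""
    counts = Counter()
    for line in lines:
        conn = parse_connection(line)
        if conn is not None and conn.state == "SYN_SENT":
            counts[(conn.remote_addr, conn.remote_port)] += 1
    return counts


if __name__ == "__main__":
    row = "tcp4 0 0 128.0.0.16.54563 128.0.0.1.6997 SYN_SENT"
    print(stuck_syn_targets([row]))
```

Running this against captures taken a few minutes apart would show whether the SYN_SENT entries towards port 6997 are being retried continuously, which would line up with the keepalive expiry logs.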
Servers in the rack
Servers in the rack: https://netbox.wikimedia.org/dcim/racks/24/