Since 06:10 this morning cloudsw1-d5-eqiad has been exhibiting control-plane issues, causing instability for traffic to hosts connected to that rack, as well as for the cloud networks in eqiad more generally.
This was first noticed by @dcaro as occasional packet loss between hosts.
Testing with ping it was difficult to replicate. One finding was that we tended to miss a burst of pings (5-10) in a row followed by restoration, but otherwise things were very stable when pinging. Checking the network links further, there are some outbound packet discards, but these have not changed significantly in the past 24 hours, nor are there any interface errors.
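For anyone retracing this, the checks were roughly of the following form; the host name and ping parameters here are illustrative, not the exact commands run:

```
# From an affected host: a long-running ping to catch the intermittent bursts of loss
$ ping -i 0.2 -c 2000 <other-cloud-host>

# On the switch: check the uplink counters for errors/discards
cmooney@cloudsw1-d5-eqiad> show interfaces extensive ae0 | match "rror|iscard"
```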
Control Plane issues
Taking a closer look at the various switches, we discovered that BGP sessions were regularly flapping on cloudsw1-d5-eqiad, and seemingly had been since 06:10 UTC this morning. Additionally, the ae0/ae1 aggregate interfaces were flapping up/down while their physical member links remained stable, again suggesting a control-plane rather than a physical issue. The switch logs from the start of the problem show the pattern:
```
Aug 6 06:10:43 cloudsw1-d5-eqiad /kernel: tcp_timer_keep: Dropping socket connection due to keepalive timer expiration, idle/intvl/cnt: 1000/1000/5
Aug 6 06:10:43 cloudsw1-d5-eqiad /kernel: tcp_timer_keep:Local(0x80000010:54938) Foreign(0x80000001:21787)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.64.147.12 (Internal AS 64710) changed state from Established to Idle (event RecvNotify) (instance master)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 10.64.147.12 (Internal AS 64710): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv: rejecting connection from 172.31.255.7 (External AS 4264710004), peer in state Established
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv:4826: NOTIFICATION sent to 172.31.255.7+52450 (proto): code 6 (Cease) subcode 5 (Connection Rejected)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv: rejecting connection from 10.64.147.5 (External AS 4264710003), peer in state Established
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv:4826: NOTIFICATION sent to 10.64.147.5+52607 (proto): code 6 (Cease) subcode 5 (Connection Rejected)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 185.15.56.248 (Internal AS 64710) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 185.15.56.248 (Internal AS 64710): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.64.147.5 (External AS 4264710003) changed state from Established to Idle (event RecvNotify) (instance master)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 10.64.147.5 (External AS 4264710003): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.2.2 (External AS 64605) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.20.2.2 (External AS 64605): code 6 (Cease)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.2.4 (External AS 64605) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.20.2.4 (External AS 64605): code 6 (Cease)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 2620:0:861:fe0e::2 (External AS 4264710003) changed state from Established to Idle (event RecvNotify) (instance master)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 2620:0:861:fe0e::2 (External AS 4264710003): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.7 (External AS 4264710004) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.31.255.7 (External AS 4264710004): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 2620:0:861:fe0b::1 (Internal AS 64710) changed state from Established to Idle (event RecvNotify) (instance master)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 2620:0:861:fe0b::1 (Internal AS 64710): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.5 (External AS 4264710003) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.31.255.5 (External AS 4264710003): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.64.147.7 (External AS 4264710004) changed state from Established to Idle (event RecvNotify) (instance master)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 10.64.147.7 (External AS 4264710004): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 2620:0:861:fe0f::2 (External AS 4264710004) changed state from Established to Idle (event RecvNotify) (instance master)
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 2620:0:861:fe0f::2 (External AS 4264710004): code 6 (Cease) subcode 9 (Hard Reset)
Aug 6 06:10:55 cloudsw1-d5-eqiad bfdd[2322]: BFDD_STATE_UP_TO_DOWN: BFD Session 2620:0:861:fe0e::2 (IFL 622) state Up -> Down LD/RD(41/33) Up time:7w6d 14:35 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
Aug 6 06:10:55 cloudsw1-d5-eqiad /kernel: lag_bundlestate_ifd_change: bundle ae1: bundle IFD minimum bandwidth or minimum links not met, Bandwidth (Current : Required) 0 : 40000000000 Number of links (Current : Required) 0 : 1
Aug 6 06:10:55 cloudsw1-d5-eqiad bfdd[2322]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 41, new state: down, interface: irb.1110, peer addr: 2620:0:861:fe0e::2
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: STP handler: IFD=ae1, op=change, state=Discarding, Topo change generation=0
Aug 6 06:10:55 cloudsw1-d5-eqiad /kernel: lag_bundlestate_ifd_change: bundle ae0: bundle IFD minimum bandwidth or minimum links not met, Bandwidth (Current : Required) 0 : 40000000000 Number of links (Current : Required) 0 : 1
Aug 6 06:10:55 cloudsw1-d5-eqiad bfdd[2322]: BFDD_STATE_UP_TO_DOWN: BFD Session 10.64.147.12 (IFL 621) state Up -> Down LD/RD(74/68) Up time:2w4d 09:20 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
Aug 6 06:10:55 cloudsw1-d5-eqiad rpd[2317]: *STP Change*, notify to other modules
```
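From the CLI the same pattern was visible: the ae bundles and the BFD sessions were bouncing while the physical member links stayed up. A sketch of the kind of commands that show this (annotations added here for context; not verbatim session output):

```
cmooney@cloudsw1-d5-eqiad> show lacp interfaces ae0   # physical members still Collecting/Distributing
cmooney@cloudsw1-d5-eqiad> show interfaces ae0 terse  # while the ae0/ae1 bundles themselves flapped
cmooney@cloudsw1-d5-eqiad> show bfd session           # BFD sessions cycling Up -> Down
cmooney@cloudsw1-d5-eqiad> show bgp summary           # flap counters incrementing across peers
```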
CPU graphs for the switch don't show any issue or significant change.
Looking at both the BFD flaps and the AE flaps, they were happening approximately every 5 minutes. In terms of CPU usage most things looked normal: fxpc was consistently around 30-40%, but that seems to be usual, and no other processes were hogging the CPU.
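For anyone wanting to recheck this on-box rather than via the graphs, roughly these commands cover it (output omitted):

```
cmooney@cloudsw1-d5-eqiad> show chassis routing-engine                   # overall RE CPU and memory
cmooney@cloudsw1-d5-eqiad> show system processes extensive | match fxpc  # per-process CPU (fxpc ~30-40%)
```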
To try to improve stability I disabled BFD on the IBGP group and increased the BFD timers from 1 second to 5 seconds for the other groups. I also changed the LACP keepalive mode from 'fast' to 'slow'. Both of these seem to have improved the situation: we no longer see flapping BFD sessions or interfaces, which has kept BGP stable:
```
cmooney@cloudsw1-d5-eqiad> show bgp summary | match "^[0-9]"
Aug 06 10:00:29
10.64.147.5          4264710003     133     217       0      33          59:04 Establ
10.64.147.7          4264710004     166     266       0      32        1:14:16 Establ
10.64.147.12              64710     159     158       0      54          29:02 Establ
10.64.147.14              14907  986487  903106       0       0 41w3d 18:32:51 Establ
172.20.2.2                64605     170     158       0      42        1:12:43 Establ
172.20.2.4                64605     179     160       0      43        1:13:25 Establ
172.31.255.5         4264710003     133     172       0      36          59:00 Establ
172.31.255.7         4264710004     165     226       0      33        1:14:12 Establ
185.15.56.248             64710      69      68       0      54          28:34 Establ
208.80.154.212            14907  827000  869050       0       1 39w6d 21:21:47 Establ
2620:0:861:fe0a::1        14907  889003  903017       0       0 41w3d 18:33:02 Establ
2620:0:861:fe0b::1        64710     146     146       0      53          28:56 Establ
2620:0:861:fe0e::2   4264710003     133     208       0      33          58:53 Establ
2620:0:861:fe0f::2   4264710004     165     261       0      32        1:14:08 Establ
```
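For the record, the workaround amounted to config changes along these lines; the BGP group names below are placeholders rather than the exact group names on the switch, so treat this as a sketch of the change rather than the committed diff:

```
# Disable BFD for the IBGP group (placeholder group name)
deactivate protocols bgp group IBGP bfd-liveness-detection

# Relax BFD on the remaining groups from 1s to 5s (placeholder group name)
set protocols bgp group EBGP-EXAMPLE bfd-liveness-detection minimum-interval 5000

# Slow down LACP keepalives on the uplink bundles (1s "fast" -> 30s "slow")
set interfaces ae0 aggregated-ether-options lacp periodic slow
set interfaces ae1 aggregated-ether-options lacp periodic slow
```

The idea in both cases is the same: make the protocol keepalives tolerant enough that the sluggish control plane can keep up until the box is rebooted.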
I still think this switch is "sick", though there is no specific smoking gun in terms of a particular process. Ultimately I think we need to schedule a reboot, which should hopefully return it to normal operation and allow us to remove the workarounds we added. We can also upgrade JunOS on the device when we reboot (see T316544), which is probably worth doing if we are going to have to endure the downtime anyway.
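When we do get the window, the maintenance would look roughly like the below; the package path/version is a placeholder (the target release is discussed in T316544), and the group names are again illustrative:

```
# Upgrade and reboot in one step (placeholder package path)
cmooney@cloudsw1-d5-eqiad> request system software add /var/tmp/<junos-package>.tgz reboot

# Afterwards, revert the workarounds (placeholder group names)
activate protocols bgp group IBGP bfd-liveness-detection
set protocols bgp group EBGP-EXAMPLE bfd-liveness-detection minimum-interval 1000
set interfaces ae0 aggregated-ether-options lacp periodic fast
set interfaces ae1 aggregated-ether-options lacp periodic fast
```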