
cloudsw1-d5-eqiad instability Aug 6 2024
Closed, Resolved (Public)

Description

Since 06:10 this morning cloudsw1-d5-eqiad has been exhibiting control-plane issues, which are causing instability for traffic to hosts connected to that rack, as well as for the cloud networks in eqiad generally.

This was first noticed by @dcaro as occasional packet loss between hosts:

image.png (287×1 px, 37 KB)

Testing with ping it was difficult to replicate; one finding was that we tended to miss a bunch (5-10) pings in a row followed by restoration, but things were otherwise very stable when pinging. Checking further on the network links there are some outbound packet discards, but these have not changed significantly in the past 24 hours, nor are there any interface errors.
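For reference, the same discard/error counters can be eyeballed on the switch itself with the standard Junos interface statistics, something along these lines (the interface name here is illustrative only):

show interfaces ae0 extensive | match "Errors|Drops|Carrier transitions"
show interfaces queue ae0 | match "dropped"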

Control Plane issues

Taking a closer look at the various switches we discovered that BGP sessions were flapping regularly on cloudsw1-d5-eqiad, and that this had been happening since 06:10 UTC this morning. Additionally the ae0/ae1 interfaces were flapping up/down while the physical bundle members remained stable - again suggesting a control-plane issue.

Aug  6 06:10:43  cloudsw1-d5-eqiad /kernel: tcp_timer_keep: Dropping socket connection due to keepalive timer expiration, idle/intvl/cnt: 1000/1000/5
Aug  6 06:10:43  cloudsw1-d5-eqiad /kernel: tcp_timer_keep:Local(0x80000010:54938) Foreign(0x80000001:21787)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.64.147.12 (Internal AS 64710) changed state from Established to Idle (event RecvNotify) (instance master)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 10.64.147.12 (Internal AS 64710): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv: rejecting connection from 172.31.255.7 (External AS 4264710004), peer in state Established
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv:4826: NOTIFICATION sent to 172.31.255.7+52450 (proto): code 6 (Cease) subcode 5 (Connection Rejected)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv: rejecting connection from 10.64.147.5 (External AS 4264710003), peer in state Established
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_pp_recv:4826: NOTIFICATION sent to 10.64.147.5+52607 (proto): code 6 (Cease) subcode 5 (Connection Rejected)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 185.15.56.248 (Internal AS 64710) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 185.15.56.248 (Internal AS 64710): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.64.147.5 (External AS 4264710003) changed state from Established to Idle (event RecvNotify) (instance master)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 10.64.147.5 (External AS 4264710003): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.2.2 (External AS 64605) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.20.2.2 (External AS 64605): code 6 (Cease)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.2.4 (External AS 64605) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.20.2.4 (External AS 64605): code 6 (Cease)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 2620:0:861:fe0e::2 (External AS 4264710003) changed state from Established to Idle (event RecvNotify) (instance master)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 2620:0:861:fe0e::2 (External AS 4264710003): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.7 (External AS 4264710004) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.31.255.7 (External AS 4264710004): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 2620:0:861:fe0b::1 (Internal AS 64710) changed state from Established to Idle (event RecvNotify) (instance master)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 2620:0:861:fe0b::1 (Internal AS 64710): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.5 (External AS 4264710003) changed state from Established to Idle (event RecvNotify) (instance cloud)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 172.31.255.5 (External AS 4264710003): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.64.147.7 (External AS 4264710004) changed state from Established to Idle (event RecvNotify) (instance master)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 10.64.147.7 (External AS 4264710004): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 2620:0:861:fe0f::2 (External AS 4264710004) changed state from Established to Idle (event RecvNotify) (instance master)
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: bgp_handle_notify:4274: NOTIFICATION received from 2620:0:861:fe0f::2 (External AS 4264710004): code 6 (Cease) subcode 9 (Hard Reset)
Aug  6 06:10:55  cloudsw1-d5-eqiad bfdd[2322]: BFDD_STATE_UP_TO_DOWN: BFD Session 2620:0:861:fe0e::2 (IFL 622) state Up -> Down LD/RD(41/33) Up time:7w6d 14:35 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
Aug  6 06:10:55  cloudsw1-d5-eqiad /kernel: lag_bundlestate_ifd_change: bundle ae1: bundle IFD minimum bandwidth or minimum links not met, Bandwidth (Current : Required) 0 : 40000000000 Number of links (Current : Required) 0 : 1 
Aug  6 06:10:55  cloudsw1-d5-eqiad bfdd[2322]: BFDD_TRAP_SHOP_STATE_DOWN: local discriminator: 41, new state: down, interface: irb.1110, peer addr: 2620:0:861:fe0e::2
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: STP handler: IFD=ae1, op=change, state=Discarding, Topo change generation=0
Aug  6 06:10:55  cloudsw1-d5-eqiad /kernel: lag_bundlestate_ifd_change: bundle ae0: bundle IFD minimum bandwidth or minimum links not met, Bandwidth (Current : Required) 0 : 40000000000 Number of links (Current : Required) 0 : 1 
Aug  6 06:10:55  cloudsw1-d5-eqiad bfdd[2322]: BFDD_STATE_UP_TO_DOWN: BFD Session 10.64.147.12 (IFL 621) state Up -> Down LD/RD(74/68) Up time:2w4d 09:20 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
Aug  6 06:10:55  cloudsw1-d5-eqiad rpd[2317]: *STP Change*, notify to other modules

CPU graphs for the switch don't show any issue or significant change:

image.png (635×1 px, 244 KB)

Looking at both the BFD flaps and the AE flaps, they were happening approximately every 5 minutes. In terms of CPU usage most things looked normal: fxpc was always around 30-40%, but that seems usual, and no other processes were hogging the CPU.
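For completeness, a quick spot-check of routing-engine and per-process CPU on the box can be done with the usual operational commands, e.g.:

show chassis routing-engine | match "Idle|Load"
show system processes extensive | match "PID|fxpc"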

To try to improve stability I disabled BFD on the IBGP group, and increased the BFD timers from 1 to 5 seconds for the other groups. I also changed the LACP keepalive mode from 'fast' to 'slow'. Both of these changes seem to have improved the situation: we no longer see flapping BFD sessions or interfaces, which has kept BGP stable (a rough sketch of the equivalent config changes is included after the output below):

cmooney@cloudsw1-d5-eqiad> show bgp summary | match "^[0-9]"    
Aug 06 10:00:29
10.64.147.5      4264710003        133        217       0      33       59:04 Establ
10.64.147.7      4264710004        166        266       0      32     1:14:16 Establ
10.64.147.12          64710        159        158       0      54       29:02 Establ
10.64.147.14          14907     986487     903106       0       0 41w3d 18:32:51 Establ
172.20.2.2            64605        170        158       0      42     1:12:43 Establ
172.20.2.4            64605        179        160       0      43     1:13:25 Establ
172.31.255.5     4264710003        133        172       0      36       59:00 Establ
172.31.255.7     4264710004        165        226       0      33     1:14:12 Establ
185.15.56.248         64710         69         68       0      54       28:34 Establ
208.80.154.212        14907     827000     869050       0       1 39w6d 21:21:47 Establ
2620:0:861:fe0a::1       14907     889003     903017       0       0 41w3d 18:33:02 Establ
2620:0:861:fe0b::1       64710        146        146       0      53       28:56 Establ
2620:0:861:fe0e::2  4264710003        133        208       0      33       58:53 Establ
2620:0:861:fe0f::2  4264710004        165        261       0      32     1:14:08 Establ
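For the record, the workarounds above amount to configuration changes roughly along these lines - the BGP group names here are placeholders rather than the exact ones configured on the device, and the BFD minimum-interval is expressed in milliseconds:

# drop BFD from the IBGP sessions entirely (hypothetical group name)
delete protocols bgp group IBGP bfd-liveness-detection
# relax BFD from 1s to 5s on the remaining groups (hypothetical group name)
set protocols bgp group CLOUD-EBGP bfd-liveness-detection minimum-interval 5000
# switch LACP keepalives on the aggregated links from fast (1s) to slow (30s)
set interfaces ae0 aggregated-ether-options lacp periodic slow
set interfaces ae1 aggregated-ether-options lacp periodic slow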

I still think this switch is "sick", though there is no specific smoking gun in terms of a particular process. Ultimately I think we need to schedule a reboot here, which should hopefully return it to normal operation and allow us to remove the workarounds we added. We can also upgrade JunOS on the device when we reboot (see T316544), which is probably worth doing if we are going to have to endure the downtime anyway.
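The eventual reboot/upgrade would follow the standard Junos procedure, roughly the following, where the image filename is just a placeholder (see T316544 for the actual upgrade plan):

request system software add /var/tmp/<junos-install-image>.tgz
request system reboot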

Event Timeline

cmooney triaged this task as High priority. Aug 6 2024, 10:03 AM
cmooney created this task.

Just to update on the situation: things remain stable since the changes earlier on.

cmooney@cloudsw1-d5-eqiad> show bgp summary | match "^[0-9]"    
10.64.147.5      4264710003        912        986       0      33     6:55:09 Establ
10.64.147.7      4264710004        946       1035       0      32     7:10:21 Establ
10.64.147.12          64710        937        937       0      54     6:25:07 Establ
10.64.147.14          14907     987227     903873       0       0 41w4d 0:28:56 Establ
172.20.2.2            64605        983        926       0      42     7:08:48 Establ
172.20.2.4            64605        990        928       0      43     7:09:30 Establ
172.31.255.5     4264710003        912        939       0      36     6:55:05 Establ
172.31.255.7     4264710004        946        995       0      33     7:10:17 Establ
185.15.56.248         64710        841        836       0      54     6:24:39 Establ
208.80.154.212        14907     827730     869817       0       1 40w0d 3:17:52 Establ
2620:0:861:fe0a::1       14907     889734     903784       0       0 41w4d 0:29:07 Establ
2620:0:861:fe0b::1       64710        914        914       0      53     6:25:01 Establ
2620:0:861:fe0e::2  4264710003        912        977       0      33     6:54:58 Establ
2620:0:861:fe0f::2  4264710004        945       1030       0      32     7:10:13 Establ

We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 today.

Shortly afterwards it resolved, but something nearly maxed out the 40G link between them.
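For context, the instantaneous rate on the link can also be spot-checked from the switch CLI with something like the following (interface name illustrative):

show interfaces ae0 | match "rate"
monitor interface ae0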

That's to be expected when moving lots of data around, we can try to be smarter and/or limit the backfill throughput (and/or add QoS, coming soon!), but it's ok to use the link as much as we can imo (as long as it behaves).

@Andrew pointed to https://grafana.wikimedia.org/goto/6Ay4ilrIg?orgId=1 (that someone had given him); that's the aggregated network traffic for the whole cluster. You can see sudden peaks when a new osd comes online, as the cluster shuffles data around the whole cluster (from many to many nodes), then it slows down gently once the data to move is only going to the new member (many to one).

image.png (837×3 px, 213 KB)

> We got paged at 20:19 UTC for "primary outbound port utilisation over 80%" on both cloudsw1-d5 and cloudsw1-f4 today.

> That's to be expected when moving lots of data around, we can try to be smarter and/or limit the backfill throughput (and/or add QoS, coming soon!), but it's ok to use the link as much as we can imo (as long as it behaves).

Yeah, it is to be expected unfortunately. The paging is not great though; when we add the QoS stuff we can look at changing the rules to be more lenient about traffic in the scavenger class. We're planning to add that for the cloud ceph nodes (see T371501) but unfortunately it's not in place right now.

Closing this one; things have been ok since the upgrade/reset.