
cloudsw1-c8-eqiad is unstable
Closed, Resolved · Public

Description

The switch cloudsw1-c8-eqiad seems to be misbehaving.

Specifically, starting at 02:18 UTC on Sep 4th we observed a period of significant instability: BFD sessions were flapping, which in turn destabilized BGP and affected traffic routing between racks. This appeared to stop at about 06:10 UTC, after which BGP and BFD have been stable:

Sep  4 02:18:13  cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.3 (External AS 4264710004) changed state from Established to Idle (event RecvNotify)
Sep  4 06:09:35  cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.1.5 (External AS 64605) changed state from OpenConfirm to Established (event RecvKeepAlive)
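
For reference, the same flap history can be pulled from the switch itself with standard JunOS operational commands (prompt shown only for illustration, matching the snippet further down):

cmooney@cloudsw1-c8-eqiad> show bgp summary
cmooney@cloudsw1-c8-eqiad> show log messages | match RPD_BGP_NEIGHBOR_STATE_CHANGED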

BFD timeout logs were present throughout the window and may be what triggered the BGP sessions to go down:

Sep  4 02:18:14  cloudsw1-c8-eqiad bfdd[2323]: BFDD_STATE_UP_TO_DOWN: BFD Session 172.31.255.1 (IFL 634) state Up -> Down LD/RD(36/27) Up time:6w5d 05:28 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
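
The corresponding BFD side can be checked live with standard operational commands, nothing specific to this switch:

cmooney@cloudsw1-c8-eqiad> show bfd session extensive
cmooney@cloudsw1-c8-eqiad> show log messages | match BFDD_STATE_UP_TO_DOWN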

Looking at the CPU of the device we can see that it was spiking during most of this period. This may just be increased load from BGP sessions being re-established all the time (caused by BFD failing at a lower level).

image.png (240 KB) — CPU utilization graph for the affected period
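
The same CPU data is visible on-box via the routing-engine and per-process stats (standard commands; rpd is singled out here only because the churn above is BGP-related):

cmooney@cloudsw1-c8-eqiad> show chassis routing-engine
cmooney@cloudsw1-c8-eqiad> show system processes extensive | match rpd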

Overall the symptoms are different, but this is not unlike the incident caused by cloudsw1-d5-codfw on August 6th (T371879), where BFD sessions suddenly started failing. Unlike that incident, the situation here appears to have stabilized without intervention; however, we can't rule out that they are the same general type of problem. So, as with T371879, I think we should plan a switch outage here to allow us to power cycle and upgrade it.
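
For planning, the reboot/upgrade itself would follow the usual JunOS procedure, roughly as sketched below; the image filename is a placeholder, not the actual target release:

cmooney@cloudsw1-c8-eqiad> request system storage cleanup
cmooney@cloudsw1-c8-eqiad> request system snapshot
cmooney@cloudsw1-c8-eqiad> request system software add /var/tmp/<junos-image>.tgz
cmooney@cloudsw1-c8-eqiad> request system reboot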

TCP Timeouts

There are also strange logs appearing constantly on this switch, which I don't see on, for instance, cloudsw1-d5-eqiad (although that one is on a more recent JunOS):

Sep  2 06:30:06  cloudsw1-c8-eqiad /kernel: tcp_timer_keep: Dropping socket connection due to keepalive timer expiration, idle/intvl/cnt: 1000/1000/5

Those have been ongoing as far back as the logs go, however, so I am not convinced they are related. From what I can tell they correspond to internal connections the switch is trying to make to itself, which never get a response, but I can't find which service is trying to do this or making the connections. Could be a red herring.

cmooney@cloudsw1-c8-eqiad> show system connections | match 6997   
tcp4       0      0  128.0.0.16.54563                              128.0.0.1.6997                                SYN_SENT
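
One way to try to pin down which daemon owns that SYN_SENT socket, assuming shell access on the switch (JunOS runs on a FreeBSD base, so sockstat is available there):

cmooney@cloudsw1-c8-eqiad> start shell
% sockstat -4 | grep 6997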

Servers in the rack

Servers in the rack: https://netbox.wikimedia.org/dcim/racks/24/

Event Timeline

aborrero added a project: User-aborrero.

Related: we have to install one of the mons that is outside of C8 to be able to drain the rack; see T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements.
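
Before and after that mon swap, quorum and placement can be sanity-checked with the stock ceph CLI (nothing cookbook-specific), e.g.:

ceph mon stat
ceph quorum_status --format json-pretty
ceph -s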

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-04T16:35:30Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-04T19:35:48Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T12:47:37Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T12:51:36Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T16:48:51Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T17:32:07Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T21:32:31Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T07:27:05Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T11:13:15Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T13:46:04Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T17:58:37Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T18:17:37Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T23:18:16Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T09:15:39Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T15:04:20Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T15:05:49Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T21:32:08Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-10T08:45:29Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-10T08:46:36Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Icinga downtime and Alertmanager silence (ID=6ca197a8-f8fb-4fad-8834-f4d89337e282) set by cmooney@cumin1002 for 1:30:00 on 3 host(s) and their services with reason: Reboot cloudsw1-c8-eqiad and upgrade JunOS

cloudsw1-c8-eqiad,cloudsw1-c8-eqiad IPv6,cloudsw1-c8-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=b046581a-9da3-41c8-8d93-e0eeee44732e) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason: reboot cloudsw1-c8-eqiad

cr1-eqiad

Icinga downtime and Alertmanager silence (ID=5b4f18de-e6fc-4fd9-ace2-e29e419c51b0) set by cmooney@cumin1002 for 1:30:00 on 24 host(s) and their services with reason: reboot cloudsw1-c8-eqiad

cloudbackup1003.eqiad.wmnet,cloudcephmon[1003-1004].eqiad.wmnet,cloudcephosd[1006-1009,1016-1018,1021-1022,1035].eqiad.wmnet,cloudcontrol1005.eqiad.wmnet,cloudgw1001.eqiad.wmnet,cloudlb1001.eqiad.wmnet,cloudnet1005.eqiad.wmnet,cloudrabbit1001.eqiad.wmnet,cloudservices1006.eqiad.wmnet,cloudvirt[1031-1035].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=dda4c44a-52fb-46f0-8e02-d2d4b5a2ee3e) set by cmooney@cumin1002 for 0:30:00 on 24 host(s) and their services with reason: reboot cloudsw1-c8-eqiad

cassandra-dev2003.codfw.wmnet,db[2213-2214].codfw.wmnet,es2023.codfw.wmnet,ganeti2016.codfw.wmnet,kubernetes[2022,2046-2047,2056].codfw.wmnet,maps2008.codfw.wmnet,mw[2278-2279,2366-2376].codfw.wmnet,pc2016.codfw.wmnet
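
For reference, downtimes like the above are normally set with the sre.hosts.downtime cookbook from a cumin host; a rough invocation (exact flag names may differ slightly, check the cookbook's --help) would be:

sudo cookbook sre.hosts.downtime --hours 1 --minutes 30 -r "reboot cloudsw1-c8-eqiad" -t T373986 'cloudsw1-c8-eqiad*'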

The switch upgrade/reboot completed successfully earlier today, which will hopefully mean we don't see a repeat of this incident. All protocols are established and looking stable, but we'll keep an eye on it.
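
The kind of post-reboot check this boils down to is confirming the new JunOS version, no chassis alarms, and that all sessions came back, e.g.:

cmooney@cloudsw1-c8-eqiad> show version
cmooney@cloudsw1-c8-eqiad> show chassis alarms
cmooney@cloudsw1-c8-eqiad> show bfd session summary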

Things seem stable now, so I will close the task.