
cloudsw1-c8-eqiad is unstable
Closed, Resolved · Public

Description

The switch cloudsw1-c8-eqiad seems to be misbehaving.

Specifically, starting at 02:18 UTC on Sep 4th we observed a period of significant instability: BFD sessions were flapping, which in turn destabilized BGP and affected traffic routing between racks. This appeared to stop at about 06:10 UTC, after which BGP and BFD have been stable:

Sep  4 02:18:13  cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.31.255.3 (External AS 4264710004) changed state from Established to Idle (event RecvNotify)
Sep  4 06:09:35  cloudsw1-c8-eqiad rpd[2318]: RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 172.20.1.5 (External AS 64605) changed state from OpenConfirm to Established (event RecvKeepAlive)
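
For reference, the same flap history can be pulled from the switch itself with standard JunOS operational commands (prompt shown only for illustration, matching the snippet further down):

cmooney@cloudsw1-c8-eqiad> show bgp summary
cmooney@cloudsw1-c8-eqiad> show log messages | match RPD_BGP_NEIGHBOR_STATE_CHANGED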

BFD timeout logs were present throughout the window and may be what triggered the BGP sessions to go down:

Sep  4 02:18:14  cloudsw1-c8-eqiad bfdd[2323]: BFDD_STATE_UP_TO_DOWN: BFD Session 172.31.255.1 (IFL 634) state Up -> Down LD/RD(36/27) Up time:6w5d 05:28 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
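
The corresponding BFD side can be checked live with standard operational commands, nothing specific to this switch:

cmooney@cloudsw1-c8-eqiad> show bfd session extensive
cmooney@cloudsw1-c8-eqiad> show log messages | match BFDD_STATE_UP_TO_DOWN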

Looking at the CPU of the device we can see that it was spiking during most of this period. This may just be increased load from BGP sessions being re-established all the time (caused by BFD failing at a lower level).

image.png (240 KB) — CPU utilization graph for the affected period
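
The same CPU data is visible on-box via the routing-engine and per-process stats (standard commands; rpd is singled out here only because the churn above is BGP-related):

cmooney@cloudsw1-c8-eqiad> show chassis routing-engine
cmooney@cloudsw1-c8-eqiad> show system processes extensive | match rpd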

Overall the symptoms are different, but this is not unlike the incident caused by cloudsw1-d5-codfw on August 6th (T371879), where BFD sessions suddenly started failing. Unlike that incident, the situation here appears to have stabilized without intervention; however, we can't rule out that they are the same general type of problem. So, as with T371879, I think we should plan a switch outage here to allow us to power cycle and upgrade it.
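
For planning, the reboot/upgrade itself would follow the usual JunOS procedure, roughly as sketched below; the image filename is a placeholder, not the actual target release:

cmooney@cloudsw1-c8-eqiad> request system storage cleanup
cmooney@cloudsw1-c8-eqiad> request system snapshot
cmooney@cloudsw1-c8-eqiad> request system software add /var/tmp/<junos-image>.tgz
cmooney@cloudsw1-c8-eqiad> request system reboot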

TCP Timeouts

There are also strange logs appearing constantly on this switch, which I don't see on, for instance, cloudsw1-d5-eqiad (although that one is on a more recent JunOS):

Sep  2 06:30:06  cloudsw1-c8-eqiad /kernel: tcp_timer_keep: Dropping socket connection due to keepalive timer expiration, idle/intvl/cnt: 1000/1000/5

Those have been ongoing as far back as the logs go, however, so I am not convinced they are related. From what I can tell they correspond to internal connections the switch is trying to make to itself, which never get a response, but I can't find which service is trying to do this or making the connections. Could be a red herring.

cmooney@cloudsw1-c8-eqiad> show system connections | match 6997   
tcp4       0      0  128.0.0.16.54563                              128.0.0.1.6997                                SYN_SENT
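
One way to try to pin down which daemon owns that SYN_SENT socket, assuming shell access on the switch (JunOS runs on a FreeBSD base, so sockstat is available there):

cmooney@cloudsw1-c8-eqiad> start shell
% sockstat -4 | grep 6997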

Servers in the rack

Servers in the rack: https://netbox.wikimedia.org/dcim/racks/24/

Event Timeline

aborrero added a project: User-aborrero.

Related: we have to install one of the mons that is outside of C8 to be able to drain the rack; see T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements.
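
Before and after that mon swap, quorum and placement can be sanity-checked with the stock ceph CLI (nothing cookbook-specific), e.g.:

ceph mon stat
ceph quorum_status --format json-pretty
ceph -s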

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-04T16:35:30Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-04T19:35:48Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T12:47:37Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T12:51:36Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T16:48:51Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T17:32:07Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-05T21:32:31Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T07:27:05Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T11:13:15Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T13:46:04Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T17:58:37Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T18:17:37Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-06T23:18:16Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T09:15:39Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T15:04:20Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T15:05:49Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-09T21:32:08Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-10T08:45:29Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T373986)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-10T08:46:36Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T373986)

Icinga downtime and Alertmanager silence (ID=6ca197a8-f8fb-4fad-8834-f4d89337e282) set by cmooney@cumin1002 for 1:30:00 on 3 host(s) and their services with reason: Reboot cloudsw1-c8-eqiad and upgrade JunOS

cloudsw1-c8-eqiad,cloudsw1-c8-eqiad IPv6,cloudsw1-c8-eqiad.mgmt

Icinga downtime and Alertmanager silence (ID=b046581a-9da3-41c8-8d93-e0eeee44732e) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their services with reason: reboot cloudsw1-c8-eqiad

cr1-eqiad

Icinga downtime and Alertmanager silence (ID=5b4f18de-e6fc-4fd9-ace2-e29e419c51b0) set by cmooney@cumin1002 for 1:30:00 on 24 host(s) and their services with reason: reboot cloudsw1-c8-eqiad

cloudbackup1003.eqiad.wmnet,cloudcephmon[1003-1004].eqiad.wmnet,cloudcephosd[1006-1009,1016-1018,1021-1022,1035].eqiad.wmnet,cloudcontrol1005.eqiad.wmnet,cloudgw1001.eqiad.wmnet,cloudlb1001.eqiad.wmnet,cloudnet1005.eqiad.wmnet,cloudrabbit1001.eqiad.wmnet,cloudservices1006.eqiad.wmnet,cloudvirt[1031-1035].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=dda4c44a-52fb-46f0-8e02-d2d4b5a2ee3e) set by cmooney@cumin1002 for 0:30:00 on 24 host(s) and their services with reason: reboot cloudsw1-c8-eqiad

cassandra-dev2003.codfw.wmnet,db[2213-2214].codfw.wmnet,es2023.codfw.wmnet,ganeti2016.codfw.wmnet,kubernetes[2022,2046-2047,2056].codfw.wmnet,maps2008.codfw.wmnet,mw[2278-2279,2366-2376].codfw.wmnet,pc2016.codfw.wmnet
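
For reference, downtimes like the above are normally set with the sre.hosts.downtime cookbook from a cumin host; a rough invocation (exact flag names may differ slightly, check the cookbook's --help) would be:

sudo cookbook sre.hosts.downtime --hours 1 --minutes 30 -r "reboot cloudsw1-c8-eqiad" -t T373986 'cloudsw1-c8-eqiad*'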

The switch upgrade/reboot completed successfully earlier today, which will hopefully mean we don't see a repeat of this incident. All protocols are established and looking stable, but we'll keep an eye on it.
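
The kind of post-reboot check this boils down to is confirming the new JunOS version, no chassis alarms, and that all sessions came back, e.g.:

cmooney@cloudsw1-c8-eqiad> show version
cmooney@cloudsw1-c8-eqiad> show chassis alarms
cmooney@cloudsw1-c8-eqiad> show bfd session summary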

Things seem stable now, so I will close the task.