
[network,D5] reboot cloudsw-d5
Closed, ResolvedPublic

Description

The D5 switch started misbehaving, flapping non-physical interfaces (LAG, BGP, ...) and causing short periodic network outages that affect the whole network (from Ceph to cloudgw). See T371879: cloudsw1-d5-eqiad instability Aug 6 2024

In order to continue debugging and hopefully fix it, we have to reboot the switch. That means taking the whole D5 rack down; this task is to decide how and when to do that.

The reboot should take ~15min (best case scenario).

Hosts in that rack (https://netbox.wikimedia.org/dcim/racks/39/):

  • cloudbackup1004 - ok
  • cloudcephmon1002 - ok
  • cloudcephosd1011 - cloudvps (and subprojects) outage
  • cloudcephosd1012 - cloudvps (and subprojects) outage
  • cloudcephosd1013 - cloudvps (and subprojects) outage
  • cloudcephosd1014 - cloudvps (and subprojects) outage
  • cloudcephosd1015 - cloudvps (and subprojects) outage
  • cloudcephosd1019 - cloudvps (and subprojects) outage
  • cloudcephosd1020 - cloudvps (and subprojects) outage
  • cloudcephosd1023 - cloudvps (and subprojects) outage
  • cloudcephosd1024 - cloudvps (and subprojects) outage
  • cloudcephosd1036 - cloudvps (and subprojects) outage
  • cloudcontrol1006
  • cloudcontrol1008-dev - ok (not in use)
  • cloudgw1002 -
  • cloudlb1002 -
  • cloudnet1006 - should be ok (self-HA)
  • cloudservices1005
  • cloudvirt1036 - bound to ceph
  • cloudvirt1037 - bound to ceph
  • cloudvirt1038 - bound to ceph
  • cloudvirt1039 - bound to ceph
  • cloudvirt1040 - bound to ceph
  • cloudvirt1041 - bound to ceph
  • cloudvirt1042 - bound to ceph
  • cloudvirt1043 - bound to ceph
  • cloudvirt1044 - bound to ceph
  • cloudvirt1045 - bound to ceph
  • cloudvirt1046 - bound to ceph
  • cloudvirt1047 - bound to ceph
  • cloudvirtlocal1001

Notes:

  • Ceph will have to go down, as the rack hosts too many OSDs for the rest of the cluster to rebalance around. This means a full outage of VMs/toolforge/quarry/paws/... (see the sketch below).
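As a side note, the standard Ceph pattern for taking OSDs offline briefly without the cluster trying to re-replicate their data is to set the noout/norebalance flags around the downtime window. This is a minimal, generic sketch of that pattern, not necessarily the exact procedure used here (the wmcs cookbooks wrap steps like these):

ceph osd set noout         # don't mark down OSDs "out", so no re-replication starts
ceph osd set norebalance   # don't shuffle data around while OSDs flap
# ... take the rack down, reboot the switch, bring the OSDs back up ...
ceph osd unset norebalance
ceph osd unset noout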

Current plan

  • Cathal gets new cloudcephosd nodes online (T363344)
  • David brings all the new nodes into the cluster (except the one already on D5) - not needed (will not help free up space)
    • 1035 (kinda, one OSD drive is not ok, will need reimage later)
    • 1037 <- in progress
    • 1038
  • David drains as many affected OSD nodes as possible (see the drain sketch after this list)
    • cloudcephosd1011
    • cloudcephosd1012
    • cloudcephosd1013
    • cloudcephosd1014
    • cloudcephosd1015
    • cloudcephosd1019
    • cloudcephosd1020
    • cloudcephosd1023
    • cloudcephosd1024
  • Andrew depools all affected cloudvirts, drains toolforge nodes
  • Andrew drains all affected cloudvirts (except for cloudvirtlocal1001)
  • Do the reboot/upgrade when John is standing by with a spare switch
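For reference, the draining is done with the wmcs.ceph.osd.drain_node cookbook (see the SAL entries below); in plain Ceph terms, draining a node amounts to zeroing the CRUSH weight of its OSDs (or marking them out) and waiting for backfill to finish. A rough sketch, with hypothetical OSD ids:

for osd_id in 100 101 102; do                  # hypothetical OSD ids on the node
    ceph osd crush reweight "osd.${osd_id}" 0  # push all data off this OSD
done
ceph -w   # watch until all PGs are active+clean again before draining the next node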

Event Timeline

dcaro triaged this task as High priority.Aug 6 2024, 9:57 AM

Hi @dcaro, can you please associate one or more active project tags with this task (via the Add Action... → Change Project Tags dropdown)? That will allow others to see the task when looking at project workboards or searching for tasks in certain projects, and to get notified about it when watching a related project tag. Thanks!

:facepalm: sure!

Current plan:

  • Cathal gets new cloudcephosd nodes online (T363344)
  • David drains as many affected OSD nodes as possible
  • Andrew depools all affected cloudvirts
  • Andrew drains toolforge nodes off affected cloudvirts
  • Andrew drains the rest of the cloudvirts if there's still space
  • (Then we see how close to all the way drained we can get)
  • Do the reboot/upgrade when John is standing by with a spare switch

> Current plan:

Thanks! Moved it to the task description :)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-06T15:43:23Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-06T15:44:04Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

It's probably best to manually flip the HA on the cloudgw/cloudnet nodes to the ones in rack C8 before we start. I just checked and the two nodes in rack D5 (cloudnet1006 and cloudgw1002) are - of course - the current live ones.

It seems this is non-trivial to do for keepalived or in neutron, so we might just have to accept a ~3 second interruption :(

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-06T19:51:41Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-06T21:22:20Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T01:18:29Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T03:08:55Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T06:39:20Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T08:11:10Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.bootstrap_and_add (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-07T08:26:47Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) (T371878)

I think adding more storage to F4 will not help with the cluster total usage, as the limiting rack now is E4:

root@cloudcephosd1010:~# ceph osd tree | grep -i rack
-83         151.97609      rack C8                                             
-81          83.84979      rack D5                                             
-77         111.79874      rack E4                                             
-79         125.76807      rack F4

The cluster's effective usage is the maximum usage across the availability zones (racks, right now), so the limiting zone is E4 (the middle column there is the aggregated weight, i.e. the total space, used + available, in TB).

This means that we can start draining D5 without adding 1038, as adding it would not help reduce the % usage.
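A quick way to read the per-rack utilisation directly (rather than deriving it from weights) is ceph osd df tree, which adds used/available/%USE columns per CRUSH bucket; the exact columns vary a bit by Ceph release:

root@cloudcephosd1010:~# ceph osd df tree | grep -i rack   # %USE column gives per-rack utilisation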

The other thread to pull is cleaning up space manually; I'll focus on that then.

It seems we have a bunch of stuff in the 'ceph trash' for the cinder pool (not the compute one). I'll look a bit more, but it seems promising:

root@cloudcephosd1037:~# rbd pool stats eqiad1-cinder
Total Images: 330 (269 in trash)
Total Snapshots: 364 (302 in trash)
Provisioned Size: 81 TiB (12 TiB in trash)

this sounds familiar....

root@cloudcephosd1037:~# rbd trash purge eqiad1-cinder
2024-08-08 08:30:22.044 7f5df9771700 -1 librbd::image::PreRemoveRequest: 0x7f5de8001a40 check_image_snaps: image has snapshots - not removing
Removing images: 0% complete...failed.

All of them are expired, so it would be nice to clear them out. Looking into the snapshots:

root@cloudcephosd1037:~# rbd trash ls  eqiad1-cinder -l
ID             NAME                                        SOURCE DELETED_AT               STATUS                              PARENT                                                                                                  
00bb3f7a6fe590 volume-7d3b836d-892a-40f6-a1d0-ce3dd2c9224d USER   Wed May 29 14:13:10 2024 expired at Wed May 29 14:13:10 2024                                                                                                         
01909a378965f4 volume-287c614a-1a17-4cd9-beff-86a4e4fe0f2a USER   Thu May 30 15:35:53 2024 expired at Thu May 30 15:35:53 2024                                                                                                         
01916fb523d891 volume-12c08ac1-f6b9-4212-a8c8-1800fc267b14 USER   Thu May 30 15:35:54 2024 expired at Thu May 30 15:35:54 2024                                                                                                         
01b515d1a4b036 volume-faecb146-d012-4cf6-a1ef-87164269da00 USER   Thu May 30 15:14:26 2024 expired at Thu May 30 15:14:26 2024                                                                                                         
02bd2960d8495b volume-ab5d3caf-fd2b-4ed2-a9a0-214345403a16 USER   Thu May 30 15:14:34 2024 expired at Thu May 30 15:14:34 2024                                                                                                         
0489f25117af8b volume-c4c0636d-f898-4523-b0f9-f6f393522f2e USER   Fri Jun 14 19:09:14 2024 expired at Fri Jun 14 19:09:14 2024         
...

So, in order to delete those and purge the trash, we have to:

root@cloudcephosd1037:~# rbd trash ls eqiad1-cinder  # gives trash_id image_id pairs
fef8d9d8a77ec6 volume-b471f2fd-af23-4fac-ba38-a86f283afa3c
ff6ef4fcd227e3 volume-98c764bf-69ff-4c65-8f00-70cbb8bc4f96
ff8c6dcfb52d57 volume-ac72f5ed-95cf-45cc-bb67-84ce686281a8
ff8c794c3aee2e volume-6faff28f-b0db-46c9-9ca6-e613076c58f2
...

root@cloudcephosd1037:~# rbd trash restore eqiad1-cinder/fe77559ce04331  # that is the trash id, not the image id
root@cloudcephosd1037:~# rbd snap purge eqiad1-cinder/volume-477b09ba-48af-4e4c-a73d-7a76aeeabc35  # this one is the image id
root@cloudcephosd1037:~# rbd rm eqiad1-cinder/volume-477b09ba-48af-4e4c-a73d-7a76aeeabc35  # same image id
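With ~269 images in the trash, looping over the listing saves a lot of typing. A hypothetical (untested) sketch of the same three steps applied to every trash entry; in practice you would want to eyeball the list first:

pool=eqiad1-cinder
rbd trash ls "$pool" | while read -r trash_id image_id; do
    rbd trash restore "${pool}/${trash_id}"   # restore by trash id
    rbd snap purge "${pool}/${image_id}"      # drop its snapshots
    rbd rm "${pool}/${image_id}"              # delete the image for real
done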

The root cause of these expired images that are not getting cleaned up is T358774: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up.

An alternative to your approach in the previous comment is to delete the snapshots while the image is in the trash, as described in T358774#9590874. It still requires a few commands, though, so I'm not sure there's any advantage.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T11:25:07Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.undrain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T11:35:17Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T18:25:17Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T18:25:27Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T18:26:19Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T18:26:23Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T18:27:19Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T18:28:32Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-08T23:05:43Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T00:46:57Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T00:47:10Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T00:47:21Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T05:35:08Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.drain_node (exit_code=0) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T05:36:23Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T11:27:15Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T13:34:15Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T13:35:53Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T13:36:19Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T13:38:28Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.drain_node (exit_code=97) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T13:38:32Z] <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.drain_node (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T18:39:36Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-09T19:16:39Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.ceph.osd.drain_node (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T08:26:39Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.drain_rack (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T08:26:52Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.drain_rack (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T08:27:39Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.drain_rack (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T08:32:48Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.ceph.osd.drain_rack (exit_code=99) (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T08:37:16Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.ceph.osd.drain_rack (T371878)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-12T08:37:21Z] <wmbot~dcaro@urcuchillay> END (ERROR) - Cookbook wmcs.ceph.osd.drain_rack (exit_code=97) (T371878)

Icinga downtime and Alertmanager silence (ID=50666174-cba4-46b9-8fa9-cdf8d3361058) set by cmooney@cumin1002 for 0:40:00 on 7 host(s) and their services with reason: prep JunOS upgrade cloudsw1-d5-eqiad

cloudsw1-c8-eqiad.mgmt,cloudsw1-d5-eqiad,cloudsw1-d5-eqiad IPv6,cloudsw1-d5-eqiad.mgmt,cloudsw1-e4-eqiad.mgmt,cloudsw1-f4-eqiad.mgmt,cr2-eqiad

Icinga downtime and Alertmanager silence (ID=3db725ef-06d9-4ef6-8e5f-eecd4b7c5f0f) set by cmooney@cumin1002 for 0:30:00 on 30 host(s) and their services with reason: JunOS upgrade cloudsw1-d5-eqiad

cloudbackup1004.eqiad.wmnet,cloudcephmon1002.eqiad.wmnet,cloudcephosd[1011-1015,1019-1020,1023-1024,1036].eqiad.wmnet,cloudcontrol1006.eqiad.wmnet,cloudgw1002.eqiad.wmnet,cloudlb1002.eqiad.wmnet,cloudnet1006.eqiad.wmnet,cloudservices1005.eqiad.wmnet,cloudvirt[1036-1047].eqiad.wmnet,cloudvirtlocal1001.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=dc4b0cc4-3c33-4edc-986a-ec197a69be11) set by cmooney@cumin1002 for 0:40:00 on 7 host(s) and their services with reason: prep JunOS upgrade cloudsw1-d5-eqiad

cloudsw1-c8-eqiad.mgmt,cloudsw1-d5-eqiad,cloudsw1-d5-eqiad IPv6,cloudsw1-d5-eqiad.mgmt,cloudsw1-e4-eqiad.mgmt,cloudsw1-f4-eqiad.mgmt,cr2-eqiad

Icinga downtime and Alertmanager silence (ID=fc3e8669-ac2c-40c1-a2bd-cb21a07c546c) set by cmooney@cumin1002 for 0:20:00 on 30 host(s) and their services with reason: JunOS upgrade cloudsw1-d5-eqiad

cloudbackup1004.eqiad.wmnet,cloudcephmon1002.eqiad.wmnet,cloudcephosd[1011-1015,1019-1020,1023-1024,1036].eqiad.wmnet,cloudcontrol1006.eqiad.wmnet,cloudgw1002.eqiad.wmnet,cloudlb1002.eqiad.wmnet,cloudnet1006.eqiad.wmnet,cloudservices1005.eqiad.wmnet,cloudvirt[1036-1047].eqiad.wmnet,cloudvirtlocal1001.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=15f30d47-cb35-4a71-a13e-bd0b11e61af8) set by cmooney@cumin1002 for 6:00:00 on 7 host(s) and their services with reason: prep for replacement of cloudsw1-d5-eqiad

cloudsw1-c8-eqiad.mgmt,cloudsw1-d5-eqiad,cloudsw1-d5-eqiad IPv6,cloudsw1-d5-eqiad.mgmt,cloudsw1-e4-eqiad.mgmt,cloudsw1-f4-eqiad.mgmt,cr2-eqiad

Quick summary:

Cathal upgraded and rebooted the switch on Tuesday the 13th. That did not solve the flapping. VRiley then suggested that we do a physical powerdown and yank the power plug. After /that/, the switch came up and is behaving properly.

After 24 hours it was still working properly, so we are now gradually re-pooling the attached ceph nodes. All is well so far!

Andrew claimed this task.

I've now repooled all affected ceph nodes (and rebuilt cloudcephosd1035) and repooled all cloudvirts. Until the switch flakes out again, this is resolved! thx @cmooney @dcaro @VRiley-WMF

Change #1084104 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Remove wikitech stub cert

https://gerrit.wikimedia.org/r/1084104

Change #1084104 merged by Muehlenhoff:

[labs/private@master] Remove wikitech stub cert

https://gerrit.wikimedia.org/r/1084104