The D5 switch started misbehaving, flapping non-physical interfaces (LAG, BGP, ...) and creating short periodic network outages that affect the whole network (from ceph to cloudgw). See T371879: cloudsw1-d5-eqiad instability Aug 6 2024
In order to continue debugging and hopefully fix it, we have to reboot the switch, which means taking the whole D5 rack down. This task is to decide how and when to do that.
The reboot should take ~15 min (best-case scenario).
Hosts in that rack (https://netbox.wikimedia.org/dcim/racks/39/):
- cloudbackup1004 - ok
- cloudcephmon1002 - ok
- cloudcephosd1011 - cloudvps (and subprojects) outage
- cloudcephosd1012 - cloudvps (and subprojects) outage
- cloudcephosd1013 - cloudvps (and subprojects) outage
- cloudcephosd1014 - cloudvps (and subprojects) outage
- cloudcephosd1015 - cloudvps (and subprojects) outage
- cloudcephosd1019 - cloudvps (and subprojects) outage
- cloudcephosd1020 - cloudvps (and subprojects) outage
- cloudcephosd1023 - cloudvps (and subprojects) outage
- cloudcephosd1024 - cloudvps (and subprojects) outage
- cloudcephosd1036 - cloudvps (and subprojects) outage
- cloudcontrol1006
- cloudcontrol1008-dev - ok (not in use)
- cloudgw1002 -
- cloudlb1002 -
- cloudnet1006 - should be ok (self-HA)
- cloudservices1005
- cloudvirt1036 - bound to ceph
- cloudvirt1037 - bound to ceph
- cloudvirt1038 - bound to ceph
- cloudvirt1039 - bound to ceph
- cloudvirt1040 - bound to ceph
- cloudvirt1041 - bound to ceph
- cloudvirt1042 - bound to ceph
- cloudvirt1043 - bound to ceph
- cloudvirt1044 - bound to ceph
- cloudvirt1045 - bound to ceph
- cloudvirt1046 - bound to ceph
- cloudvirt1047 - bound to ceph
- cloudvirtlocal1001
Notes:
- Ceph will have to go down, as the number of OSDs going offline is too high for the cluster to be able to rebalance; this means a full outage of VMs/toolforge/quarry/paws/...
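For reference, a minimal sketch (assuming plain `ceph` CLI access from an admin node, e.g. a cloudcephmon host) of the kind of cluster-wide flags typically set before a planned outage like this, so Ceph does not mark the D5 OSDs out or start moving data while the rack is unreachable. This is generic Ceph administration, not necessarily the exact procedure or cookbook WMCS will run:

```python
import subprocess

# Flags commonly set before a planned, short outage so the cluster does not
# start rebalancing while the D5 OSDs are unreachable. Generic sketch only,
# not the WMCS cookbook; run from a host with the client.admin keyring.
PRE_OUTAGE_FLAGS = ["noout", "norebalance", "nobackfill", "norecover"]


def ceph(*args: str) -> None:
    """Run a ceph CLI command and fail loudly if it errors."""
    subprocess.run(["ceph", *args], check=True)


def set_maintenance(enable: bool) -> None:
    action = "set" if enable else "unset"
    for flag in PRE_OUTAGE_FLAGS:
        ceph("osd", action, flag)
    ceph("status")  # let the operator sanity-check the resulting state


if __name__ == "__main__":
    set_maintenance(enable=True)  # before rebooting cloudsw1-d5-eqiad
    # ... reboot the switch, wait for the D5 OSDs to rejoin ...
    # set_maintenance(enable=False)
```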
Current plan:
- Cathal gets new cloudcephosd nodes online (T363344)
- David brings all the new nodes into the cluster (except the one already in D5) - not needed (will not help free up space)
- 1035 (kinda, one OSD drive is not ok, will need reimage later)
- 1037 <- in progress
- 1038
- David drains as many affected OSD nodes as possible (see the drain sketch after this plan)
- cloudcephosd1011
- cloudcephosd1012
- cloudcephosd1013
- cloudcephosd1014
- cloudcephosd1015
- cloudcephosd1019
- cloudcephosd1020
- cloudcephosd1023
- cloudcephosd1024
- Andrew depools all affected cloudvirts and drains toolforge nodes (see the depool sketch after this plan)
- Andrew drains all affected cloudvirts (except for cloudvirtlocal1001)
- Do the reboot/upgrade when John is standing by with a spare switch
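For context on the "David drains as many affected OSD nodes as possible" step, a rough sketch of the generic Ceph drain procedure: mark the host's OSDs out and wait until the data has been rebuilt elsewhere. The real work will presumably go through the WMCS Ceph cookbooks; the `osd_ids_on_host` helper below is made up for illustration and assumes plain `ceph` CLI access:

```python
import json
import subprocess
import time


def ceph_json(*args: str):
    """Run a ceph CLI command with JSON output and parse the result."""
    result = subprocess.run(
        ["ceph", *args, "--format", "json"], check=True, capture_output=True
    )
    return json.loads(result.stdout)


def osd_ids_on_host(host: str) -> list[int]:
    """Hypothetical helper: find the OSD ids under a host in the CRUSH tree."""
    tree = ceph_json("osd", "tree")
    by_id = {node["id"]: node for node in tree["nodes"]}
    for node in tree["nodes"]:
        if node.get("type") == "host" and node.get("name") == host:
            return [c for c in node.get("children", []) if by_id[c]["type"] == "osd"]
    return []


def drain_host(host: str, poll_seconds: int = 60) -> None:
    """Mark every OSD on `host` out, then wait until no PGs still need them."""
    osds = osd_ids_on_host(host)
    subprocess.run(["ceph", "osd", "out", *map(str, osds)], check=True)
    while True:
        # `ceph osd safe-to-destroy` exits 0 once the OSDs hold no needed data.
        check = subprocess.run(["ceph", "osd", "safe-to-destroy", *map(str, osds)])
        if check.returncode == 0:
            return
        time.sleep(poll_seconds)


# Example: drain_host("cloudcephosd1011")
```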
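And for the "Andrew depools all affected cloudvirts" step, a sketch of what depooling usually means at the OpenStack level: disable nova-compute on each hypervisor so the scheduler stops placing new VMs there (live-migrating the running VMs away is of little use here, since Ceph itself goes down). This uses the plain `openstack` CLI; the actual WMCS tooling may differ, and the host names may need to be the FQDNs nova knows them by:

```python
import subprocess

# Cloudvirts in rack D5, per the host list above (cloudvirt1036..1047);
# cloudvirtlocal1001 is handled separately per the plan.
AFFECTED_CLOUDVIRTS = [f"cloudvirt10{n}" for n in range(36, 48)]


def depool(host: str) -> None:
    """Disable nova-compute on the hypervisor so no new VMs land on it."""
    subprocess.run(
        [
            "openstack", "compute", "service", "set",
            "--disable", "--disable-reason", "T371879: D5 switch reboot",
            host, "nova-compute",
        ],
        check=True,
    )


if __name__ == "__main__":
    for cloudvirt in AFFECTED_CLOUDVIRTS:
        depool(cloudvirt)
```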