On 2023-02-13
Incident 1
Cause
At around 14:30 UTC @dcaro took down two OSD hosts (cloudceph1001/1002) so they could be moved to rack E4. This left some placement groups read-only, causing some VMs to fail writes to disk and many VMs to get stuck and be marked as down.
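A minimal sketch of how this degraded state typically shows up on a Ceph monitor node (not the exact commands run at the time):

    # Overall cluster status; undersized/degraded or inactive PGs show up in the summary
    ceph -s
    # Per-problem detail, e.g. which placement groups are inactive and why
    ceph health detail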
Resolution
This was fixed ~10 minutes later by allowing the cluster to rebalance (ceph osd unset norebalance + ceph osd unset noout), which started shifting data and creating the missing replicas, restoring the placement groups.
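For reference, the usual flag handling around this kind of maintenance looks roughly like the following sketch; the two unset commands are the ones mentioned above, and the set commands are their usual counterpart before taking the hosts down:

    # Before the maintenance: keep the cluster from rebalancing while the hosts are down
    ceph osd set noout
    ceph osd set norebalance
    # To recover: let the cluster backfill and re-create the missing replicas
    ceph osd unset norebalance
    ceph osd unset noout
    # Watch progress until all placement groups are active+clean again
    watch ceph -s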
In the meantime, the hosts had been physically moved and were ready to be reimaged, but this required some extra configuration on their switch ports.
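For illustration only (the interface and VLAN names below are hypothetical, not taken from the actual change), the per-port configuration for such a move on a Juniper switch looks roughly like this; as Incident 2 below describes, the MTU statement was the part missing from the change:

    # Junos configuration mode, hypothetical port/VLAN names
    set interfaces xe-0/0/10 description "cloudceph1001"
    set interfaces xe-0/0/10 mtu 9192
    set interfaces xe-0/0/10 unit 0 family ethernet-switching vlan members cloud-storage1-e4-eqiad
    commit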
Incident 2
Cause
The new ports were configured without specifying an MTU and were not yet brought up. This seemed to trigger an issue with the Juniper switch in which the rest of the ports on the same VLAN would intermittently drop jumbo frames (packets larger than 1500 bytes); more details at https://phabricator.wikimedia.org/T329535#8612670.
A few minutes later, around 16:30 UTC, we had a total outage of the Cloud Ceph cluster (cloudceph*.eqiad.wmnet). The OSD daemons were flagging other OSD hosts as down, and the monitor nodes were forcing them to stop (their health probes use packets larger than 1500 bytes, which were being dropped).
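A quick way to confirm this kind of jumbo-frame drop from one of the affected hosts (assuming the usual 9000-byte MTU on the storage network; the target host here is just an example) is a ping that forbids fragmentation:

    # 8972 bytes of payload + 28 bytes of IP/ICMP headers = a 9000-byte packet
    ping -M do -s 8972 -c 10 cloudceph1003.eqiad.wmnet
    # For comparison, a standard-size ping; if this one succeeds while the jumbo
    # one fails intermittently, packets larger than 1500 bytes are being dropped
    ping -c 10 cloudceph1003.eqiad.wmnet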
Immediate measures taken
To prevent any data corruption, we shut down all the OpenStack hypervisors (cloudvirt*.eqiad.wmnet), effectively turning off Cloud VPS and its related services (Toolforge, etc.).
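A sketch of how such a mass shutdown can be done from a cluster management host with cumin (the actual procedure may have used cookbooks or per-host commands instead; the host query and command below are illustrative):

    # Power off every eqiad hypervisor
    sudo cumin 'cloudvirt*.eqiad.wmnet' 'shutdown -h now'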
Resolution
The fix was to remove the configuration for those new ports (note that they were never brought up) and to manually start all the OSD daemons in the cluster. That eventually brought the cluster back up and running.
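On the OSD hosts the daemons are managed by systemd, so manually starting all the OSD daemons amounts to something like the following on each host (a sketch, not a transcript of the session):

    # Start every OSD daemon on this host (a single one can be started with ceph-osd@<id>)
    sudo systemctl start ceph-osd.target
    # Then watch the OSDs come back up and the placement groups recover
    sudo ceph -s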
This was followed by powering up all the hypervisors and making sure that the VMs were starting correctly (see the follow-up tickets for details).
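A hedged sketch of the kind of check used to spot VMs that did not come back cleanly (the exact verification and fixes are in the follow-up tickets):

    # With admin credentials loaded, list instances that are not running
    openstack server list --all-projects --status ERROR
    openstack server list --all-projects --status SHUTOFF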
See also the comments below for more details.