Doing a full fleet reboot/upgrade manually takes some time, so it would be worth automating in a cookbook.
Process:
- Make sure the cluster is healthy (sudo ceph status -> HEALTH_OK)
- Set the cluster noout+norebalance flags:
- sudo ceph osd set noout
- sudo ceph osd set norebalance
- Downtime the 'Ceph OSDs Down' check on icinga alert1001 host
- Host by host (including control nodes; those don't need noout/norebalance, but having the flags set does no harm):
- sudo cookbook sre.hosts.upgrade-and-reboot --depool-cmd 'true' --repool-cmd 'true' <host-fqdn>
- Wait until ceph cluster is healthy again
- If any PGs are stuck in an 'undersized+remapped..' or similar state, unset norebalance briefly and set it again afterwards:
- sudo ceph osd unset norebalance
- Wait until the cluster is healthy
- sudo ceph osd set norebalance
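As a starting point for automating this, the process above could be sketched as a few shell functions. This is only a rough sketch, not an existing cookbook: the function names and the CEPH_CMD and POLL_SECONDS variables are made up for illustration. It also assumes that while noout/norebalance are set, `ceph health` reports HEALTH_WARN with a "flag(s) set" message, so "healthy again" is approximated as either HEALTH_OK or a warning that mentions nothing but flags. The exact wording of that message should be checked against the running Ceph version.

```shell
#!/bin/sh
# Sketch only. CEPH_CMD and POLL_SECONDS are hypothetical knobs for illustration.
CEPH_CMD="${CEPH_CMD:-sudo ceph}"
POLL_SECONDS="${POLL_SECONDS:-30}"

# First line of `ceph health`, e.g. "HEALTH_OK" or "HEALTH_WARN ...".
health_line() {
    $CEPH_CMD health 2>/dev/null | head -n 1
}

# Strict check, used before touching anything.
cluster_is_healthy() {
    [ "$(health_line)" = "HEALTH_OK" ]
}

# Relaxed check for use while flags are set (assumed message format):
# accepts HEALTH_OK, or a HEALTH_WARN whose only content is "<flags> flag(s) set".
cluster_is_settled() {
    line="$(health_line)"
    [ "$line" = "HEALTH_OK" ] && return 0
    printf '%s\n' "$line" | grep -Eq '^HEALTH_WARN [a-z,]+ flag\(s\) set$'
}

# Poll until the relaxed check passes.
wait_until_settled() {
    until cluster_is_settled; do
        sleep "$POLL_SECONDS"
    done
}

# Manual escape hatch for stuck PGs: let the cluster rebalance, then pin it again.
nudge_rebalance() {
    $CEPH_CMD osd unset norebalance
    wait_until_settled
    $CEPH_CMD osd set norebalance
}

# Reboot the given hosts one at a time, waiting for the cluster in between.
rolling_reboot() {
    cluster_is_healthy || { echo "cluster is not HEALTH_OK, aborting" >&2; return 1; }
    $CEPH_CMD osd set noout
    $CEPH_CMD osd set norebalance
    for host in "$@"; do
        sudo cookbook sre.hosts.upgrade-and-reboot --depool-cmd 'true' --repool-cmd 'true' "$host"
        wait_until_settled
    done
}
```

With the functions sourced, a run would look like `rolling_reboot <host-fqdn> <host-fqdn>`. If the wait hangs on stuck PGs, `nudge_rebalance` can be called by hand from another shell. Downtiming the Icinga check is not covered here and still has to be done separately.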