[ceph] Automate fleet upgrade/restart
Closed, ResolvedPublic

Assigned To

Authored By

	dcaro
	Feb 26 2021, 3:33 PM

Description

A full fleet reboot/upgrade takes a while to do manually, so it would be worth automating in a cookbook.
Process:

Make sure the cluster is healthy (sudo ceph status -> HEALTH_OK)
Set the cluster noout+norebalance policies:
- sudo ceph osd set noout
- sudo ceph osd set norebalance
Downtime the 'Ceph OSDs Down' check on icinga alert1001 host
Host by host (including control nodes; those don't need noout/norebalance, but having the flags set does not hurt):
- sudo cookbook sre.hosts.upgrade-and-reboot --depool-cmd 'true' --repool-cmd 'true' <host-fqdn>
- Wait until ceph cluster is healthy again
  - If any PGs are stuck in an 'undersized+remapped..' or similar state, unset norebalance for a bit and set it again afterwards:
    - sudo ceph osd unset norebalance
    - Wait until the cluster is healthy
    - sudo ceph osd set norebalance
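
The process above could be sketched in Python roughly as follows. This is only an illustration under several assumptions, not the actual cookbook: the function names (run, is_healthy, wait_healthy, rolling_upgrade) are hypothetical, every command is run locally through one runner even though in practice the ceph commands and the sre.hosts.upgrade-and-reboot cookbook would likely run from different hosts, and the icinga downtime is left as a manual step. It also assumes that 'ceph status --format json' reports health under health.status and health.checks, and that the set flags show up as an OSDMAP_FLAGS warning that has to be ignored while waiting.

```python
import json
import subprocess
import time

FLAGS = ("noout", "norebalance")


def run(cmd):
    """Run a command locally and return its stdout (assumption: local admin access)."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def is_healthy(runner=run, ignore_checks=()):
    """True if the cluster is HEALTH_OK, or if only ignored health checks are active."""
    out = runner(["sudo", "ceph", "status", "--format", "json"])
    health = json.loads(out)["health"]
    if health["status"] == "HEALTH_OK":
        return True
    checks = set(health.get("checks", {}))
    return bool(checks) and not (checks - set(ignore_checks))


def wait_healthy(runner=run, ignore_checks=(), timeout=3600, interval=30,
                 sleep=time.sleep):
    """Poll until healthy; return False if still unhealthy after `timeout` seconds."""
    waited = 0
    while not is_healthy(runner, ignore_checks):
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True


def rolling_upgrade(hosts, runner=run, timeout=3600, interval=30, sleep=time.sleep):
    """Upgrade and reboot the hosts one at a time, following the steps in this task."""
    if not is_healthy(runner):
        raise RuntimeError("cluster is not HEALTH_OK, refusing to start")
    for flag in FLAGS:
        runner(["sudo", "ceph", "osd", "set", flag])
    # The 'Ceph OSDs Down' icinga downtime is not automated in this sketch.
    ignore = ("OSDMAP_FLAGS",)  # assumed name of the warning caused by the flags
    for host in hosts:
        runner(["sudo", "cookbook", "sre.hosts.upgrade-and-reboot",
                "--depool-cmd", "true", "--repool-cmd", "true", host])
        if wait_healthy(runner, ignore, timeout, interval, sleep):
            continue
        # PGs are probably stuck: allow rebalancing for a while, then forbid it again.
        runner(["sudo", "ceph", "osd", "unset", "norebalance"])
        recovered = wait_healthy(runner, ignore, timeout, interval, sleep)
        runner(["sudo", "ceph", "osd", "set", "norebalance"])
        if not recovered:
            raise RuntimeError(f"cluster did not recover after upgrading {host}")
```

Passing the runner and sleep functions as parameters keeps the sketch testable without a real cluster. The steps listed in this task do not mention clearing noout/norebalance once all hosts are done, so the sketch leaves the flags set.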

dcaro triaged this task as Medium priority. Feb 26 2021, 3:33 PM

dcaro created this task.

Done