[ceph] Automate fleet upgrade/restart
Closed, Resolved, Public

Description

Doing a full fleet reboot/upgrade manually takes some time, so it would be worth automating it in a cookbook.
Process:

  • Make sure the cluster is healthy (sudo ceph status -> HEALTH_OK)
  • Set the cluster noout+norebalance policies:
    • sudo ceph osd set noout
    • sudo ceph osd set norebalance
  • Downtime the 'Ceph OSDs Down' check on the Icinga host alert1001
  • Go host by host (including control nodes; those don't need noout/norebalance, but having them set does not hurt):
    • sudo cookbook sre.hosts.upgrade-and-reboot --depool-cmd 'true' --repool-cmd 'true' <host-fqdn>
    • Wait until ceph cluster is healthy again
      • If any PGs are stuck in an 'undersized+remapped..' or similar state, unset norebalance for a bit and set it again afterwards:
        • sudo ceph osd unset norebalance
        • Wait until cluster healthy
        • sudo ceph osd set norebalance
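
The process above could be sketched roughly as follows. This is only an illustration, not the actual cookbook: the names `run`, `health_ok`, `wait_until` and `rolling_upgrade`, the retry counts and the polling interval are all assumptions made for the example, and the Icinga downtime step is left out. The health check is passed in as a callable because Ceph may report a warning of its own while noout/norebalance are set, so a plain HEALTH_OK test is probably too strict for the per-host wait.

```python
import subprocess
import time
from typing import Callable, Iterable, List

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command, raise on a non-zero exit code, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def health_ok(runner: Runner = run) -> bool:
    """Return True if `ceph health` reports HEALTH_OK.

    Assumption: good enough for the initial check; while the flags are set
    a looser predicate (e.g. based on PG states) is likely needed.
    """
    return runner(["sudo", "ceph", "health"]).strip().startswith("HEALTH_OK")


def wait_until(healthy: Callable[[], bool], attempts: int = 60,
               interval: float = 30.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `healthy` up to `attempts` times; return whether it became true."""
    for _ in range(attempts):
        if healthy():
            return True
        sleep(interval)
    return False


def rolling_upgrade(hosts: Iterable[str], healthy: Callable[[], bool],
                    runner: Runner = run, attempts: int = 60,
                    interval: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Upgrade and reboot the hosts one by one, following the steps above."""
    if not healthy():
        raise RuntimeError("cluster is not healthy, refusing to start")

    for flag in ("noout", "norebalance"):
        runner(["sudo", "ceph", "osd", "set", flag])

    for host in hosts:
        runner(["sudo", "cookbook", "sre.hosts.upgrade-and-reboot",
                "--depool-cmd", "true", "--repool-cmd", "true", host])
        if wait_until(healthy, attempts, interval, sleep):
            continue
        # PGs look stuck: allow rebalancing for a bit, then set the flag again.
        runner(["sudo", "ceph", "osd", "unset", "norebalance"])
        recovered = wait_until(healthy, attempts, interval, sleep)
        runner(["sudo", "ceph", "osd", "set", "norebalance"])
        if not recovered:
            raise TimeoutError(f"cluster did not recover after {host}")
```

Injecting `runner`, `healthy` and `sleep` keeps the sequencing testable without a real cluster; a dry run can pass a runner that only logs the commands it would execute.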