
[ceph] Do a periodical all round machine reboot
Closed, Invalid · Public

Description

This achieves several goals: test and ensure that the hosts can reboot without issues (kernel/grub/etc.); make sure that the cluster can lose one host without any issues; and exercise our skills at moving pieces of the cluster around. My goal is to do this periodically, and very likely automate most of it.

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcaro triaged this task as Medium priority. Feb 1 2021, 10:57 AM
dcaro renamed this task from [ceph] Do an all round machine reboot to [ceph] Do a periodical all round machine reboot. Mar 9 2021, 5:16 PM

Great idea. Once this approach proves successful on the Ceph-based nodes, I would encourage applying a similar cadence to HVs and the other machines we operate.

Looking ahead, I'm curious about performing maintenance operations during this scheduled downtime. I'll note that this is beyond the scope of this task and very forward-looking. However, what about canary-testing updates as part of this process? That is, can we update packages, the OS, the kernel, or firmware during this downtime and reintroduce the hosts to the cluster without ill effect?