High level plan:
- Create common infrastructure for a rolling reboot cookbook in the form of library functions, which are then glued together with as little code duplication as possible.
- For each service a specific cookbook is instantiated, named sre.<service>.reboot (e.g. sre.swift.reboot), plus a separate cookbook, sre.hosts.reboot-single, to reboot one-off hosts.
What should the general reboot framework cover?
- targets a set of hosts addressed by a Cumin alias or a Cumin host globbing
- it takes a target batch size of hosts, with an upper bound specified both as an absolute number and as a percentage (since cluster sizes can differ per DC)
- the batch is downtimed and depooled
- reboots are triggered for all servers in the batch in parallel
- for each rebooted host in the batch, availability is checked (Icinga checks recovering, plus potential service-specific metrics/tooling/consistency checks)
- once the whole batch has recovered, repool it and start with the next batch
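The batch-sizing and batching logic above can be sketched as follows. This is a minimal model, not Spicerack code; the function names are made up for illustration:

```python
import math


def compute_batch_size(total_hosts, max_hosts, max_percent):
    """Batch size bounded both by an absolute number and by a percentage.

    The percentage bound matters because cluster sizes differ per DC:
    10% of a 100-host cluster is 10 hosts, but 10% of a 20-host
    cluster is only 2.
    """
    by_percent = math.floor(total_hosts * max_percent / 100)
    return max(1, min(max_hosts, by_percent))


def batches(hosts, batch_size):
    """Split the target host list into consecutive reboot batches."""
    for i in range(0, len(hosts), batch_size):
        yield hosts[i:i + batch_size]
```

For a 20-host cluster with max_hosts=5 and max_percent=10, the effective batch size is 2 (the percentage bound wins); for a 100-host cluster with the same limits, it is 5 (the absolute bound wins).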
When adding a new service-specific cookbook (e.g. thumbor), the new sre.thumbor.reboot cookbook would only need to define the acceptable number of hosts being down and the criteria to confirm a host is back up fine. And once it has been initially vetted, the reboot process is much more future-proof than a generic cookbook, which might be misused.
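One way to structure that: a base class owns the shared batching/downtime/depool/reboot/repool flow, and each service cookbook overrides only the two service-specific hooks. A sketch, assuming nothing about the actual Spicerack cookbook interface (class and method names are illustrative):

```python
from abc import ABC, abstractmethod


class RollingRebootBase(ABC):
    """Shared rolling-reboot flow; subclasses fill in service specifics."""

    @property
    @abstractmethod
    def max_hosts_down(self):
        """Acceptable number of hosts down at once for this service."""

    @abstractmethod
    def is_host_healthy(self, host):
        """Service-specific check that a rebooted host is back up fine."""


class ThumborReboot(RollingRebootBase):
    """All a hypothetical sre.thumbor.reboot cookbook would need to define."""

    max_hosts_down = 2  # overrides the abstract property with a constant

    def is_host_healthy(self, host):
        # e.g. Icinga checks green plus a service-specific HTTP probe;
        # stubbed out here since the real checks are service-dependent.
        return True
```

The base class never needs to change when a new service is onboarded, which is what makes the framework future-proof compared to a one-size-fits-all generic cookbook.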
Once such a generic framework exists, the existing cookbooks could also be converted over:
- sre.elasticsearch.rolling-reboot: Rolling reboot of elasticsearch servers
- sre.hadoop.reboot-workers: Reboot Hadoop worker nodes
- sre.maps.reboot: Maps reboot cookbook
- sre.wdqs.reboot: WDQS reboot cookbook
(and potentially also fold in sre.hosts.upgrade-and-reboot one way or the other; the upgrade part is a bit of an antipattern)
Available tools in Spicerack:
- Since the first development of that CR, Spicerack now has better support for running Cumin commands on clusters behind load balancers; see LBRemoteCluster. It should be used to manage any cluster behind a load balancer, for the additional features and safety nets it provides.
- In case we want this new cookbook to work only on hosts not behind a load balancer, we should enforce that by checking with confctl that that's indeed the case.
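Such a guard could be a simple check on state already fetched from confctl before proceeding. Here it is modelled as a pure function; the query itself and the shape of the data (hostname to list of pooled LB services) are assumptions, not the real confctl output format:

```python
def ensure_not_behind_lb(hosts, confctl_services):
    """Abort if any target host is registered behind a load balancer.

    confctl_services: mapping of hostname -> list of LB service names,
    as it might be derived from a confctl query (hypothetical shape).
    """
    pooled = {host: confctl_services[host]
              for host in hosts if confctl_services.get(host)}
    if pooled:
        raise RuntimeError(
            'Refusing to proceed, hosts are behind a load balancer: '
            '{}'.format(pooled))
```

Failing early here is the safety net: the generic cookbook bails out instead of silently rebooting a pooled host.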
- What about hosts that are part of clusters that have a more specialized cookbook that should be used instead? Should we detect that and tell the user to use the other cookbook? At the same time, all of those cookbooks restart the whole cluster AFAIK.
- If we do want this new cookbook to work only on hosts not behind a load balancer, we should use a less generic name than "reboot", which implies that it works with any host.
- As for the Icinga downtime, we now have a way to force a recheck of all the services, which should be useful to prevent false alarms.
- Again for Icinga, we can now get the status of the checks for a given set of hosts, allowing us to quickly see whether they are in optimal state. See get_status().
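Putting the last two points together, the post-reboot wait could poll that per-host status until everything is optimal, with a bounded number of attempts. A sketch where the poll callable stands in for the real Icinga get_status() query (its dict-of-booleans return shape is an assumption for illustration):

```python
import time


def wait_until_optimal(poll_status, attempts=30, delay=1.0):
    """Poll a status callable until every host is optimal, or give up.

    poll_status() -> dict of hostname -> bool (True when all Icinga
    checks for that host are optimal). The real implementation would
    wrap the Icinga status query mentioned above.
    """
    for attempt in range(attempts):
        status = poll_status()
        if all(status.values()):
            return status
        if attempt < attempts - 1:
            time.sleep(delay)
    failed = sorted(host for host, ok in status.items() if not ok)
    raise TimeoutError('Hosts still not optimal: {}'.format(failed))
```

Forcing a recheck before the first poll, as suggested above, would shorten this wait and avoid acting on stale check results.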