
Cookbook for rack depool
Open, MediumPublic

Description

More of a sketch than a full-on solution.

It would be quite valuable for network maintenance and outage response to have a cookbook that depools all servers in a given rack.
Handling a full row is not necessary, as we're moving toward a per-rack redundancy model (and in extreme cases we could run the cookbook multiple times).

I think most of the building blocks are there (restart cookbooks, restart doc), and what's needed is the glue between them.

One idea is to have per-team or per-server-type cookbooks that are themselves called by this meta cookbook. Hosts with no downtime cookbook would be listed in the script output.
Many hosts also have a depool command-line tool that takes care of the whole depool action. For such hosts the cookbook could run it via cumin (which would only work for maintenance, not when the rack is already down).
The cookbook could also check etcd, when etcd manages that host's pooled status.
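
As a rough illustration of the "glue" idea above, here is a minimal sketch in plain Python. It is not spicerack code: Host, HANDLERS, register and split_by_handler are hypothetical names, and it assumes each host's role (team or server type) is already known.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Host:
    name: str
    role: str  # hypothetical: team or server type, e.g. "appserver"


# Hypothetical registry: role -> callable that depools one host.
HANDLERS: Dict[str, Callable[[Host], None]] = {}


def register(role: str):
    """Decorator letting a team register the depool handler for its role."""
    def wrap(func: Callable[[Host], None]) -> Callable[[Host], None]:
        HANDLERS[role] = func
        return func
    return wrap


@register("appserver")
def depool_appserver(host: Host) -> None:
    # Placeholder only: a real handler would call the team's own cookbook
    # or run the host's depool tool via cumin.
    print(f"depooling {host.name}")


def split_by_handler(hosts: List[Host]) -> Tuple[List[Host], List[Host]]:
    """Return (hosts with a registered handler, hosts needing manual action)."""
    handled = [h for h in hosts if h.role in HANDLERS]
    manual = [h for h in hosts if h.role not in HANDLERS]
    return handled, manual
```

The second list returned by split_by_handler is what the meta cookbook would print as "needs manual depool".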

For example:
sudo cookbook sre.network.depool-rack B2 --dry-run (or show) would list the hosts to be depooled, their status (pooled or not), and whether they have a depool mechanism associated:
Host AAAA needs manual depool
Host DB1234 is a master DB, needs manual depool
Host BBBB is currently pooled and would be depooled via XXX
Host ganeti1234 needs to be drained from 10 VMs
etc

Without --dry-run the script could ask "do you want to depool AAAA [y/N]", and a --force would just do it and list at the end what still needs to happen manually. Once we're comfortable with it we could even make the --force behavior the default, as depooling all servers from a given rack shouldn't be an issue (minus special hosts like master DBs).
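
The --dry-run / prompt / --force behavior could look roughly like the sketch below. Everything here is an assumption made for illustration: depool_rack, the mechanisms mapping (host name to a depool callable, or None when no mechanism exists) and the ask parameter are hypothetical, and hosts the operator declines are reported alongside the hosts with no mechanism.

```python
from typing import Callable, Dict, Iterable, List, Optional

Depool = Callable[[str], None]


def depool_rack(
    hosts: Iterable[str],
    mechanisms: Dict[str, Optional[Depool]],
    dry_run: bool = False,
    force: bool = False,
    ask: Callable[[str], str] = input,
) -> List[str]:
    """Depool hosts one by one; return those still needing manual action."""
    manual: List[str] = []
    for host in hosts:
        depool = mechanisms.get(host)
        if depool is None:
            # No known mechanism: report it and move on.
            print(f"Host {host} needs manual depool")
            manual.append(host)
            continue
        if dry_run:
            print(f"Host {host} would be depooled")
            continue
        if not force:
            answer = ask(f"do you want to depool {host} [y/N] ")
            if answer.strip().lower() != "y":
                # Declined by the operator: still pooled, so still to do.
                manual.append(host)
                continue
        depool(host)
    return manual
```

Passing ask as a parameter keeps the prompt testable without a terminal; making --force the default later would only mean flipping one default value.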

Getting the list of servers in a rack is easy thanks to Netbox; mapping those servers to the proper depool mechanism is yet to be solved.
Additionally, this should be well documented so service owners can easily add mechanisms to depool their hosts.

Last, each host-specific depool mechanism (e.g. a cookbook) could have a way to implement checks and safeguards (e.g. a lock that prevents a specific host from being depooled). But that would be fine to implement in a later iteration.
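
A lock-style safeguard could be as small as a wrapper around whatever depool callable a host uses. This is only a sketch: guarded and DepoolLocked are hypothetical names, and the set of locked hosts is an in-memory stand-in, since where such locks would actually be stored is left open here.

```python
from typing import Callable, Set


class DepoolLocked(Exception):
    """Raised when a host is protected from being depooled."""


def guarded(depool: Callable[[str], None], locked: Set[str]) -> Callable[[str], None]:
    """Wrap a depool callable so that locked hosts are refused."""
    def wrapper(host: str) -> None:
        if host in locked:
            raise DepoolLocked(f"{host} is locked against depool")
        depool(host)
    return wrapper
```

The meta cookbook could catch DepoolLocked and add the host to the list of items needing manual attention.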

Of course, sudo cookbook sre.network.repool-rack B2 would do the reverse.

I'm sure there are a lot of edge cases, but if this can help reduce them, or even just work on half the servers present in a rack, it would be a great improvement. The same goes for the incentive: server owners would not need to be involved during single-rack maintenance or outages.