Servers can now be depooled from pybal using conftool (https://wikitech.wikimedia.org/wiki/Conftool) as an external library.
So what scap should do when doing a rolling restart is something like:
# Get the list of all pooled servers, per datacenter, in api, rendering, jobrunners and apache.
# Create per-datacenter sets, each containing no more than max(1, 5%) of all servers in a cluster.
# For each set:
  # Depool all the servers. This can probably be done in parallel.
  # Wait N seconds (for now; in the not-too-distant future we'll have a way to verify depooling, but this is *good enough*).
  # For each server: restart HHVM, verify that rendering works again (use the pybal proxyfetch URL if possible), then repool.
This should prevent us from restarting servers while they're pooled and should minimize the number of 503s we'll see. The jobrunners might need some special handling though.
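The steps above can be sketched roughly as follows. This is only an illustration of the batching and ordering, not scap code: `make_batches`, `rolling_restart` and the `depool`, `restart`, `check` and `repool` callables are hypothetical names, and the caller is assumed to supply the real conftool, HHVM and health-check logic. A server that fails its check is deliberately left depooled and reported back.

```python
import time


def make_batches(servers, percent=5):
    """Split one cluster's servers into sets of at most max(1, percent%) hosts."""
    size = max(1, len(servers) * percent // 100)
    return [servers[i:i + size] for i in range(0, len(servers), size)]


def rolling_restart(pooled, depool, restart, check, repool,
                    wait=5, sleep=time.sleep):
    """Restart every server while it is depooled.

    `pooled` maps (datacenter, cluster) to a list of pooled hostnames.
    The four callables are placeholders (assumptions) for the real actions.
    Returns the hosts that failed the check and were left depooled.
    """
    failed = []
    for (dc, cluster), servers in sorted(pooled.items()):
        for batch in make_batches(servers):
            # Depool the whole set first (could be parallelized).
            for host in batch:
                depool(dc, cluster, host)
            # No way to verify depooling yet, so just wait N seconds.
            sleep(wait)
            for host in batch:
                restart(host)
                if check(host):
                    repool(dc, cluster, host)
                else:
                    failed.append(host)
    return failed
```

With 40 servers in a cluster this gives sets of 2; with fewer than 40 it falls back to one server at a time. The jobrunners would probably need a different `restart` or `check` callable.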
How to use conftool?
- Include the conftool puppet class. It still needs some work to be used in labs, where pybal is not present anyways...
- In your Python program, it's as simple as:
from conftool import configuration, KVObject, node
c = configuration.get("/etc/conftool/config.yaml")
# For now you need the datacenter, cluster and service name in order to find a node; this will get better
n = node.Node('eqiad', 'appserver', 'apache2', 'mw1019.eqiad.wmnet')
# Depool a node
n.pooled = "no"
# Pool the node again
n.pooled = "yes"
Since conftool was not designed as an external library but as a specific set of tools, the API could be better, and we can work on it.