Make scap able to depool/repool servers via the conftool API
Closed, Resolved (Public)

Description

Servers can now be depooled from pybal using conftool (https://wikitech.wikimedia.org/wiki/Conftool) as an external library.

So what scap should do when doing a rolling restart is something like:

  1. Get the list of all pooled servers, per datacenter, in api, rendering, jobrunners and apache
  2. Create per-datacenter sets, each containing no more than max(1, 5%) of all servers in a cluster
  3. For each set:
    1. Depool all the servers. This can probably be done in parallel
    2. Wait N seconds (for now; in the not-too-distant future we'll have a way to verify depooling, but this is *good enough*)
    3. For each server: restart HHVM, verify rendering works again (use the pybal proxyfetch URL if possible), repool

This should prevent us from restarting servers while they're pooled and should minimize the number of 503s we'll see. The jobrunners might need some special handling though.
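
The steps above can be sketched roughly as follows. This is only an illustration, not scap code: `batch_size`, `make_batches` and `rolling_restart` are hypothetical names, the four callables (`depool`, `restart`, `verify`, `repool`) are placeholders for whatever scap ends up doing, and "5%" is rounded down so a set never exceeds the limit.

```python
import time


def batch_size(total, percent=5):
    """max(1, 5%) of the servers in a cluster, rounded down."""
    return max(1, total * percent // 100)


def make_batches(servers, percent=5):
    """Split the server list of one datacenter/cluster into consecutive sets."""
    size = batch_size(len(servers), percent)
    return [servers[i:i + size] for i in range(0, len(servers), size)]


def rolling_restart(servers, depool, restart, verify, repool,
                    wait_seconds=5, percent=5, sleep=time.sleep):
    """Run steps 3.1 to 3.3 for every set. The callables are hypothetical hooks."""
    for batch in make_batches(servers, percent):
        for host in batch:      # step 3.1 (sequential here for simplicity)
            depool(host)
        sleep(wait_seconds)     # step 3.2: fixed wait, depooling is not verified
        for host in batch:      # step 3.3
            restart(host)
            if not verify(host):
                # Stop the whole run and leave the host depooled.
                raise RuntimeError(f"{host} failed verification; left depooled")
            repool(host)
```

Stopping on the first failed verification is a design choice made for this sketch; continuing with the remaining hosts and reporting failures at the end would be equally plausible.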

How to use conftool?

  • Include the conftool puppet class. It still needs some work to be used in labs, where pybal is not present anyways...
  • In your python program, it's as simple as:
from conftool import configuration, KVObject, node
c = configuration.get("/etc/conftool/config.yaml")
KVObject.setup(c)
# For now you need the datacenter, cluster and service name to find a node; this will be improved
n = node.Node('eqiad', 'appserver', 'apache2', 'mw1019.eqiad.wmnet')
# Depool a node
n.pooled = "no"
n.write()
# Pool the node again
n.pooled = "yes"
n.write()
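
One way to make the depool/repool pair harder to get wrong is a small context manager around the two writes. The sketch below is hypothetical and not part of conftool; it only assumes an object with a `pooled` attribute and a `write()` method, like the `node.Node` instance above.

```python
from contextlib import contextmanager


@contextmanager
def depooled(n):
    """Depool node `n` for the duration of the block, then pool it again.

    `n` is expected to behave like the node.Node object shown above.
    """
    n.pooled = "no"
    n.write()
    yield n
    # Only reached if the block did not raise, so a failed restart
    # leaves the node depooled instead of putting it back in rotation.
    n.pooled = "yes"
    n.write()
```

Usage would look like `with depooled(n): restart_hhvm()`, where `restart_hhvm` stands for whatever restart-and-verify step the caller provides.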

Since conftool was not designed as an external library but as a specific set of tools, the API could be better, and we can work on it.

Event Timeline

Joe raised the priority of this task to High.
Joe updated the task description.
Joe added subscribers: MZMcBride, dduvall, demon and 16 others.

Is there any way we can discover the topology from conftool? We have a list of mw servers from the dsh file, but currently scap has no way to know anything more about the target hosts.

Note that this problem statement could very well be expanded to all of our application clusters. I just think we need a bit more tooling (as in: evolution of conftool) before we can get to that.

For the current workflow of the scap family of tools, it would be easiest if we could select a list of servers and a batch size from the deploy server side (e.g. tin) and then run a script via ssh on each host that did the depool, restart, verify, repool steps. It wouldn't be impossible to change our job class so that there was a deploy-server-side pre/post component for each host, but it would take more time to develop and test. The existing tooling fairly robustly handles the "run this command on these servers N at a time" pattern.
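
For readers unfamiliar with the "N at a time" pattern, a minimal sketch follows. It is not scap's actual job class: `run_n_at_a_time` and `run_on_host` are hypothetical names, and `run_on_host` stands in for the ssh invocation of the host-side depool, restart, verify, repool script.

```python
from concurrent.futures import ThreadPoolExecutor


def run_n_at_a_time(hosts, run_on_host, n):
    """Call run_on_host(host) for every host, at most n concurrently.

    Returns a dict mapping each host to its result, or to the exception
    it raised, so one failing host does not abort the others.
    """
    hosts = list(hosts)

    def safe(host):
        try:
            return run_on_host(host)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map yields results in the same order as `hosts`.
        return dict(zip(hosts, pool.map(safe, hosts)))
```

Note that this is a sliding window (a new host starts as soon as a slot frees up) rather than strict batches that wait for every member to finish.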

@bd808: Note that Release-Engineering-Team is working on the next generation of deployment tooling, and I think we had envisioned doing it exactly as you described: run the 'depool, restart, verify, repool' steps, all on the target host.

mmodell lowered the priority of this task from High to Low. Jul 6 2015, 6:12 PM
mmodell raised the priority of this task from Low to Medium.
mmodell moved this task from To Triage to Externally Blocked on the Deployments board.

Started talking about this at the deployment cabal meeting today.

The first use case is in the promote or restart steps that could be done serially as part of a rolling deploy for a service. The stage at which a particular deployment depools may have to be configurable, particularly if we split out code promote (swapping symlinks to latest code) and service restart. By default, it seems like the best time to depool a server is before swapping its code with updated code; however, this may not always be the case. In some instances, the running service may not be affected by changing code on disk and may only be affected after the service restart, in which case it'd be better to depool before the service restart (if that is a discrete step; this depends on the resolution of T119449).
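
To make the "configurable depool stage" idea concrete, here is a small hypothetical sketch. The names `DepoolStage` and `deploy_steps` are invented for illustration and do not exist in scap; the sketch also assumes promote and restart are separate steps, which is exactly what T119449 has yet to settle.

```python
from enum import Enum


class DepoolStage(Enum):
    BEFORE_PROMOTE = "promote"   # default: depool before swapping code
    BEFORE_RESTART = "restart"   # depool only just before the service restart


def deploy_steps(depool_stage=DepoolStage.BEFORE_PROMOTE):
    """Return the ordered per-host steps, with depool placed before the chosen stage."""
    steps = []
    for stage in ("promote", "restart"):
        if stage == depool_stage.value:
            steps.append("depool")
        steps.append(stage)
    steps.append("repool")
    return steps
```

With the default, a host runs depool, promote, restart, repool; with `DepoolStage.BEFORE_RESTART` the promote step happens while the host is still pooled.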

My understanding (after the meeting this morning) is that it may be unclear if a given server is actually depooled after performing depooling steps. Scap tooling may also have to become aware of the max-percentage of servers for a particular service that can be depooled (available via puppet).

Hopefully @Joe will have some time in a few weeks to get something set up in beta for testing the Scap implementation. Posting this info here to make sure we don't lose our place in the discussion.

thcipriani added a revision: Restricted Differential Revision. Oct 12 2016, 3:35 PM

Change 514660 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] conftool::scripts: add a safe-service-restart script

https://gerrit.wikimedia.org/r/514660

Change 514660 merged by Giuseppe Lavagetto:
[operations/puppet@production] conftool::scripts: add a safe-service-restart script

https://gerrit.wikimedia.org/r/514660

I think this one should be resolved now?

ping - do people agree?

No reply to last comments; assuming this is resolved