Make scap able to depool/repool servers via the conftool API
Closed, Resolved (Public)

Assigned To

None

Authored By

	Joe
	Jun 30 2015, 4:48 PM

Description

Servers can now be depooled from pybal using conftool (https://wikitech.wikimedia.org/wiki/Conftool), which can also be used as an external library.

So when doing a rolling restart, scap should do something like:

Get the list of all pooled servers, per datacenter, in api, rendering, jobrunners and apache
Create per-datacenter sets that each include no more than max(1, 5%) of all servers in a cluster
For each set:
1. Depool all the servers. This can probably be done in parallel
2. Wait N seconds (for now; in the not-too-distant future we'll have a way to verify depooling, but this is *good enough*)
3. For each server: restart HHVM, verify rendering works again (use the pybal proxyfetch url if possible), repool

This should prevent us from restarting servers while they're pooled and should minimize the number of 503s we'll see. The jobrunners might need some special handling though.
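The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not scap's implementation: the `depool`, `restart`, `verify` and `repool` callables, the function names and the `percent` parameter are all hypothetical, and depooling within a set is done sequentially here for simplicity.

```python
import time
from typing import Callable, List


def make_batches(servers: List[str], percent: int = 5) -> List[List[str]]:
    """Split one cluster's servers (in one datacenter) into sets of at most
    max(1, percent% of all servers)."""
    size = max(1, len(servers) * percent // 100)
    return [servers[i:i + size] for i in range(0, len(servers), size)]


def rolling_restart(
    servers: List[str],
    depool: Callable[[str], None],    # hypothetical: depool one host
    restart: Callable[[str], None],   # hypothetical: restart HHVM on one host
    verify: Callable[[str], bool],    # hypothetical: check rendering works
    repool: Callable[[str], None],    # hypothetical: repool one host
    wait_seconds: float = 0.0,
    percent: int = 5,
) -> List[str]:
    """Restart every server set by set; return the hosts that were repooled."""
    done = []
    for batch in make_batches(servers, percent):
        # 1. Depool the whole set.
        for host in batch:
            depool(host)
        # 2. Wait N seconds, since depooling cannot be verified yet.
        time.sleep(wait_seconds)
        # 3. Restart, verify and repool each server in turn.
        for host in batch:
            restart(host)
            if not verify(host):
                # Leave the host depooled and stop the rollout.
                raise RuntimeError("verification failed on %s" % host)
            repool(host)
            done.append(host)
    return done
```

With 40 servers and the default 5%, this yields 20 sets of 2 servers; with fewer than 20 servers, each set contains a single server.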

How to use conftool?

Include the conftool puppet class. It still needs some work to be used in labs, where pybal is not present anyways...
In your python program, it's as simple as:

from conftool import configuration, KVObject, node
c = configuration.get("/etc/conftool/config.yaml")
KVObject.setup(c)
# For now you need the datacenter, cluster and servicename in order to find a node; this will get better
n = node.Node('eqiad', 'appserver', 'apache2', 'mw1019.eqiad.wmnet')
# Depool a node
n.pooled = "no"
n.write()
# Pool the node again
n.pooled = "yes"
n.write()

Since conftool was not designed as an external library but as a specific set of tools, the API could be better, and we can work on it.
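One way a caller might wrap the depool/repool pair shown above is a small context manager. This is a sketch, not part of conftool: `depooled` and `FakeNode` are hypothetical names, and the only assumption about the node object is that it has a `pooled` attribute and a `write()` method, as in the snippet above.

```python
from contextlib import contextmanager


@contextmanager
def depooled(node):
    """Depool `node` while the body runs; repool only if the body succeeds.

    `node` is assumed to expose `pooled` and `write()` like the node object
    in the description. This helper is hypothetical, not a conftool API.
    """
    node.pooled = "no"
    node.write()
    yield node
    # Reached only when the body did not raise, so a server whose restart
    # or verification failed stays depooled.
    node.pooled = "yes"
    node.write()


class FakeNode:
    """Stand-in for a conftool node, used here only for illustration."""

    def __init__(self):
        self.pooled = "yes"
        self.writes = []

    def write(self):
        # Record the value that would have been written.
        self.writes.append(self.pooled)
```

Usage would look like `with depooled(n): restart_and_verify(n)`, where `restart_and_verify` is whatever the caller does while the server is out of rotation. Not repooling on failure is a design choice here; a caller that prefers to always repool could use `try`/`finally` instead.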

Details

	Subject	Repo	Branch	Lines +/-
	conftool::scripts: add a safe-service-restart script	operations/puppet	production	+234 -0


Revisions and Commits

Restricted Differential Revision

Related Objects

Status	Subtype	Assigned	Task
Resolved		Legoktm	T67289 Use semantic versioning scheme for WMF (all) releases
Resolved		• GWicke	T102550 Use semantic versioning for services (for consistency with mediawiki core)
Resolved		• mmodell	T94620 [EPIC] The future of MediaWiki deployment: Tooling
Open	Feature	None	T22085 [scap] Local sync script on any individual server should be atomic
Resolved		None	T125629 Depool proxies temporarily while scap is ongoing to avoid taxing those nodes
Resolved		None	T104352 Make scap able to depool/repool servers via the conftool API
Resolved		Joe	T73212 Make it possible to quickly and programmatically pool and depool application servers
Resolved		None	T115899 Move scap target configuration to etcd
Resolved		Joe	T163565 Install conftool on deployment masters

Event Timeline

Joe created this task.Jun 30 2015, 4:48 PM

Joe raised the priority of this task from to High.

Joe updated the task description. (Show Details)

Joe added projects: Performance-Team, acl*sre-team, Release-Engineering-Team, Deployments, HHVM.

Joe added subscribers: • MZMcBride, dduvall, • demon and 16 others.

Is there any way we can discover the topology from conftool? We have a list of mw servers from the dsh file, but currently scap has no way to know anything more about the target hosts.

Note that this problem statement could very well be expanded to all of our application clusters. I just think we need a bit more tooling (as in: evolution of conftool) before we can get to that.

For the current workflow of the scap family of tools, it would be easiest if we could select a list of servers and a batch size on the deploy server side (e.g. tin) and then run a script via ssh on each host that did the depool, restart, verify, repool steps. It wouldn't be impossible to change our job class so that there was a deploy-server-side pre/post component for each host, but it would take more time to develop and test. The existing tooling fairly robustly handles the "run this command on these servers N at a time" pattern.
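The "run this command on these servers N at a time" pattern can be illustrated with a thread pool. This is a generic sketch, not scap's job class: `run_on_hosts` is a hypothetical helper, and `command` stands for whatever is run per host (in practice probably an ssh invocation of the on-host depool, restart, verify, repool script).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_on_hosts(
    hosts: List[str],
    command: Callable[[str], Any],  # hypothetical per-host action
    batch_size: int,
) -> Dict[str, Any]:
    """Run command(host) for every host, at most batch_size at a time.

    Returns a dict mapping each host to its result, or to the exception
    it raised, so one failing host does not hide the others.
    """
    results: Dict[str, Any] = {}
    # max_workers bounds how many hosts are being worked on concurrently.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        futures = {host: pool.submit(command, host) for host in hosts}
        for host, future in futures.items():
            try:
                results[host] = future.result()
            except Exception as exc:
                results[host] = exc
    return results
```

Note that this keeps up to `batch_size` hosts busy at all times rather than waiting for a whole set to finish before starting the next one, which differs slightly from the strict per-set loop in the task description.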

@bd808: Note that Release-Engineering-Team is working on the next generation of deployment tooling, and I think we had envisioned doing it exactly as you described: run the 'depool, restart, verify, repool' steps, all on the target host.

• mmodell added a parent task: T94620: [EPIC] The future of MediaWiki deployment: Tooling.Jul 1 2015, 10:13 PM

Krinkle moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board.Jul 2 2015, 5:21 AM

• mmodell lowered the priority of this task from High to Low.Jul 6 2015, 6:12 PM

• mmodell raised the priority of this task from Low to Medium.

• mmodell moved this task from To Triage to Externally Blocked on the Deployments board.

ori moved this task from Backlog: Maintenance, non-prioritized to Doing (old) on the Performance-Team board.Aug 13 2015, 2:04 AM

• mmodell mentioned this in T73212: Make it possible to quickly and programmatically pool and depool application servers.Sep 23 2015, 3:13 PM

greg moved this task from INBOX to Backlog (ARCHIVED) on the Release-Engineering-Team board.Sep 24 2015, 1:35 AM

greg added a subtask: T73212: Make it possible to quickly and programmatically pool and depool application servers.Sep 24 2015, 1:49 AM

greg removed a project: Release-Engineering-Team.

greg set Security to None.

ori moved this task from Doing (old) to Blocked (old) on the Performance-Team board.Oct 19 2015, 6:16 PM

bd808 mentioned this in T103886: Translation cache exhaustion caused by changes to PHP code in file scope.Nov 11 2015, 5:06 PM

thcipriani mentioned this in T119449: Need a way to restart services without deploying via scap.Nov 23 2015, 10:21 PM

Started talking about this at the deployment cabal meeting today.

The first use-case is in the promote or restart steps that could be done serially as part of a rolling-deploy for a service. The stage at which a particular deployment depools may have to be configurable, particularly if we split out code promote (swapping symlinks to latest code) and service restart. By default, it seems like the best time to depool a server is before swapping its code with updated code; however, this may not always be the case. In some instances, the running service may not be affected by changing code on disk and may only be affected post-service-restart, in which case it'd be better to depool before a service restart (if that is a discrete step; this is dependent on the resolution of T119449).
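A configurable depool point could be modelled as below. This is a hypothetical sketch only: the stage names `promote` and `restart` follow the wording of this comment, while `plan_steps`, `DEFAULT_STAGES` and the `depool_before` setting are invented for illustration and are not scap configuration.

```python
from typing import List, Sequence

# Per-host deployment stages, in order (names assumed for illustration).
DEFAULT_STAGES = ("promote", "restart")


def plan_steps(
    depool_before: str = "promote",
    stages: Sequence[str] = DEFAULT_STAGES,
) -> List[str]:
    """Return the ordered per-host steps with 'depool' inserted right
    before the configured stage and 'repool' after the last stage."""
    if depool_before not in stages:
        raise ValueError("unknown stage: %s" % depool_before)
    steps: List[str] = []
    for stage in stages:
        if stage == depool_before:
            steps.append("depool")
        steps.append(stage)
    steps.append("repool")
    return steps
```

With the default, a host is depooled before its code is swapped; with `depool_before="restart"`, the code is promoted while the host is still pooled and the host is depooled only around the service restart.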

My understanding (after the meeting this morning) is that it may be unclear if a given server is actually depooled after performing depooling steps. Scap tooling may also have to become aware of the max-percentage of servers for a particular service that can be depooled (available via puppet).

Hopefully @Joe will have some time in a few weeks to get something set up in beta for testing the Scap implementation. Posting this info here to make sure we don't lose our place in the discussion.

thcipriani added a project: Scap.Nov 23 2015, 10:44 PM

bd808 mentioned this in T22085: [scap] Local sync script on any individual server should be atomic.Dec 10 2015, 11:25 PM

bd808 added a parent task: T22085: [scap] Local sync script on any individual server should be atomic.

• demon added a subtask: T115899: Move scap target configuration to etcd.Jan 12 2016, 7:05 PM

Joe closed subtask T73212: Make it possible to quickly and programmatically pool and depool application servers as Resolved.Feb 2 2016, 9:51 AM

greg edited projects, added scap2; removed Deployments.Feb 9 2016, 11:52 PM

dduvall moved this task from Needs triage to MediaWiki MVP on the Scap board.Feb 12 2016, 7:53 PM

• mmodell moved this task from MediaWiki MVP to Scap3-MediaWiki-MVP on the Scap board.Mar 4 2016, 6:55 PM

• mmodell edited projects, added Scap (Scap3-MediaWiki-MVP); removed Scap.

• mmodell added a parent task: T125629: Depool proxies temporarily while scap is ongoing to avoid taxing those nodes.Mar 6 2016, 2:17 AM

greg added a project: releng-201617-q2.Mar 28 2016, 11:41 PM

greg edited projects, added releng-201617-q3; removed releng-201617-q2.Jun 22 2016, 7:52 PM

thcipriani moved this task from Needs Triage to Debt on the Scap (Scap3-MediaWiki-MVP) board.Aug 15 2016, 7:12 PM

• demon moved this task from Debt to Needs Triage on the Scap (Scap3-MediaWiki-MVP) board.Oct 5 2016, 4:12 PM

thcipriani moved this task from Needs Triage to Improvements on the Scap (Scap3-MediaWiki-MVP) board.Oct 5 2016, 4:18 PM

thcipriani added a revision: Restricted Differential Revision.Oct 12 2016, 3:35 PM

thcipriani mentioned this in rMSCA8eadbca15769: Fix HHVM restarting routines.Nov 8 2016, 6:34 PM

• Gilles moved this task from Blocked (old) to Radar on the Performance-Team board.Dec 7 2016, 8:52 PM

greg edited projects, added releng-201617-q4; removed releng-201617-q3.Dec 14 2016, 11:00 PM

• mmodell added a parent task: T163565: Install conftool on deployment masters.Apr 21 2017, 5:16 PM

• mmodell removed a parent task: T163565: Install conftool on deployment masters.

• mmodell added a subtask: T163565: Install conftool on deployment masters.

Joe closed subtask T163565: Install conftool on deployment masters as Resolved.May 11 2017, 6:25 AM

• demon mentioned this in rMSCA6c36daed0976: Create a wrapper around conftool for our pooling/depooling needs.Jul 19 2017, 5:39 PM

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.Aug 8 2017, 3:15 AM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Sep 26 2017, 7:53 PM

• Imarlier removed a project: Performance-Team (Radar).Jun 20 2018, 10:41 AM

Joe closed subtask T115899: Move scap target configuration to etcd as Resolved.Aug 2 2018, 1:34 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 8:10 PM

Krinkle removed a parent task: T103886: Translation cache exhaustion caused by changes to PHP code in file scope.Apr 12 2019, 2:50 PM

Krinkle mentioned this in T211488: Audit and sync INI settings as needed between HHVM and PHP 7 .Apr 12 2019, 2:55 PM

Change 514660 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] conftool::scripts: add a safe-service-restart script

https://gerrit.wikimedia.org/r/514660

gerritbot added a project: Patch-For-Review.Jun 6 2019, 7:30 AM

Change 514660 merged by Giuseppe Lavagetto:
[operations/puppet@production] conftool::scripts: add a safe-service-restart script

https://gerrit.wikimedia.org/r/514660

Maintenance_bot removed a project: Patch-For-Review.Jun 6 2019, 10:11 AM

Lucas_Werkmeister_WMDE mentioned this in T225207: Enable scap to roll back broken changes to MediaWiki.Jun 6 2019, 1:26 PM

Krinkle removed a project: HHVM.Oct 3 2019, 3:32 AM