
[EPIC] The future of MediaWiki deployment: Tooling
Closed, ResolvedPublic

Description

We need something that works for services, MediaWiki, Phabricator, etc.

Synthesize features from scap and Trebuchet to create One Deployment Tool to Rule Them All.

  1. Fan-out deployment to efficiently distribute the bits to hundreds of servers. scap deploys via proxy deployment servers which are physically close to the group of target nodes that they serve. We are considering a git-based deployment that would be somewhat analogous to the way scap does it (a sketch of the target-node side follows this list):
    • We could set up a few dedicated deployment hosts which periodically git-fetch the relevant git repository, so that they are always primed with most of the changes.
    • The amount of data to fetch with each run should be small.
    • When we deploy a tag, target nodes fetch from the nearest deployment host and check out the requisite tag.
    • We need to avoid having multiple full copies of the source tree synced this way. By fixing MediaWiki release branching we can drastically reduce the amount of data synced at each deployment: once branching is done sanely, git will only need to transfer the deltas instead of the entire tree.
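To make the target-node side concrete, here is a minimal sketch under assumed names: the deployment host, repository URL, and checkout path are placeholders, and none of this is decided yet.

#!/bin/bash
# Sketch of a per-target deploy step; all names below are illustrative.
#   $1 - the tag to deploy, e.g. "deploy-2015-04-02-1"
set -eu

TAG="$1"
REPO_DIR="/srv/mediawiki"                   # hypothetical local checkout on the target
DEPLOY_HOST="deploy1001.eqiad.example"      # whichever deployment host is "nearest"

cd "$REPO_DIR"

# Fetch only the requested tag from the nearby deployment host; because that
# host git-fetches continuously, this transfer should be mostly small deltas.
git fetch "https://${DEPLOY_HOST}/mediawiki.git" "refs/tags/${TAG}:refs/tags/${TAG}"

# Check out the requisite tag (a detached HEAD is fine on a deployment target).
git checkout --force "refs/tags/${TAG}"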

Related Objects

Status | Subtype | Assigned
Declined | | None
Open | | None
Resolved | | demon
Declined | | mmodell
Resolved | | Legoktm
Resolved | | GWicke
Resolved | | mmodell
Resolved | | GWicke
Declined | | GWicke
Resolved | | thcipriani
Declined | | None
Resolved | | mobrovac
Resolved | | akosiaris
Resolved | | akosiaris
Declined | | mmodell
Invalid | | None
Resolved | | mmodell
Resolved | | Jdforrester-WMF
Declined | | mmodell
Resolved | | mmodell
Resolved | | mmodell
Resolved | | mmodell
Resolved | | dduvall
Resolved | | Krinkle
Resolved | | Krinkle
Resolved | | mmodell
Duplicate | | Krinkle
Resolved | | Krinkle
Resolved | | Krinkle
Resolved | PRODUCTION ERROR | MaxSem
Resolved | | Krinkle
Resolved | | Krinkle
Resolved | | Krinkle
Resolved | | Krinkle
Resolved | | Krinkle
Resolved | | mmodell
Resolved | | None
Resolved | | Joe
Resolved | | None
Resolved | | Joe

Event Timeline

mmodell raised the priority of this task from to Needs Triage.
mmodell updated the task description. (Show Details)
mmodell added a project: Deployments.
mmodell added subscribers: mmodell, greg.

What is this task about? (Some conference presentation? Or just Epic with a corresponding epic title? :P )

greg triaged this task as Medium priority. Apr 2 2015, 5:08 PM
greg set Security to None.
greg moved this task from To Triage to Backlog (Tech) on the Deployments board.
mmodell renamed this task from The future of MediaWiki deployment: Tooling to EPIC: The future of MediaWiki deployment: Tooling. Apr 3 2015, 10:48 PM
mmodell updated the task description. (Show Details)

@dduvall, @thcipriani, @demon: This is a fairly helpful high-level overview/comparison of Salt and Ansible, maybe worth a read: http://jensrantil.github.io/salt-vs-ansible.html

In order to depool a server (and not trigger false alarms in the alerting system), we would like to have a deployment flag (a lock file or something) that causes the pybal check to fail, but in a way that lets pybal know that this is temporary and expected downtime. Some custom logic in pybal would then temporarily depool the server but continue checking for the status to return to normal (and bypass alerting).

So, in our current configuration, pybal checks the individual servers over SSH, using a "forced command" on the target server to run this:

uptime; touch /var/tmp/pybal-check.stamp

It also checks HTTP, but only on the Varnish proxy, not on the individual Apache nodes.
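For context, a forced command like that is typically wired up via the command= option on the check key's authorized_keys entry on each target; a sketch with placeholder key material (not our actual configuration):

# Sketch of an authorized_keys entry that forces the check command; the key is a placeholder.
command="uptime; touch /var/tmp/pybal-check.stamp",no-pty,no-port-forwarding,no-agent-forwarding ssh-rsa AAAA... pybal-check@lvs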

Changing the forced command to something like this would probably do the trick:

[ -f "/tmp/deploy.lock" ] && stat --format=deployment:%Y /tmp/deploy.lock && exit 123 || uptime;

This exits with status 123 when the lock file exists and outputs the last-modified time of the lock file as "deployment:timestamp", so that pybal can know a) that the host is down for deployment and b) when the deployment lock was last updated. We could then add some intelligence to pybal to alert if a host is stuck in deployment: the deployment process could periodically touch the lock file to indicate that it is in fact progressing and not hung somewhere.
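The pybal-side logic would live in pybal itself (Python), but the staleness decision is simple enough to sketch in shell; the 10-minute threshold and the captured output below are made-up examples:

#!/bin/bash
# Sketch: decide whether a host reporting "deployment:<epoch>" is expectedly
# depooled or stuck. OUTPUT stands in for what the SSH check printed.
OUTPUT="deployment:1433260000"      # example line captured from the forced command

EPOCH="${OUTPUT#deployment:}"       # strip the "deployment:" prefix to get the lock mtime
NOW="$(date +%s)"
MAX_AGE=600                         # arbitrary: alert if the lock is >10 minutes old

if [ $(( NOW - EPOCH )) -gt "$MAX_AGE" ]; then
  echo "host appears stuck in deployment: lock last touched $(( NOW - EPOCH ))s ago"
  exit 2                            # treat as a real problem and alert
else
  echo "host depooled for deployment (expected; lock is fresh)"
  exit 0                            # expected downtime, no alert
fi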

Sound good?

So pybal already has a built-in (and configurable) threshold limiting how many servers can be depooled at any given time. This addresses @chasemp's concern about silently depooling most of the cluster if something goes wrong with a deploy.

It looks like all we need is the change above to implement depooling during deploy: create a simple lock file at the start of the deployment and remove it at the end of the deployment process.

We should have some sort of self-checks to be sure that everything is kosher before removing the lock.
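Roughly, the deploy-side lock lifecycle could look like the following sketch; the lock path matches the forced command above, while the deploy step, health check, and touch interval are placeholders:

#!/bin/bash
# Sketch of deploy-side lock handling: take the lock so pybal depools the host,
# keep it fresh while the deploy runs, self-check, and only then remove it.
set -eu

LOCK=/tmp/deploy.lock
touch "$LOCK"                        # pybal's forced command now exits 123 and the host is depooled

# Keep the lock fresh in the background so a stale mtime really means "hung".
( while sleep 60; do touch "$LOCK"; done ) &
TOUCHER=$!
trap 'kill "$TOUCHER" 2>/dev/null' EXIT   # if the deploy dies, the lock goes stale and the alert fires

do_the_actual_deploy                 # placeholder for the real deployment steps

# Self-check before repooling: only remove the lock if the host looks healthy,
# otherwise leave it in place so the host stays depooled.
if curl --fail --silent http://localhost/healthcheck > /dev/null; then
    rm -f "$LOCK"                    # pybal's check succeeds again and the host is repooled
else
    echo "self-check failed; leaving $LOCK in place" >&2
    exit 1
fi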

mmodell closed subtask Restricted Task as Resolved. Jun 2 2015, 4:36 PM

@mmodell, re monitoring / depooling: You could consider directly checking etcd for this. See T100793.

greg reopened subtask Restricted Task as Open. Jun 4 2015, 3:50 PM
mmodell closed subtask Restricted Task as Resolved. Jun 23 2015, 3:46 PM
Luke081515 renamed this task from EPIC: The future of MediaWiki deployment: Tooling to [EPIC] The future of MediaWiki deployment: Tooling. Mar 22 2016, 6:26 PM
mmodell claimed this task.

I'm going to close this as it's no longer actively worked on and all direct subtasks have been closed.