
[EPIC] The future of MediaWiki deployment: Tooling
Open, Normal, Public

Description

We need something that works for services, MediaWiki, Phabricator, etc.

Synthesize features from scap and Trebuchet to create One Deployment Tool to Rule Them All.

  1. Fan-out deployment to efficiently distribute the bits to hundreds of servers. scap deploys via proxy deployment servers which are physically close to the group of target nodes that they serve. We are considering a git-based deployment that would be somewhat analogous to the way scap does it:
    • We could set up a few dedicated deployment hosts which periodically git-fetch the relevant git repository, so that they are always primed with most of the changes.
    • The amount of data to fetch with each run should therefore be small.
    • When we deploy a tag, target nodes fetch from the nearest deployment host and check out the requisite tag.
    • We need to avoid syncing multiple full copies of the source tree this way. By fixing MediaWiki release branching we can drastically reduce the amount of data synced at each deployment: once branching is done sanely, git will only need to transfer the deltas instead of the entire tree.
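The fetch-and-checkout flow sketched above can be tried end to end with throwaway local repositories standing in for the deployment hosts and target nodes; all paths, tag names, and identities below are illustrative, not real infrastructure:

```shell
set -e
WORK=$(mktemp -d)

# "Deployment host": a primed copy of the repository with a release tag.
git -c init.defaultBranch=main init -q "$WORK/deploy-host"
cd "$WORK/deploy-host"
git -c user.email=deploy@example -c user.name=deploy \
    commit -q --allow-empty -m "release 1"
git tag v1.0

# "Target node": clones once; subsequent deploys only fetch new objects.
git clone -q "$WORK/deploy-host" "$WORK/target"

# A new release lands on the deployment host...
git -c user.email=deploy@example -c user.name=deploy \
    commit -q --allow-empty -m "release 2"
git tag v1.1

# ...and the target fetches just that tag (deltas only, since most of
# the tree is already present) and checks it out.
cd "$WORK/target"
git fetch -q origin "refs/tags/v1.1:refs/tags/v1.1"
git checkout -q v1.1
git describe --tags    # prints: v1.1
```

The key property is the last fetch: because the target already holds everything up to v1.0, only the objects new in v1.1 cross the wire.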

Related Objects

Status      Assigned
Open        None
Open        None
Stalled     None
Open        None
Resolved    demon
Declined    mmodell
Resolved    Legoktm
Resolved    GWicke
Open        None
Resolved    GWicke
Declined    GWicke
Resolved    thcipriani
Declined    None
Resolved    mobrovac
Resolved    akosiaris
Resolved    akosiaris
Declined    mmodell
Invalid     None
Resolved    mmodell
Resolved    Jdforrester-WMF
Declined    mmodell
Resolved    mmodell
Resolved    mmodell
Resolved    mmodell
Open        None
Resolved    Krinkle
Resolved    Krinkle
Resolved    mmodell
Duplicate   Krinkle
Resolved    Krinkle
Resolved    Krinkle
Resolved    MaxSem
Resolved    Krinkle
Resolved    Krinkle
Resolved    Krinkle
Resolved    Krinkle
Resolved    Krinkle
Resolved    mmodell
Open        None
Resolved    Joe
Resolved    None
Resolved    Joe

Event Timeline

mmodell created this task.Mar 31 2015, 8:43 PM
mmodell updated the task description. (Show Details)
mmodell raised the priority of this task from to Needs Triage.
mmodell added a project: Deployments.
mmodell added subscribers: mmodell, greg.
Restricted Application added a subscriber: Aklapper.Mar 31 2015, 8:43 PM

What is this task about? (Some conference presentation? Or just Epic with a corresponding epic title? :P )

greg triaged this task as Normal priority.Apr 2 2015, 5:08 PM
greg set Security to None.
greg moved this task from To Triage to Backlog (Tech) on the Deployments board.
mmodell renamed this task from The future of MediaWiki deployment: Tooling to EPIC: The future of MediaWiki deployment: Tooling.Apr 3 2015, 10:48 PM
mmodell updated the task description. (Show Details)
mmodell updated the task description. (Show Details)Apr 10 2015, 7:00 AM

@dduvall, @thcipriani, @demon: This is a fairly helpful high-level overview/comparison of salt and ansible, maybe worth a read: http://jensrantil.github.io/salt-vs-ansible.html

In order to depool a server (without triggering false alarms in the alerting system), we would like to have a deployment flag (a lock file or something) that causes the pybal check to fail, but in a way that lets pybal know this is a temporary and expected downtime. Some custom logic in pybal would then temporarily depool the server but continue checking for the status to return to normal (and bypass alerting).

So, in our current configuration, pybal checks the individual servers via ssh, which uses a "forced command" on the target server to run this:

uptime; touch /var/tmp/pybal-check.stamp

It also checks http, but only on the varnish proxy, not on the individual apache nodes.

Changing the forced command to something like this would probably do the trick:

[ -f "/tmp/deploy.lock" ] && stat --format=deployment:%Y /tmp/deploy.lock && exit 123 || uptime;

This will exit with status 123 when the lock file exists, and output the last-modified time of the lock file as "deployment:timestamp", so that pybal knows (a) that the host is down for deployment and (b) when the deployment lock was last updated. We could then add some intelligence to pybal to alert if a host is stuck in deployment; the deployment process could periodically touch the lock file to indicate that it is in fact progressing and not hung somewhere.
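The behavior of that forced command can be demonstrated locally; this sketch parameterizes the lock path (a temp file instead of /tmp/deploy.lock) and assumes GNU stat for `--format`. The pybal-side handling of exit 123 is the proposal above, not existing code:

```shell
# Throwaway lock path standing in for /tmp/deploy.lock.
LOCK="$(mktemp -u)"

check() {
    # The proposed forced command: while the lock exists, print
    # "deployment:<mtime>" and return 123; otherwise fall back to the
    # current uptime health check.
    [ -f "$LOCK" ] && stat --format=deployment:%Y "$LOCK" && return 123 || uptime
}

check >/dev/null; echo "no lock -> exit $?"   # exit 0: healthy
touch "$LOCK"
check; echo "locked  -> exit $?"              # prints deployment:<epoch>, exit 123
rm -f "$LOCK"
```

From pybal's side, exit 0 means pooled as usual, exit 123 means an expected deployment depool, and a stale `deployment:` timestamp would be the signal that the deploy is hung.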

Sound good?

So pybal already has a built-in (and configurable) threshold limiting how many servers can be depooled at any given time. This addresses @chasemp's concern about silently depooling most of the cluster if something goes wrong with a deploy.

It looks like all we need is the change above to implement the depool-during-deploy: create a simple lock file at the start and remove it at the end of the deployment process.

We should have some sort of self-checks to be sure that everything is kosher before removing the lock.
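Putting the pieces from the last few comments together, the deploy-side lock lifecycle might look roughly like this; the lock path, heartbeat interval, and self-check are all illustrative stand-ins, not the real deploy tooling:

```shell
# Throwaway lock path standing in for /tmp/deploy.lock.
LOCK="$(mktemp -u)"

touch "$LOCK"                  # pybal check now fails -> node depooled

# Heartbeat: refresh the lock's mtime while the deploy runs, so pybal
# can distinguish a progressing deploy from a hung one.
( while [ -f "$LOCK" ]; do touch "$LOCK"; sleep 1; done ) &
HEARTBEAT=$!

deploy_ok=true                 # stand-in for the actual deploy steps

# Stop the heartbeat before touching the lock again, to avoid racing it.
kill "$HEARTBEAT" 2>/dev/null
wait "$HEARTBEAT" 2>/dev/null

# Self-check before repooling; a real check might hit a local health URL.
if "$deploy_ok"; then
    rm -f "$LOCK"              # lock gone -> pybal repools the node
fi
[ -f "$LOCK" ] && echo "still depooled" || echo "repooled"   # prints: repooled
```

If the self-check fails, the lock stays in place and the node stays depooled, which is exactly the safe failure mode: pybal's stale-timestamp alert then brings a human in.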

greg added a comment.May 23 2015, 8:52 AM

I just put this as a session for tomorrow at 2pm Lyon time per https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2015/Program

mmodell closed subtask Restricted Task as Resolved.Jun 2 2015, 4:36 PM
GWicke added a comment.EditedJun 4 2015, 8:18 AM

@mmodell, re monitoring / depooling: You could consider directly checking etcd for this. See T100793.

greg reopened subtask Restricted Task as Open.Jun 4 2015, 3:50 PM
mmodell closed subtask Restricted Task as Resolved.Jun 23 2015, 3:46 PM
mmodell updated the task description. (Show Details)Jul 1 2015, 10:16 PM
greg moved this task from Backlog to Epics on the Release-Engineering-Team board.Mar 11 2016, 10:09 PM
Luke081515 renamed this task from EPIC: The future of MediaWiki deployment: Tooling to [EPIC] The future of MediaWiki deployment: Tooling.Mar 22 2016, 6:26 PM