Deploying a new service currently involves [a lot of manual steps](https://www.mediawiki.org/wiki/Services/Meetings/2015-03-19-Ops), most of which depend on operations. We should figure out a way to streamline this process, so that we
- can focus more on our main tasks like building solid and secure services or improving instrumentation and monitoring, and
- move faster without compromising on security and robustness.
While these issues might be the most pressing for service deploys, there are issues with MediaWiki core deploys as well. We discussed general deployment system needs at http://etherpad.wikimedia.org/p/futureofdeployments, and the intention is to find a solution that can be used for all services including MW core.
Here is a summary of the main requirements we identified, with a slight emphasis on what we need for #services:
- rolling deploys / config changes
  - automated roll-out of config / code changes one {instance,batch} at a time
  - coordinate pybal with upgrades for zero downtime (T73212)
  - wait for individual instances to come back up & check correct functioning (wait for ports, check logs, perform test requests etc.) before proceeding
  - stop roll-out if more than x% of machines failed
- needs to be reasonably easy to set up, configure, use and test for developers
  - should scale down for small-scale deployments & testing
- integration with config management to cleanly orchestrate config changes with code changes
  - allow developers to use proper config management without needing root
  - apply the same rolling deploy / canary precautions for both config and code changes
  - support reasonably easy testing of config changes (e.g. `--check --diff` in Ansible, targeted application of changes to individual nodes)
- integration with build process, CI and staging
  - automate dependency updates and build process (e.g. T94611)
  - what gets deployed to production is identical to what we tested on the staging cluster, etc.
- minimum privilege operation; should have a small attack surface
- (eventually) consistent deploys
  - code and config versions are explicitly specified in deployment config (audit trail)
  - services don't auto-start on boot until config and code are brought up to date
  - aborted deploys are rolled back cleanly
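
To make the rolling-deploy requirements above more concrete, here is a minimal Python sketch of the control loop: deploy one batch at a time, check each instance before proceeding, stop once more than x% of machines have failed, and roll back what was touched. Everything in it is an assumption for illustration only: `rolling_deploy`, `deploy`, `is_healthy`, `rollback` and `max_failure_ratio` are hypothetical names, not part of any existing tool, and load balancer (pybal) coordination is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional, Sequence


@dataclass
class RolloutResult:
    succeeded: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    aborted: bool = False


def batches(hosts: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `hosts` with at most `size` entries."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],          # hypothetical: push code/config to one host
    is_healthy: Callable[[str], bool],      # hypothetical: ports, logs, test requests
    rollback: Optional[Callable[[str], None]] = None,
    batch_size: int = 1,
    max_failure_ratio: float = 0.1,         # the "x%" threshold, as a fraction
) -> RolloutResult:
    result = RolloutResult()
    touched: List[str] = []
    for batch in batches(hosts, batch_size):
        for host in batch:
            touched.append(host)
            try:
                deploy(host)
                ok = is_healthy(host)
            except Exception:
                # Treat any deploy or check error as a failed instance.
                ok = False
            (result.succeeded if ok else result.failed).append(host)
        # Stop the roll-out once failures exceed the threshold for the whole fleet.
        if len(result.failed) / len(hosts) > max_failure_ratio:
            result.aborted = True
            break
    if result.aborted and rollback is not None:
        # Roll back every host we touched, most recent first.
        for host in reversed(touched):
            rollback(host)
    return result
```

With `batch_size=1` and a single host this degenerates to a plain deploy, which is one way the same code path could scale down for small-scale deployments and testing.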
We also agreed that we need to prototype some possible solutions before we can make a decision. Some of those options are listed in the sub-tasks of this ticket, even though they aren't strictly blocking progress on this task.