Deploying a new service currently involves [a lot of manual steps](https://www.mediawiki.org/wiki/Services/Meetings/2015-03-19-Ops), most of which depend on operations. We should figure out a way to streamline this process, so that we
- can focus on useful work such as building solid and secure services or improving instrumentation and monitoring, and
- move faster without compromising on security and robustness.
While these issues might be the most pressing for service deploys, there are issues with MediaWiki core deploys as well. We discussed general deployment system needs at http://etherpad.wikimedia.org/p/futureofdeployments.
Some of the requirements identified in that meeting were:
- all teams are moving towards rolling deploys
- coordinate pybal with upgrades for zero downtime (T73212)
- check status & metrics before proceeding (canary) and automatically abort if checks fail
- integration with CI / staging
- what gets deployed to production is identical to what we tested on staging cluster, etc.
- minimum privilege operation
- should be easy to use for developers & scale down for small-scale development & testing
- integration with config management to cleanly orchestrate config changes with code changes
- allow developers to use proper config management without needing root
We also agreed that we need to prototype some possible solutions before we can make a decision. The options we are considering are discussed in the sub-tasks of this ticket.
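
To make the rolling-deploy requirements above a bit more concrete, here is a minimal sketch of the control flow a prototype would need: take one host out of rotation, upgrade it, check its status, and abort automatically if the check fails. Everything in it is an assumption for illustration. `rolling_deploy`, `DeployResult` and the `depool` / `upgrade` / `healthy` / `repool` callables are hypothetical stand-ins, not the API of pybal or of any existing deployment tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DeployResult:
    """Outcome of one rolling deploy (hypothetical type for this sketch)."""
    upgraded: List[str] = field(default_factory=list)
    aborted_on: Optional[str] = None


def rolling_deploy(
    hosts: List[str],
    depool: Callable[[str], None],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    repool: Callable[[str], None],
) -> DeployResult:
    """Upgrade hosts one at a time; the first host acts as the canary.

    Each host is depooled before it is upgraded and repooled only after
    its health check passes. On the first failed check the deploy stops:
    the failing host stays depooled and the remaining hosts are untouched.
    """
    result = DeployResult()
    for host in hosts:
        depool(host)            # stop sending traffic to this host
        upgrade(host)           # install the new version
        if not healthy(host):   # status / metrics check
            result.aborted_on = host
            return result       # automatic abort, host stays depooled
        repool(host)            # back in rotation
        result.upgraded.append(host)
    return result


if __name__ == "__main__":
    # In-memory stand-in for the load balancer pool (illustration only).
    pool = {"host1", "host2", "host3"}
    outcome = rolling_deploy(
        ["host1", "host2", "host3"],
        depool=pool.discard,
        upgrade=lambda host: None,             # pretend to install
        healthy=lambda host: host != "host2",  # simulate a bad host2
        repool=pool.add,
    )
    print(outcome, sorted(pool))
```

Passing the steps in as callables keeps the orchestration logic independent of how depooling, upgrading and checking are actually implemented, so the same loop could run against a staging cluster or a single development machine.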