The deployment of a new service currently involves [a lot of manual steps](https://www.mediawiki.org/wiki/Services/Meetings/2015-03-19-Ops), most of which depend on operations. We should figure out a way to streamline this process, so that we:
- can focus more on our main tasks like building solid and secure services or improving instrumentation and monitoring, and
- move faster without compromising on security and robustness.
While these issues might be the most pressing for service deploys, there are issues with MediaWiki core deploys as well. We discussed general deployment system needs at http://etherpad.wikimedia.org/p/futureofdeployments, and the intention is to find a solution that can be used for all services including MW core.
Some of the requirements identified in that meeting were:
- rolling deploys / config changes
  - automated roll-out of config / code changes one {instance,batch} at a time
  - orchestrate pybal with upgrade (zero downtime), T73212
  - wait for individual instances to come back up & check correct functioning (wait for ports, check logs, perform test requests etc) before proceeding
  - stop roll-out if more than x% of machines failed
- needs to be reasonably easy to set up, configure, use and test for developers
  - should scale down for small-scale deployments & testing
- integration with build process, CI and staging
  - automate dependency updates and build process (ex: T94611)
  - what gets deployed to production is identical to what we tested on staging cluster, etc.
- integration with config management to cleanly orchestrate config changes with code changes
  - allow developers to use proper config management without needing root
- minimum privilege operation; should have a small attack surface
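
To make the rolling-deploy requirement more concrete, here is a minimal Python sketch of the control loop it implies: update one batch at a time, check each instance before moving on, and stop once more than x% of machines have failed. Everything named here is hypothetical and chosen for illustration only: `rolling_deploy`, the `deploy` and `healthy` callables, and the `depool` / `repool` hooks (which stand in for whatever load balancer orchestration is eventually used; they are not a PyBal API). It is not a proposal for a specific tool.

```python
from typing import Callable, List, Sequence, Tuple

HostAction = Callable[[str], None]


def _noop(host: str) -> None:
    """Default hook that does nothing (hypothetical placeholder)."""


def rolling_deploy(
    hosts: Sequence[str],
    deploy: HostAction,
    healthy: Callable[[str], bool],
    batch_size: int = 1,
    max_failure_pct: float = 10.0,
    depool: HostAction = _noop,
    repool: HostAction = _noop,
) -> Tuple[List[str], List[str], bool]:
    """Deploy to hosts one batch at a time.

    Returns (succeeded, failed, aborted). The roll-out is aborted as soon
    as the share of failed hosts, relative to all hosts, exceeds
    max_failure_pct. All callables are assumptions made for this sketch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    succeeded: List[str] = []
    failed: List[str] = []

    for start in range(0, len(hosts), batch_size):
        for host in hosts[start:start + batch_size]:
            try:
                depool(host)        # take the host out of rotation first
                deploy(host)        # push the code / config change
                ok = healthy(host)  # e.g. wait for ports, test requests
            except Exception:
                ok = False          # any error counts as a failed host
            if ok:
                repool(host)        # only healthy hosts go back in rotation
                succeeded.append(host)
            else:
                failed.append(host)

        # Check the failure threshold after every batch.
        if 100.0 * len(failed) / len(hosts) > max_failure_pct:
            return succeeded, failed, True

    return succeeded, failed, False
```

In this sketch a failed host is left depooled rather than rolled back, and the threshold is only evaluated between batches; both are simplifications that a real solution would need to decide on explicitly. Setting `batch_size=1` with a single host covers the small-scale and testing case.
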
We also agreed that we need to prototype some possible solutions before we can make a decision. Some of those options are listed in the sub-tasks of this ticket, even though they aren't strictly blocking progress on this task.