The deployment of a new service currently involves [a lot of manual steps](https://www.mediawiki.org/wiki/Services/Meetings/2015-03-19-Ops), most of them depending on operations. We should figure out a way to streamline this process, so that we
- can focus more on our main tasks like building solid and secure services or improving instrumentation and monitoring, and
- move faster without compromising on security and robustness.
While these issues might be the most pressing for service deploys, there are issues with MediaWiki core deploys as well. We discussed general deployment system needs at http://etherpad.wikimedia.org/p/futureofdeployments, and the intention is to find a solution that can be used for all services including MW core.
Some of the requirements identified in that meeting were:
- all teams are moving towards rolling deploys / config changes
- automated rolling roll-out of config / code changes: update code/config, restart service one (batch) at a time
- orchestrate pybal with upgrade (zero downtime), T73212
- check status & metrics before proceeding (canary) and automatically abort if checks fail
- wait for individual services to come back up & check correct functioning (wait for ports, check logs etc) before proceeding
- stop roll-out if more than x% of machines failed
- integration with CI / staging
- what gets deployed to production is identical to what we tested on staging cluster, etc.
- minimum privilege operation
- should be easy to use for developers & scale down for small-scale development & testing
- integration with config management to cleanly orchestrate config changes with code changes
- allow developers to use proper config management without needing root
- support a sane / automated roll-out of code and config changes:
- rolling restarts in coordination with config / code change
- waiting for individual services to come back up & ability to run checks (wait for ports etc) before proceeding
- integrate with pooling / depooling
- stop roll-out if more than x% of machines failed to restart
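
To make the roll-out requirements above more concrete, here is a minimal sketch of the control loop they imply: depool a host, update and restart it, wait for a health check, repool it, and stop once more than a given fraction of machines has failed. It is an illustration only, not a proposal for a specific tool. All names (`rolling_deploy`, `RolloutResult`, `max_failure_ratio`, and the `deploy` / `is_healthy` / `depool` / `repool` callbacks) are hypothetical; the callbacks stand in for whatever deployment, checking and pooling mechanisms (e.g. pybal) we end up using.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    succeeded: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    aborted: bool = False


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str], None],      # hypothetical: update code/config and restart
    is_healthy: Callable[[str], bool],  # hypothetical: wait for ports, check logs/metrics
    depool: Callable[[str], None],      # hypothetical: take host out of rotation
    repool: Callable[[str], None],      # hypothetical: put host back into rotation
    batch_size: int = 1,
    max_failure_ratio: float = 0.1,     # the "x%" from the requirements, as a fraction
) -> RolloutResult:
    """Deploy to hosts one batch at a time, aborting if too many fail."""
    result = RolloutResult()
    for start in range(0, len(hosts), batch_size):
        for host in hosts[start:start + batch_size]:
            depool(host)
            try:
                deploy(host)
                ok = is_healthy(host)
            except Exception:
                # Treat any deploy or check error as a failed host.
                ok = False
            if ok:
                repool(host)
                result.succeeded.append(host)
            else:
                # Failed hosts are left depooled for manual inspection.
                result.failed.append(host)
        # Evaluate the abort condition after each batch.
        if len(result.failed) / len(hosts) > max_failure_ratio:
            result.aborted = True
            break
    return result
```

With `batch_size=1` the first host effectively acts as a canary: if it fails and a single failure already exceeds `max_failure_ratio`, no further hosts are touched. A real implementation would also need timeouts, logging and rollback, which are left out here.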
We also agreed that we need to prototype some possible solutions before we can make a decision. Some of those options are listed in the sub-tasks of this ticket, even though they aren't strictly blocking progress on this task.