Streamline our service development and deployment process
Closed, Resolved · Public

Description

The deployment of a new service currently involves a lot of manual steps, most of them depending on operations. We should figure out a way to streamline this process, so that we

  • can focus more on our main tasks like building solid and secure services or improving instrumentation and monitoring, and
  • move faster without compromising on security and robustness.

While these issues might be the most pressing for service deploys, there are issues with MediaWiki core deploys as well. We discussed general deployment system needs at http://etherpad.wikimedia.org/p/futureofdeployments, and the intention is to find a solution that can be used for all services including MW core.

Here is a summary of the main requirements we identified, with a slight emphasis on what we need for Services:

  • rolling deploys / config changes
    • automated roll-out of config / code changes one {instance,batch} at a time
    • orchestrate pybal with upgrade (zero downtime), T73212
    • wait for individual instances to come back up & check correct functioning (wait for ports, check logs / error metrics, perform test requests etc) before proceeding
    • stop roll-out if more than x% of machines failed
  • needs to be reasonably easy to set up, configure, use and test for developers
    • should scale down for small-scale deployments & testing
  • integration with config management to cleanly orchestrate config changes with code changes
    • allow developers to use proper config management without needing root
    • apply the same rolling deploy / canary precautions for both config and code changes
    • support reasonably easy testing of config changes (ex: --check --diff in Ansible, targeted application of changes to individual nodes)
  • integration with build process, CI and staging
    • automate dependency updates and build process (ex: T94611)
    • what gets deployed to production is identical to what we tested on staging cluster, etc.
  • minimum privilege operation; should have a small attack surface
  • (eventually) consistent deploys
    • code and config versions are explicitly specified in deployment config (audit trail)
    • services don't auto-start on boot until config and code are brought up to date
    • aborted deploys are rolled back cleanly

We also agreed that we need to prototype some possible solutions before we can make a decision. Some of those options are listed in the sub-tasks of this ticket, even though they aren't strictly blocking progress on this task.

Related Objects

Event Timeline

fgiunchedi triaged this task as Medium priority. (Apr 2 2015, 9:48 AM)

It goes without saying that the individual services should also be able to handle multiple versions of the same codebase being live in production at the same time, especially with canary / rolling deploys.

Another global outage triggered by a puppet config deploy without automatic error checking: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150814-MediaWiki

@GWicke seriously?

That outage has nothing to do with puppet, and the report clearly says so. It was caused by a special condition that was not triggered in beta but was triggered in production because of high traffic and language conversion. Any system that verified that HHVM had correctly restarted would have given @ori a green light too.

Also, that is HHVM configuration, which stays in puppet for a good reason.

Please stop making a case for something we all agree we want for our own software based on things that have little to nothing to do with it.

Any system that verified that HHVM had correctly restarted would have given @ori a green light too.

Correct, but a system that kept an eye on error rates could potentially have aborted the deploy, as might an automated canary setup. Whether the error rate increase in this particular case was significant enough to trigger a conservative threshold I don't know, but that's beside the point.

From first-hand experience, I know that we have had many deployment-triggered outages that could have been prevented by some automated checking in the deployment system. I do think we should make an effort to avoid the outages we can with moderate effort, by improving our deployment tooling when we have the opportunity, as in this case for services.

@GWicke: Where is the error rate for services logged? I'd like to try my hand at building a monitoring task that watches logs/metrics and attempts to detect anomalous increases during deployments.

@mmodell, the set of metrics and logs to look at depends on the service. For RESTBase, we could for example look at the 5xx rates from graphite. I'm not sure what the options for interfacing with logstash are, but am optimistic that we can figure something out.

Generally, your new system should provide some baseline monitoring (like a port check) out of the box, but also allow adding custom monitoring per service similar to Ansible. Checks like port availability, HTTP requests, grepping a log file or HTTP response should be possible and reasonably easy to configure. We also need support for templating metric names or web requests, so that we can retrieve the data corresponding to the cluster or node we are deploying to (labs, staging, prod etc).

OK, I was able to figure out how to query graphite from python. So far I have a class that fetches JSON and analyzes the data to compute the current, historical maximum, and average error rate over a given time range.

What kind of configuration parameters would be useful? Should the service deployment config define the baseline & threshold in absolute terms or should the deployment tool try to be smart about it and automatically determine the threshold from recent historical data?

Without putting too much thought into it so far, I came up with these possible strategies (a rough sketch follows the list):

  • Alert when error rate increases by at least 1 standard deviation
  • Alert when error rate exceeds the max error rate seen over the past x minutes (configurable?)
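
For illustration, here is a rough sketch (not the actual class) of what such a check could look like. It assumes Graphite's standard /render?format=json API; the Graphite host and the metric name are placeholders:

  import json
  import statistics
  import urllib.parse
  import urllib.request

  GRAPHITE = 'https://graphite.example.org'  # placeholder host

  def fetch_error_rates(metric, minutes=30):
      """Fetch the last `minutes` of datapoints for a metric via the render API."""
      url = '%s/render?target=%s&from=-%dmin&format=json' % (
          GRAPHITE, urllib.parse.quote(metric), minutes)
      with urllib.request.urlopen(url) as resp:
          series = json.load(resp)
      # Datapoints are [value, timestamp] pairs; drop gaps (null values).
      return [value for value, _ts in series[0]['datapoints'] if value is not None]

  def is_anomalous(metric, minutes=30):
      """Apply both strategies above: stddev increase and historical maximum."""
      values = fetch_error_rates(metric, minutes)
      if len(values) < 3:
          return False  # not enough history to judge
      history, current = values[:-1], values[-1]
      over_stdev = current > statistics.mean(history) + statistics.pstdev(history)
      over_max = current > max(history)
      return over_stdev or over_max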

@mmodell you might be interested in the check_graphite nagios script we use; it has threshold alerts and alerts based on Holt-Winters forecasting too.

@Joe: thanks, that seems a lot more powerful than the ideas I had come up with so far.

So we can call the nagios check scripts directly from python?

https://github.com/wikimedia/operations-puppet/blob/acacf97e2df962fef83487a461f3559fa07e4d6f/modules/monitoring/manifests/graphite_threshold.pp is the puppet config, still haven't looked at the check_graphite script directly, gonna go do some more reading now. Thanks for the suggestions!

What kind of configuration parameters would be useful?

The very first thing would be a way to configure which metrics / logs to look at in a DC / cluster / node agnostic way. Conventionally, this is done by templating check parameters with variables defined in the config system portion of your project.

For metrics, a conservative static threshold would already be a good and predictable start. Before we use Holt-Winters in deploys, it might be a good idea to first try it in an alert. It should also be easy to override this threshold in emergencies (or to override checks in general).

It would also be a good idea to call @Joe's monitoring script, as that will automatically probe most entry points of the service based on the API spec.
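
As a toy illustration of the templating idea (the metric pattern, environment variables, threshold, and helper names below are invented for the example, not an existing schema):

  from string import Template

  # Per-environment variables, as they might come from the config system
  # portion of the project (values here are made up).
  ENV_VARS = {
      'prod':    {'dc': 'eqiad', 'cluster': 'restbase'},
      'staging': {'dc': 'eqiad', 'cluster': 'restbase-staging'},
  }

  # A check with a templated metric name and a conservative static threshold.
  CHECK = {
      'metric': Template('$cluster.$dc.5xx.rate'),
      'threshold': 5.0,
  }

  def metric_for(env):
      """Expand the templated metric name for the cluster / DC being deployed to."""
      return CHECK['metric'].substitute(ENV_VARS[env])

  def check_passes(current_value, threshold=None):
      """Static threshold check; the threshold can be overridden in an emergency."""
      limit = CHECK['threshold'] if threshold is None else threshold
      return current_value <= limit

  # Usage, with some function that fetches the latest value of a metric:
  #   check_passes(latest_value(metric_for('prod')))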

Timely blog post on lessons learned from post-mortems:

Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages, and nothing else even seems close. When I looked at publicly available postmortems, searching for “global outage postmortem” returned about 50% outages caused by configuration changes. Publicly available postmortems aren’t a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I’m often told that it’s obvious that config changes are scary, but it’s not so obvious that most companies test and stage config changes like they do code changes.

Except in extreme emergencies, risky code changes are basically never simultaneously pushed out to all machines because of the risk of taking down a service company-wide. But it seems that every company has to learn the hard way that seemingly benign config changes can also cause a company-wide service outage. For example, this was the cause of the infamous November 2014 Azure outage. I don’t mean to pick on MS here; their major competitors have also had serious outages for similar reasons, and they’ve all put processes into place to reduce the risk of that sort of outage happening again.

We've used Holt-Winters repeatedly in alerting, and it needs work to be useful: sometimes it catches problems before the hard thresholds do, but most of the time it catches them later, and in general it generates too many false positives to be really useful.

3 of Parsoid's 4 bad deploys over the last 6 months could be directly traced to a bad config (and the fact that staging and production configs are different).

In 2 cases, the problems were fixed in under 5 mins (after the cluster went down for that period), but a canary deploy would have prevented those. In the 3rd case, the problem was caught as part of our manual canary deploy that we now use (restart parsoid on one node and monitor graphs for a while before restarting everything).

I have links for all these if someone wants more info.

So what we really need, in my understanding, is to introduce canary hosts in all our clusters, allow identifying at the frontend (e.g. varnish) level which traffic is being directed to them, and make it easy to see whether frontend errors come from those hosts.

The idea of canary releasing, if well executed, is IMO superior to just doing rolling restarts and automatic checks.

The workflow would be:

  • Release to the canaries
  • Watch for errors, possibly on a dedicated dashboard with ratios of errors for canary users vs. normal users
  • Release to the whole cluster

This will need a lot of work to be done right and I don't think it's going to be in place very soon. For now it would be good for scap to be able to get a list of canaries and release just to those.
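
A minimal sketch of that workflow, assuming hypothetical deploy_to() and error_ratio() helpers provided by the deployment tool and the monitoring side (the soak time and threshold are arbitrary):

  import time

  def canary_release(all_hosts, canary_hosts, deploy_to, error_ratio,
                     soak_seconds=1800, max_relative_increase=0.5):
      """Release to the canaries, compare error ratios, then release everywhere."""
      deploy_to(canary_hosts)                   # 1. release to the canaries
      time.sleep(soak_seconds)                  # 2. let them serve traffic for a while

      rest = [h for h in all_hosts if h not in canary_hosts]
      if error_ratio(canary_hosts) > error_ratio(rest) * (1 + max_relative_increase):
          raise RuntimeError('canary error ratio too high, aborting')

      if input('Canaries look OK, release to the whole cluster? [y/N] ').lower() != 'y':
          return False
      deploy_to(rest)                           # 3. release to the remaining hosts
      return True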

@Joe, we probably want *both* a canary deploy *and* a rolling deploy in general. With RB, we tend to deploy to one node first, then let it run for 30 minutes or so while checking logs and metrics before continuing with the rolling deploy. Basically, the only difference between the canary portion & the general rolling deploy is the wait time before proceeding.

So a canary is just a rolling deploy with a specified starting host (canary) followed by a modal "continue y/n" prompt, and finally a full rolling deploy to the remaining hosts?

I agree with @Joe: some advanced monitoring and perhaps varnish integration would be ideal. We can work towards that incrementally.

@mmodell "canaries" usually serve a small amount of the service traffic, say 10%, so they're more than one server in general.

In my mind, a deploy in an "ideal world" would work as follows:

"deploy canary" => will deploy the new code to all the canary hosts, which have been configured

A human can then check how things are going, and hopefully we'll also have dedicated monitoring aggregations (like how many 5xx responses are coming from the canary pool, etc.).

"deploy all" ==> will deploy the new code to all hosts that are still not using the new code.

This is not much different from what Services does with RB, except that you'll have a test pool that is not strictly just one host.

As for rolling deploys: it's probably safe to assume you can deploy to a certain percentage of the cluster at the same time. In practice that means a rolling deploy on RB and other non-huge clusters, and a few servers at a time on the larger ones, e.g. mediawiki and parsoid at the moment.

Following up on the subpoints, point by point, expanding on T109535#1691326:

  • rolling deploys / config changes
    • automated roll-out of config / code changes one {instance,batch} at a time

Scap3 deploys both code and config in a rolling way. The scap.cfg variable batch_size defines the parallelism of all deploy stages. There is also a per-stage batch_size (e.g., fetch_batch_size) that allows more control at a per-stage-per-deployed-repository level.

  • orchestrate pybal with upgrade (zero downtime), T73212

This is a discussion we will soon start with ops, but it is not implemented yet.

  • wait for individual instances to come back up & check correct functioning (wait for ports, check logs / error metrics, perform test requests etc) before proceeding

We have currently merged only a port check; however, there is a patch in review, which should merge shortly, that adds more checks (see D11).

  • stop roll-out if more than x% of machines failed

We currently stop deployment if any of the config-deployment, fetch, or checkout fails in a batch. A max-fail percentage becomes nebulous when you allow for parallelism.

  • needs to be reasonably easy to set up, configure, use and test for developers
    • should scale down for small-scale deployments & testing

@dduvall created a Vagrant machine that creates and deploys to 10 LXC containers for testing (https://github.com/marxarelli/scap-vagrant). This is a really lightweight approach, and a reasonable model for future deploy testing.

  • integration with config management to cleanly orchestrate config changes with code changes
    • allow developers to use proper config management without needing root

Troubleshooting without root should be relatively easy with Scap3. If any remote command fails, a full stack trace along with the error and the full command will be dumped to the console. These logs are also sent to Logstash.

Since Scap3 uses a remote user whose SSH key is stored in keyholder, it should be easy to ssh to the machine as the deploy_repo_user (defined in the per-repo scap.cfg) and try running any commands that may have failed.

  • apply the same rolling deploy / canary precautions for both config and code changes

Config changes and code changes are rolled out using per-stage parallelism. This is true for the canary config and code deploy as well as for the regular config and code deploys. This allows a rolling deploy of both config and code, or a parallel fetch and a completely serial checkout ("promote" is the term). That is, if you define batch_size: 1 and fetch_batch_size: 80 in scap.cfg, the config deployment (config_deploy step) and the code deployment (promote step) will happen serially, one host at a time, but code will be fetched from tin with a parallelism of 80 (hosts will fetch from tin in batches of 80).
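
As a toy illustration of that per-stage batching (this loop is not Scap3's implementation, and Scap3 runs each batch in parallel rather than sequentially; the stage names in the usage comment simply mirror the ones mentioned above):

  def run_stage(hosts, stage, batch_size):
      """Run one deploy stage over all hosts, batch_size hosts per batch."""
      for i in range(0, len(hosts), batch_size):
          for host in hosts[i:i + batch_size]:  # Scap3 runs the batch in parallel
              stage(host)

  # With the example above (batch_size: 1, fetch_batch_size: 80):
  #   run_stage(hosts, fetch, 80)          # fetch from tin in batches of 80
  #   run_stage(hosts, config_deploy, 1)   # config deployed one host at a time
  #   run_stage(hosts, promote, 1)         # code promoted one host at a time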

It should be noted that canary deployment is in review, but will merge shortly.

  • support reasonably easy testing of config changes (ex: --check --diff in Ansible, targeted application of changes to individual nodes)

This is still being developed but should be done shortly; the current idea is a --test flag that shows what would change, as well as a diff, for configuration deployment.

  • integration with build process, CI and staging
    • automate dependency updates and build process (ex: T94611)
    • what gets deployed to production is identical to what we tested on staging cluster, etc.

Integration with beta is slated for (hopefully early) this quarter.

  • minimum privilege operation; should have a small attack surface

We've made some changes to keyholder to make sure it checks that the deployer is in a group that is authorized to use a particular key. Keeping authorized groups small and meaningful limits accidental errors from people who are authorized to use a key in keyholder but should perhaps not deploy anything for the service team.

Further, the design of Scap3 uses a normal user (defined in scap.cfg as deploy_repo_user); any service restarts and any files that user needs to create will all have to be set up through puppet.

  • (eventually) consistent deploys
    • code and config versions are explicitly specified in deployment config (audit trail)

The code revision is explicitly specified during deploy: deploy --rev [revision]. Config deployment is agnostic to whether or not git is used on tin, but config templates are found in /srv/deployment/[repo]/scap/templates/[template-name].j2 (there is also an environment flag that could potentially affect that path, i.e. deploy -e staging would look for a template in /srv/deployment/[repo]/scap/staging/templates/[template-name].j2). The revision is logged both in Logstash and in the repository on tin, using two methods: git annotated tags, and a file at /srv/deployment/[repo]/.git/DEPLOY_HEAD that records the most recently deployed revision.

  • services don't auto-start on boot until config and code are brought up to date

This is likely something we'll address with a puppet scap3 provider, which is yet to be implemented but will likely be an undertaking in the coming quarter.

  • aborted deploys are rolled back cleanly

Scap3 uses a symlink method fairly similar to Capistrano to support clean rollbacks. At /srv/deployment/[repo]-cache on tin there are three directories: revs, current, and cache. Code is fetched from tin into cache and then checked out to a new directory under revs, named with the pattern sha1-of-compiled-config-files_sha1-of-deployed-revision. The "promote" stage of deployment (the only stage that really needs to run serially / in batches for a rolling deploy) is simply a symlink swap from one rev-dir to a new rev-dir. After that swap happens, the configured checks (port and icinga) run, and if there are any failures the deployer is prompted to roll back. The rollback is simply another symlink swap, so it is quick and atomic.
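
A small sketch of that symlink-swap idea (paths follow the description above; the helper itself is illustrative, not Scap3's actual code):

  import os

  def promote(repo, rev_dir, base='/srv/deployment'):
      """Point the 'current' symlink at a new revs/<config-sha1>_<code-sha1>
      directory; a rollback is the same swap aimed at the previous rev-dir."""
      cache = os.path.join(base, repo + '-cache')
      target = os.path.join(cache, 'revs', rev_dir)
      tmp_link = os.path.join(cache, 'current.tmp')
      os.symlink(target, tmp_link)                           # build the new link off to the side
      os.replace(tmp_link, os.path.join(cache, 'current'))   # atomic swap via rename()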

akosiaris subscribed.

I'm unsure why this is Blocked-on-Operations. I'll remove the tag for now; feel free to re-add it.

@akosiaris, I added the tag to reflect that several aspects of these requirements (especially config management / testing / deploys) are blocked by operations. These aspects are considered out of scope for Scap3, proposals like T107532 have been blocked, and no solution has been proposed from ops.

There is hope that the move towards containers will provide the impetus to make progress here. Until then, I am proposing to re-add the tag to reflect the current blocked state.

Seems like we have different meanings for Blocked-on-Operations. For me the tag means "somehow operations is blocking someone and there is something concrete that can be done to resolve the block". At least that's the concept I try to have in mind whenever I am on clinic duty and looking at the clinic duty dashboard https://phabricator.wikimedia.org/dashboard/view/45/. It helps me see what needs doing and take actions that will resolve the block. Of course this mental model makes no sense with huge tracking tasks like this one, at which point the danger becomes a mindset of "er, yeah, I am on clinic duty and should unblock things, but there is nothing I can do right now about this task, so ignore it".

If you want to use Scrum terms, this task is very close to an Epic. I would suggest breaking it down into actual actionable tasks and marking those as Blocked-on-Operations, in order to maintain some viable workflow for this task. Otherwise I am very much afraid the tag will lose its meaning.

Until then, I am proposing to re-add the tag to reflect the current blocked state.

I would suggest stalling it instead. You've already used the word "state" to describe the situation the task is currently in, so I'd say it makes the most sense to use the actual Phabricator Status field to reflect that.

We briefly discussed this ticket with respect to the scap3 work in the deployment working group meeting today (https://www.mediawiki.org/wiki/Deployment_tooling/Cabal/2016-05-09#Blocking_RESTBase).

It seems the main things left to untangle are:

  1. Configuration deployments

These are implemented in Scap3, but haven't been tested in production—there is some puppet work that needs to be done to support this.

  2. Easy setup and testing

Untangling the existing Scap pieces from the deployment::server pieces in puppet. @Krenair did some work that benefits this recently (https://gerrit.wikimedia.org/r/#/c/284851)

  3. Config --diff and --test

@mmodell has been doing some important refactoring work currently blocking --diff.

Scap3 should address the other key points here.

GWicke claimed this task.

Changing status to resolved, as many (but not all) of the requirements discussed here are now implemented in scap3. The remaining ones, like ease of setup, will be useful to consider for Kubernetes-based deploys. See this document for related draft requirements.