
Evaluate Ansible as a deployment tool
Closed, Declined (Public)

Description

Ansible is a config management / orchestration system that doesn't require a central server. It uses SSH as its transport, which makes it relatively easy to use even in small setups.

It has relatively good support for rolling upgrades built in, along with modules for git, apt, docker and other typical installation methods.

A fairly simple use of Ansible would be to replicate trebuchet functionality with the git module. This should already provide us with a good feel for its rolling deploy capabilities.


Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added subscribers: fgiunchedi, mark, Joe and 6 others.
Restricted Application added a subscriber: Aklapper. Mar 20 2015, 9:35 PM
GWicke updated the task description. Mar 20 2015, 9:37 PM
GWicke set Security to None.
mmodell claimed this task. Mar 31 2015, 6:31 AM

I think ansible looks really cool and I'm going to play around with it to see how it might fit.

greg triaged this task as Normal priority. Apr 2 2015, 5:08 PM
greg moved this task from To Triage to Backlog (Tech) on the Deployments board.
GWicke updated the task description. Apr 6 2015, 1:54 AM
GWicke added a subscriber: ori. (Edited) Apr 6 2015, 3:08 PM

I looked a little bit into integrating with hiera data. There is a rudimentary lookup module, but it does not follow regular Ansible lookup module semantics, and doesn't work in the master-slave puppet setup we are running.

I tried various hiera cli invocations (ex: sudo hiera restbase::seeds -d ::role=restbase ::fqdn=deployment-restbase01.eqiad.wmflabs) on the puppet master in labs, but didn't find a way to pass in role-specific facts to trigger the role & host specific lookup. IIRC @ori mentioned that he wrote a hiera lookup tool, which might be just the ticket here.

Another option could be to dump puppet/hiera variables to a root-readable file on each host. The disadvantage is that reading this data would require root on the target host.

@GWicke, FWIW I think the hiera invocation you need for labs on deployment-salt is:

sudo RUBYLIB=/var/lib/puppet/lib hiera --debug restbase::seeds ::instanceproject=deployment-prep ::hostname=deployment-restbase01 -c /etc/puppet/hiera.yaml

@thcipriani, thank you! Given this we should be able to put something together based on https://gist.github.com/mrbanzai/8720298, which is also shelling out to hiera.
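
For illustration, shelling out to hiera along these lines could look roughly like the sketch below. It is a minimal sketch, not the code from the linked gist: the function names `build_hiera_command` and `hiera_lookup` are hypothetical, the `sudo` and `RUBYLIB` parts of the invocation above are left out, and the default config path is simply the one used in that invocation.

```python
import subprocess
from typing import Callable, Dict, List


def build_hiera_command(key: str, scope: Dict[str, str],
                        config: str = "/etc/puppet/hiera.yaml",
                        debug: bool = False) -> List[str]:
    """Build the argv for a hiera CLI lookup of `key` (hypothetical helper)."""
    argv = ["hiera"]
    if debug:
        argv.append("--debug")
    argv += ["-c", config, key]
    # Scope facts are passed as name=value pairs, e.g. ::hostname=somehost.
    # Sorting only makes the resulting command line deterministic.
    argv += [f"{name}={value}" for name, value in sorted(scope.items())]
    return argv


def hiera_lookup(key: str, scope: Dict[str, str],
                 runner: Callable = subprocess.run) -> str:
    """Run hiera and return its stdout, stripped of surrounding whitespace."""
    result = runner(build_hiera_command(key, scope),
                    capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

A call such as `hiera_lookup("restbase::seeds", {"::instanceproject": "deployment-prep", "::hostname": "deployment-restbase01"})` would then mirror the lookup above; the `runner` parameter exists only so the function can be exercised without a hiera install.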

mmodell reassigned this task from mmodell to GWicke. Apr 14 2015, 5:39 PM

Although ansible looks really cool, I think it's going to be a tough sell as long as we are using puppet and salt as our 'official' configuration management and cluster administration tooling.

Also, release engineering has been attempting to come to terms with all of the problems with current deployment tooling in order to formulate a plan for how we can converge on something that works for everyone.

The current plan involves making improvements to trebuchet to address any deficiencies and hopefully meet everyone's needs.

@GWicke: Although I like ansible and would personally advocate its use, I think it's going to require a strong argument for why we can't achieve the same results with existing tools.

@mmodell, our requirements are described in T93428. If you can find a solution that genuinely addresses those (or at least have a clear plan for something that will & steps for getting there) then I'll be happy.

Could you create a ticket and describe how you plan to address the requirements?

@GWicke: we are still working on the plan, T94620 will track that work and I will try to address your requirements there. I'm also adding T93428 as a blocker for that task.

GWicke added a comment. (Edited) Apr 14 2015, 6:00 PM

we are still working on the plan, T94620 will track that work and I will try to address your requirements there.

@mmodell, right now I see nothing in T94620 that describes how our requirements would be addressed. I would suggest that making an apparent decision ("The current plan of attack is to enhance trebuchet ") without a clear plan is rather premature. Our focus at this point should be to evaluate options based on the requirements, and then make a decision on the best way forward.

@GWicke: the current task is developing the plan. Once we have a workable plan that does address everyone's needs, and seems achievable based on our research, then we can make any final decisions. I was just mentioning it (perhaps prematurely) in order to keep you in the loop on what we have been working on.

the current task is developing the plan

Okay, looking forward to hearing more. You might also want to correct the wording about trebuchet to reflect that no decision has been made.

@GWicke: I'm curious how you envisioned using ansible. I suspect that one of the main reasons that ansible is appealing is because it would be more straightforward to debug when things go wrong and would reduce dependence on the operations team - allowing faster development and deployment cycles. This is one of the major advantages of something like ansible, at least from my perspective. Did you get the same impression or can you see other advantages that I'm missing?

GWicke added a comment. (Edited) Apr 14 2015, 9:15 PM

@mmodell, my impression is that Ansible lets us address many of the requirements outlined in T93428 in a fairly simple and straightforward way. I have played a tiny bit with its rolling deploy capability, and liked what I saw.

Do you have an idea for a good evaluation task? Proper rolling deploys with health checks, pybal orchestration etc would be my default, but maybe you can think of something that's more interesting?

Do you have an idea for a good evaluation task? Proper rolling deploys with health checks, pybal orchestration etc would be my default, but maybe you can think of something that's more interesting?

That does seem like a good baseline. It seems like ansible is especially well suited for this exact use-case and it will take a bit of work to make trebuchet do it as well.

One issue with ansible is that it uses ssh as both the transport and the authentication mechanism. This is both a good and a bad thing. It's good because it's fairly simple and well understood. It's bad because it requires quite a bit of key management and distribution, while also opening up quite a lot of security attack surface. I'm really not a fan of salt myself, but it's already in use and provides transport and authentication; we would have to duplicate all of that to get ansible up and running.

Although I think ansible might be worth the duplication simply because it already has a well-developed rolling deployment framework, so far the people I've talked to haven't been terribly receptive to the idea of adding one more deployment/configuration management system when we already have two of each in our production environment.

The argument for salt/trebuchet is that it's simple enough and extensible enough that it won't be difficult to add the rolling restarts and health checks, and we won't have to deal with ssh key management / firewall rules / etc.

So I'm not 100% convinced either way is the best, but trebuchet is certainly the one with the least opposition from Operations.

Do you have an idea for a good evaluation task? Proper rolling deploys with health checks, pybal orchestration etc would be my default, but maybe you can think of something that's more interesting?

That does seem like a good baseline. It seems like ansible is especially well suited for this exact use-case and it will take a bit of work to make trebuchet do it as well.
One issue with ansible is that it uses ssh as both the transport and the authentication mechanism. This is both a good and a bad thing. It's good because it's fairly simple and well understood. It's bad because it requires quite a bit of key management and distribution, while also opening up quite a lot of security attack surface. I'm really not a fan of salt myself, but it's already in use and provides transport and authentication; we would have to duplicate all of that to get ansible up and running.

FWIW, with trebuchet/salt we traditionally need ssh to actually restart the services anyway, due to T63882. I also don't think that storing a private key owned / readable only by a system user on a deploy host for deployment tasks is any harder than managing the trebuchet sudo rules. We can also continue to use our individual keys, of course.

As for attack surface: We have SSH installed on all boxes. Whether we grant devs direct shell access or not is up to us. I'm actually more scared about salt running as root on each node.

Although I think ansible might be worth the duplication simply because it already has a well-developed rolling deployment framework, so far the people I've talked to haven't been terribly receptive to the idea of adding one more deployment/configuration management system when we already have two of each in our production environment.

Sadly the combination of those two systems does not really address our requirements. I think it's important that we get this right. We need to think ahead enough to avoid needing to look for yet another deployment solution soon.

The argument for salt/trebuchet is that it's simple enough and extensible enough that it won't be difficult to add the rolling restarts and health checks, and we won't have to deal with ssh key management / firewall rules / etc.

If you ever set up your own salt master and trebuchet in a new labs project, you might not be inclined to use 'salt', 'trebuchet' and 'simple' in the same sentence again ;)

trebuchet is certainly the one with the least opposition from Operations.

Let's base our evaluation primarily on technical properties, and less on political issues. Technical issues will be with us for a long time, while political ones come & go.

@GWicke: I'm totally with you on pretty much every point, though I'm not the only one that has to be convinced.

GWicke updated the task description. Apr 19 2015, 1:12 AM

There's a team that's working on deployment, right? Are you a member of that team, @GWicke? Proposing an alternative outside of that team means you're actively fighting them, making their job harder and wasting everyone's time.

You've been at this for a year. If you're really interested in building a deployment system, join the team or give up on this.

As for Ansible itself, see my quite extensive blog post on this:

http://ryandlane.com/blog/2014/08/04/moving-away-from-puppet-saltstack-or-ansible/

Ansible using SSH always looks like a great reason to use it at first, but it breaks down pretty quickly. You get all the downsides of dealing with SSH and none of the pluses of having minion/master.

Also, if you *really* want to use SSH, salt has salt-ssh, which has the ability to use ansible rosters, if you're in love with the roster format.

mmodell added a comment. (Edited) Apr 26 2015, 8:35 AM

@RyanLane: Thanks for the feedback, and your extensive blog post was a really good read.

I also enjoyed http://ryandlane.com/blog/2015/04/02/saltconf15-sequentially-ordered-execution-in-saltstack-talk-and-slides/

GWicke added a comment. (Edited) May 7 2015, 12:22 AM

FWIW, I have recently used Ansible for some simple deploy-related tasks:

  • Rolling RESTBase deploy + restart + check: checks out current master from git deploy repo, performs a graceful restart, waits for each to come back up, aborts if more than 20% fail
  • Rolling restart + check: performs a graceful restart, waits for each to come back up, aborts if more than 20% fail, waits another 10s so that LVS can connect
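
The control flow of such a rolling deploy can be sketched in a few lines of plain Python. This is only an illustration of the "one host at a time, abort if more than 20% fail" logic described above, not the Ansible playbook itself; `rolling_deploy`, `DeployAborted` and the `deploy` / `is_healthy` callables are hypothetical names.

```python
from typing import Callable, Iterable, List


class DeployAborted(Exception):
    """Raised when too many hosts fail during a rolling deploy."""


def rolling_deploy(hosts: Iterable[str],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   max_fail_fraction: float = 0.2) -> List[str]:
    """Deploy to hosts one at a time; abort once too many have failed.

    Returns the list of hosts that failed their health check, if the
    failure threshold was never exceeded.
    """
    hosts = list(hosts)
    failed: List[str] = []
    for host in hosts:
        deploy(host)  # e.g. git checkout + graceful restart
        if not is_healthy(host):  # e.g. wait for the service to come back up
            failed.append(host)
            # Abort only when strictly more than the allowed fraction failed.
            if len(failed) > max_fail_fraction * len(hosts):
                raise DeployAborted(
                    f"{len(failed)} of {len(hosts)} hosts failed: {failed}")
    return failed
```

With ten hosts and the default threshold, two failed hosts are tolerated and the third failure stops the deploy before the remaining hosts are touched.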

Just in case that’s useful: I’m a long-time Ansible user and use it for deployments, so I'm happy to help out if I can.

Joe added a comment. (Edited) May 18 2015, 1:37 PM

Although ansible looks really cool, I think it's going to be a tough sell as long as we are using puppet and salt as our 'official' configuration management and cluster administration tooling.

A tough one indeed.

GWicke added a comment. (Edited) May 18 2015, 2:13 PM

A tough one indeed.

We are primarily interested in rolling deployments here, which puppet doesn't support. The question is not so much about puppet vs. X, but about finding a deployment system.

I'm pretty happy with how simple, flexible and reliable rolling deploys with checks + restart are with Ansible (see code). I have used it for RESTBase deploys and restarts in the last ~2 weeks, and didn't encounter any issues. It's easy to adapt the checks to different services, and testing is relatively easy by running the playbook against labs hosts (and/or performing a dry run with --check).

It does tick most (if not all) of the boxes we described in T93428, so looks like a very strong contender from a services perspective.

Joe added a comment. (Edited) May 19 2015, 6:09 AM

@GWicke, I regularly use fabric to automate tasks on my side, so no one argues if you want to use ansible on your computer to automate things in production. If we want to use something as an official tool for the WMF, however, I'd either stick with salt or migrate away from it. And we must have a compelling reason to do that. I'm pretty sure writing deploy code for rolling restarts wouldn't be so tough on salt either. At the moment, I don't see any consensus on this. Or, I see a pretty strong consensus in the direction of not throwing everything away if not strictly necessary.

But this must be the fifth time I've told you this; I hope my position is clear this time.

mmodell lowered the priority of this task from Normal to Low. Jul 6 2015, 8:05 PM
mmodell moved this task from Backlog (Tech) to In-progress on the Deployments board.
jeremyb added a subscriber: jeremyb. Jul 6 2015, 9:08 PM
GWicke added a comment. (Edited) Jul 6 2015, 11:15 PM

We have been using Ansible for RESTBase deployments and Cassandra restarts for some time now (~2 1/2 months with our staging cluster, 1 month in prod), and are overall quite happy with it. I found one issue in the way older Ansible versions handle git submodule checkouts (they defaulted to master, rather than the referenced hash), but this is fixed in Ansible >= 1.8. Things that worked well:

  • simple yet flexible per-service config for the deploy workflow: the restbase deploy is defined by 26 lines of yaml
  • deployed to staging before production
  • automatically aborted broken deploys after the first node didn't come back up
  • avoided downtime or dropped client requests by waiting for each node to be ready to serve requests before proceeding
  • rolled back a deploy that turned out to leak memory in the longer term
  • deployed to a single canary prod node and successfully debugged this memory issue after determining that it only manifested itself with the production data set and request load
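
The "wait for each node to be ready before proceeding" step boils down to polling a readiness probe until it succeeds or a deadline passes. A minimal sketch, assuming a caller-supplied `probe` callable (for example an HTTP request against the service); the name `wait_until_ready` and its default timeout and interval are illustrative and not taken from the actual playbook:

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 60.0,
                     interval: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds have passed.

    Returns True as soon as the probe succeeds, False on timeout.
    The probe is always tried at least once.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A deploy loop would call this after restarting a node and only move on to the next node when it returns True, which is what keeps client requests from being dropped.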

The improvements I'd like to see next aren't so much in Ansible itself, but in how we integrate with the environment:

  • poll monitoring (graphite, logstash) during deploys
  • get config data from etcd
  • set up the right permissions and ssh keys so that we can share an Ansible install on a deploy host, rather than running it directly from our laptops

GWicke updated the task description. Aug 4 2015, 6:29 PM

We have now started to look into configuration deployment support as well. See T107532: [Discussion] Move restbase config to Ansible (or $deploy_system in general)? for the details.

GWicke closed this task as Declined. Apr 19 2017, 11:12 PM

We have used ansible for restbase deployments until recently, and have now migrated to WMF's own scap 3 tool. Longer term (over the next year) we are planning to move towards container-based deploys using Kubernetes.