[Discussion] Move restbase config to Ansible (or $deploy_system in general)?
Closed, Resolved (Public)

Description

RESTBase is completely driven by its Swagger spec based configuration. This means that we often have a need to co-deploy the configuration and code in a coordinated manner. This is true not only for production, but also for testing.

The current process for coordinating config and code changes is to

  1. disable puppet on all nodes
  2. <s>beg</s> gently ask an ops person to merge the relevant puppet config
  3. manually re-enable puppet on one node
  4. deploy (using Ansible) to that one node only
  5. if successful, proceed with the next node

This is fairly complex, time-intensive and easy to get wrong. Mistakes in this process have led to outages in the past.

Instead, we could move the RESTBase configuration into the RESTBase deploy system. This is mostly straightforward:

  • config variables can be copied to group_vars
  • templates can be converted to Jinja2 templates used by Ansible
  • a stanza to deploy the config is added to the deploy task
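
As a rough illustration of the three steps above (not the actual code from any deploy repo), a config stanza in a deploy play might look like the following. The host group, variable names, file paths and handler are all assumptions made for the sketch.

```yaml
# Hypothetical sketch: group name, variables and paths are illustrative only.
# group_vars/restbase.yml would carry the values formerly set in puppet, e.g.:
#   restbase_config_path: /etc/restbase/config.yaml
- hosts: restbase
  serial: 1                        # roll through the cluster one node at a time
  tasks:
    - name: deploy RESTBase config from a Jinja2 template
      template:
        src: templates/config.yaml.j2
        dest: "{{ restbase_config_path }}"
        mode: "0440"
      notify: restart restbase     # only fires when the rendered file changed

  handlers:
    - name: restart restbase
      service:
        name: restbase
        state: restarted
```

Because the template task only reports a change when the rendered file differs, a node whose config is already current is left alone and is not restarted.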

However, we don't want to expose secrets like C* passwords in our deploy repo. Possible ways to still support this:

  1. symlink a private group_vars file in place on a deploy host
  2. submodule pulling in a private gerrit repository
  3. manual text file
  4. export private data from puppet on the destination system (to a file), and pull it into ansible at runtime

To me the combination of 3) for testing and 4) for production sounds most promising.
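
A minimal sketch of what option 4 could look like in Ansible, assuming puppet writes the secrets to a YAML file on each target host. The file path and variable names are hypothetical.

```yaml
# Hypothetical sketch of option 4; the path and variable names are assumptions.
- hosts: restbase
  tasks:
    - name: read private data exported by puppet on the destination host
      slurp:
        src: /etc/restbase/private.yaml
      register: private_file
      no_log: true                 # keep secrets out of the Ansible output

    - name: expose the secrets as a variable for later template tasks
      set_fact:
        restbase_private: "{{ private_file.content | b64decode | from_yaml }}"
      no_log: true

    # Option 3 (testing) could instead load a manually maintained local file
    # that is never committed, for example:
    # - include_vars: private/testing.yml
```

With this approach the secrets never enter the deploy repo; they stay on the hosts where puppet already manages them.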

I have two main questions:

  1. do you think this is worth doing, and
  2. how do you think we should deal with the secret issue?

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke set Security to None.

I have now prototyped this in https://github.com/wikimedia/ansible-deploy/.

During development and testing, Ansible's --check --diff mode was very helpful: it performs a dry run and shows the changes that would have been applied.

Using this, I have now deployed a new config and code to the staging cluster in preparation for the next bigger deploy. For this, I temporarily disabled puppet on the staging cluster. To do it properly, we'd need to remove the config file from puppet and deploy it using Ansible only.

Change 229306 had a related patch set uploaded (by GWicke):
Disable RESTBase config.yaml deploys in puppet

https://gerrit.wikimedia.org/r/229306

So is releng settled on using ansible for deploys as @GWicke's comments in https://gerrit.wikimedia.org/r/#/c/229306/ seem to suggest?

If that is not the case, this move would mean that any ops person wanting to make an emergency config change would need to know how to set up your Ansible environment and how to operate Ansible.

That is not the end of the world, of course, as long as ops don't have to learn to operate 50 different subsystems. Still, one would expect that something like a config deploy would not require installing, configuring and (at least basically) understanding a new tool specifically for that job.

We want to have one deployment system that addresses all our needs and that will have a clear plan to replace trebuchet/scap.

Until then, doing something like this will just set in stone that restbase will be completely controlled via Ansible, which is exactly what we don't want.

So is releng settled on using ansible for deploys as @GWicke's comments in https://gerrit.wikimedia.org/r/#/c/229306/ seem to suggest?

The deployment cabal has got a basic wrapper around Ansible done, and the next step is to expand its functionalities.

We want to have one deployment system that addresses all our needs and that will have a clear plan to replace trebuchet/scap.

Of course, that's the end goal. That said, this is somewhat orthogonal to this discussion because currently we do not effectively have a configuration management/deployment tool/utility/procedure/etc for RESTBase.

Let me explain. Configuration changes are often tied to the new code to be deployed and vice versa. In that sense, we have come to think of config change deploys as almost-regular code deploys. They need to be coordinated, and we have a genuine need to be able to roll back (the code, the config or both) quickly when we notice something is wrong. Puppet doesn't really allow us to do a synchronous deploy of code and config, nor does it allow us to do a hassle-free rollback.

Until then, doing something like this will just set in stone that restbase will be completely controlled via Ansible, which is exactly what we don't want.

I agree, but I also think this is an over-statement. Until the deployment cabal has a ready-to-use tool, we (both of our teams) are left to our own devices when it comes to RESTBase, so I'd say there is no wrong or right way when it comes to config management.

In the last deployment cabal meeting we discussed config management and agreed that it is going to be an integral part of deployment tooling (on that note, etcd integration is also something that needs to be tackled). Because of the restrictions on ops/puppet, configuration files (for RESTBase, but also for other services) are going to live somewhere else.

To get back to the RESTBase config topic, we are, of course, open to suggestions for config management and deployment in the short term, as long as that doesn't complicate the status quo even further.

akosiaris triaged this task as Medium priority. (Aug 6 2015, 1:40 PM)
akosiaris added a subscriber: akosiaris.

So is releng settled on using ansible for deploys as @GWicke's comments in https://gerrit.wikimedia.org/r/#/c/229306/ seem to suggest?

The deployment cabal is currently evaluating several deployment methods—updating trebuchet, modifying scap, and wrapping ansible.

We have done work on both a scap update and adding a wrapper for ansible, and we're currently still in the process of evaluation.

A few days ago at the last deployment cabal meeting we began the discussion of configuration management and the role of deployment tooling surrounding configuration.

So is releng settled on using ansible for deploys as @GWicke's comments in https://gerrit.wikimedia.org/r/#/c/229306/ seem to suggest?

For the record, I did not say that releng had 'settled' on anything, only that they are working on Ansible:

From what I hear through Marko (attendance is limited), Tyler is currently working on expanding on our Ansible work.

Context is paramount in this conversation, however.

Tyler (& co) is also expanding on scap at the same time. :) See: https://gerrit.wikimedia.org/r/#/c/224374/

@greg, I think the amount of confusion about what the cabal is or is not doing illustrates that there is perhaps not enough communication and collaboration around the deployment system efforts. How about getting everybody at the same table and talking about the status quo, next steps and timelines?

I disagree. The amount of confusion is caused by others assuming things and pushing forward based on those assumptions (ie: this task). If, instead, others would ask the deployment working group direct questions (as @Joe did above) then there wouldn't be as much confusion.

The current work of the deployment working group (cabal) is to fill out this evaluation spreadsheet: https://docs.google.com/spreadsheets/d/1MlEsFxrLvdZdV_G82WEAIvBXr7ArO7nCEKaFClHhJEw/edit#gid=0
(you'll notice that the ansible sheet isn't filled out yet)

To be clear: please don't assume any specific outcome of the deployment working group at this point; no decision has been made on ansible vs scap vs trebuchet.

If, instead, others would ask the deployment working group direct questions (as @Joe did above) then there wouldn't be as much confusion.

This seems to imply that nobody asked, which I don't think is true. I still think that a meeting would be helpful to get everybody on the same page.

Thanks for giving a link to the spreadsheet. I see that about half the services requirements (see T93428) have not made it into your spreadsheet. Is this just an oversight, or something you intend to address? What is the timeline for having a workable solution that would address our requirements?

Your requirements certainly informed the list of requirements.

There was a discussion and requirements-gathering period last quarter during which everyone in the deployment cabal (with @mobrovac representing the interests of the services team) agreed on a list of requirements that were laid out here: https://www.mediawiki.org/wiki/Deployment_tooling/Future#Future_requirements

I feel that the specific requirements mentioned in that ticket are all covered, largely by the explicit requirements or by discussion during team meetings.

  • rolling deploys / config changes

Rolling deploys are specifically mentioned

  • needs to be reasonably easy to set up, configure, use and test for developers

While this is not an explicit end-goal for deployment tooling, it is considered in design and is consistently discussed in meetings.

  • integration with build process, CI and staging

This has been omitted because it is somewhat orthogonal to deployment tooling. A deployment tool should certainly be scriptable and automate easily (although Trebuchet in beta cluster is certainly a good example of why this is important).

  • minimum privilege operation

Explicitly mentioned in goals

  • (eventually) consistent deploys

This item is somewhat opaque, but "Command and control mechanism that allows for easy aborts, provides needed feedback without swallowing errors", "Rolling deploys and the ability to configure the size of the initial deploy pool", and "Rollback mechanism for failed health checks" seem to capture this item.

The explicit end-goal of this quarter is https://phabricator.wikimedia.org/T102667 . We're on-track towards meeting that goal given the explicit, previously agreed-upon requirements we laid out last quarter. Next meeting, the topic will certainly be a specific timeline for decision making.

What the cabal has accomplished over this past month is work on both scap and ansible in an attempt to evaluate each tool against the previously agreed-upon criteria.

The work on scap is to make it do deploys in a similar way to Trebuchet (using the repo_config info, but in a file instead of a salt pillar) while also doing service restarts and post-deploy port-checks (see: https://gerrit.wikimedia.org/r/#/c/224374/) .

Working from what @GWicke has done with ansible we created a small python wrapper for ansible which I demoed at the meeting on Monday—when there wasn't really much of a quorum at the cabal meeting (see [the horribly named] https://github.com/thcipriani/scale). The idea here is that there is a current solution implemented in ansible that works for RESTBase deploys; however, that solution put succinctly and applied to other repositories is: learn ansible and write some playbooks. Ideally, there would be a more general wrapper for ansible that allows folks to take the repo_config out of the saltstack pillar and put it into a yaml file without other modifications. This gets around the problem of having to write an entirely new deploy playbook for every tool (a new collection of deploy playbooks that would likely have to be maintained by ops since it's running stuff on the servers). This also gets at the next problem of how to not have 3 deployment tools forever.

So that's the past, present, and future of The Deployment Cabal™.

ANYWAY, having said all that, I feel like what the deployment cabal has (or has not) decided in this case has very little bearing in this instance. It seems that the discussion should be about whether or not to move config for RESTBase out of puppet—while this will inform the design of the deployment tool (since @mobrovac emphasized the importance of config change deploys during the last meeting) I don't feel the deploy tool should drive this decision.

Thanks for giving a link to the spreadsheet. I see that about half the services requirements (see T93428) have not made it into your spreadsheet. Is this just an oversight, or something you intend to address? What is the timeline for having a workable solution that would address our requirements?

Your requirements certainly informed the list of requirements.

There was a discussion and requirements-gathering period last quarter during which everyone in the deployment cabal (with @mobrovac representing the interests of the services team) agreed on a list of requirements that were laid out here: https://www.mediawiki.org/wiki/Deployment_tooling/Future#Future_requirements

Our requirements haven't changed since March, and given that you didn't propose any changes I'm assuming that you are aiming to address them.

I feel that the specific requirements mentioned in that ticket are all covered, largely by the explicit requirements or by discussion during team meetings.

I hope to be wrong about this, but my fear is that things that aren't explicitly checked are easily forgotten, leading to an outcome that would not meet our requirements. By making things as explicit as possible, we should have a better chance of finding the magic solution that works for all of us.

  • integration with build process, CI and staging

This has been omitted because it is somewhat orthogonal to deployment tooling. A deployment tool should certainly be scriptable and automate easily (although Trebuchet in beta cluster is certainly a good example of why this is important).

Fair enough. Any deploy system that is reliably scriptable and supports a variety of deploy methods including artifact-based ones (not just the code) should qualify for this. Maybe worth mentioning artifact deploy support (jars, tars etc) explicitly?

  • (eventually) consistent deploys

This item is somewhat opaque, but "Command and control mechanism that allows for easy aborts, provides needed feedback without swallowing errors", "Rolling deploys and the ability to configure the size of the initial deploy pool", and "Rollback mechanism for failed health checks" seem to capture this item.

Eventually consistent deploys is about nodes that missed a deploy catching up to the desired state, and not starting up in an undefined state. This does not seem to be covered.

The explicit end-goal of this quarter is https://phabricator.wikimedia.org/T102667 . We're on-track towards meeting that goal given the explicit, previously agreed-upon requirements we laid out last quarter.

While a great start, those requirements are really the bare minimum. In practice, they represent no tangible progress from the RESTBase deployment status quo in June.

To me, that is completely okay if we find a more general or more powerful solution that will allow us to address the next challenges. I brought up some of those from the services perspective in https://phabricator.wikimedia.org/T103344#1387738, partly to make sure that we have some of those challenges on the radar. This task is about a possible solution to the first of those items, namely config deployments. I would appreciate hearing your thoughts and plans on the actual config deployment mechanics, perhaps in a reply on the task?

Ideally, there would be a more general wrapper for ansible that allows folks to take the repo_config out of the saltstack pillar and put it into a yaml file without other modifications. This gets around the problem of having to write an entirely new deploy playbook for every tool (a new collection of deploy playbooks that would likely have to be maintained by ops since it's running stuff on the servers). This also gets at the next problem of how to not have 3 deployment tools forever.

You could also parametrize a regular playbook, so that it can be used for all 'simple' services of a given class that have no particular config or health check needs.
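
As a hedged sketch of that idea, a single generic playbook could take the service-specific values as variables. The variable names (service_name, repo_url, deploy_dir, revision) and the convention that the inventory group is named after the service are assumptions for illustration, not an existing convention.

```yaml
# Hypothetical generic playbook for 'simple' services. Per-service values are
# passed in as variables, e.g.:
#   ansible-playbook deploy.yml -e service_name=myservice -e revision=abc123
- hosts: "{{ service_name }}"      # assumes an inventory group named after the service
  serial: 1                        # rolling deploy, one node at a time
  tasks:
    - name: check out the requested revision
      git:
        repo: "{{ repo_url }}"
        dest: "{{ deploy_dir }}"
        version: "{{ revision | default('master') }}"

    - name: restart the service
      service:
        name: "{{ service_name }}"
        state: restarted
```

Services with special config or health check needs would still get their own playbook; only the common case is shared.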

GWicke renamed this task from [Discussion] Move restbase config to Ansible? to [Discussion] Move restbase config to Ansible (or $deploy_system in general)?. (Aug 6 2015, 11:17 PM)

I hope to be wrong about this, but my fear is that things that aren't explicitly checked are easily forgotten, leading to an outcome that would not meet our requirements. By making things as explicit as possible, we should have a better chance of finding the magic solution that works for all of us.

I have a note on the next meeting agenda to revisit the spreadsheet and make any appropriate adjustments in addition to the note to create a more explicit timeline. Also, forgot to mention, all meeting notes are posted publicly at https://www.mediawiki.org/wiki/Deployment_tooling/Cabal

Eventually consistent deploys is about nodes that missed a deploy catching up to the desired state, and not starting up in an undefined state. This does not seem to be covered.

We have discussed this in our meetings. I maintain that having a single source of truth for the correct deployed revision, plus the ability to deploy a revision to a single node, makes it possible to expand the tooling to cover this use case.

While a great start, those requirements are really the bare minimum. In practice, they represent no tangible progress from the RESTBase deployment status quo in June.

I think RESTBase likely had most of this in some way, shape, or form in April. We're trying to add some abstractions to it, keep it generalized so we don't end up further fragmenting the deployment system landscape, find the best approach, and let the review process run its course.

I would appreciate hearing your thoughts and plans on the actual config deployment mechanics, perhaps in a reply on the task?

It depends on where the config lives. If the config remains in puppet, we can figure out a general way to have a puppet run be part of the deploy process (like a pre-deploy task that runs: puppet agent --enable && /usr/local/sbin/puppet-run).

I understand that having a definitive answer on whether or not to use ansible makes this ticket easier to resolve, but I don't have that info yet. I don't want that decision to get made without the input of the team who's been working on it.

a general way to have a puppet run be part of the deploy process

As documented in the task description, you'd need to disable puppet before the deploy on all nodes, then automatically merge and sync the right puppet patch, and finally re-enable and run puppet in a rolling fashion, in coordination with the code deploy. In the roll-back case, you'd need to automatically roll back the puppet commit as well. I'm not saying that it's not possible, but it does not sound trivial to me.

Also, it seems that this would suffer from the same testing issues that puppet suffers from, which would miss our 'reasonably easy to test' requirement.

I understand that having a definitive answer on whether or not to use ansible makes this ticket easier to resolve

I don't think that's necessarily the case. As long as you are committing to provide equivalent functionality in the future deployment system, we should be able to adopt whatever you choose in the future. In the meantime, we can reduce the risk of deployment-induced outages by more thoroughly testing and coordinating config deploys. We would also like to follow Faidon's good advice in T92636, as well as the example of MediaWiki, Parsoid and others in maintaining the config outside of puppet.

a general way to have a puppet run be part of the deploy process

As documented in the task description, you'd need to disable puppet before the deploy on all nodes, then automatically merge and sync the right puppet patch, and finally re-enable and run puppet in a rolling fashion, in coordination with the code deploy. In the roll-back case, you'd need to automatically roll back the puppet commit as well. I'm not saying that it's not possible, but it does not sound trivial to me.

Also, it seems that this would suffer from the same testing issues that puppet suffers from, which would miss our 'reasonably easy to test' requirement.

One approach that I've been thinking about a lot lately would be to keep some puppet modules in a separate repo outside of operations/puppet. These would be puppet configs specific to the services, and we could apply them with puppet apply instead of a centralized puppetmaster. This would enable the deployment tooling to explicitly trigger a puppet run at the right time and in a rolling fashion along with the code updates.

The advantages of doing it this way are fairly straightforward:

  • Reuse existing puppet code and our significant amount of shared puppet knowledge and experience.
  • Minimal effort, no surprises, no learning curve. puppet apply is slightly different from puppet agent but the differences are minimal, even inconsequential for most simple use-cases.

And last, but certainly not least:

  • Atomic updates and rollbacks. Because the config can be kept in the same repo as the code, it's easy to ensure that the two are in sync with each deployment or rollback.

One approach that I've been thinking about a lot lately would be to keep some puppet modules in a separate repo outside of operations/puppet
...

I've been thinking of using this approach for phabricator configuration, mostly to sidestep the need for a +2 from ops every time I touch phabricator configs.

@mmodell, would this get along with a running puppet agent for general system configuration?

@GWicke: yes it would, as long as the rules applied by puppet apply don't try to change something that is managed by puppet agent - in that case you could have them fighting over a setting, each one changing it back and forth each time they run.

So, to summarize the discussion:

  • ops have not clearly stated whether they prefer service configs in puppet (recent comments) or outside of it (T92636)
  • release engineering is thinking about config deploys as well, but is waiting for ops to signal what they want
  • services would like to deploy configs along with the code, and there is working code

How can we move this forward?

It is perhaps worth pointing out that, since the deployment group will be working on integrating config deployment into the next-gen deployment tooling, we'd effectively be moving it to our Ansible repository only temporarily.

Yup:

I understand that having a definitive answer on whether or not to use ansible makes this ticket easier to resolve

I don't think that's necessarily the case. As long as you are committing to provide equivalent functionality in the future deployment system, we should be able to adopt whatever you choose in the future. In the meantime, we can reduce the risk of deployment-induced outages by more thoroughly testing and coordinating config deploys. We would also like to follow Faidon's good advice in T92636, as well as the example of MediaWiki, Parsoid and others in maintaining the config outside of puppet.

  • ops have not clearly stated whether they prefer service configs in puppet (recent comments) or outside of it (T92636)

I think a mediawiki-config kind of deal works great -- I cannot imagine it being similarly effective if the MediaWiki config was in puppet. Relying on puppet to deploy your config quickly and reliably (without being affected by unrelated puppet errors in the rest of the tree) is not a great strategy. Plus, there's the whole access argument: with Puppet, there is a higher barrier to entry, which artificially limits the number of people that can be in a deployers group.

On the other hand, I don't mind having configuration files within puppet for various pieces of software that are perhaps small and not updated very often (e.g. things outside the regular deploy train). So, that's OK too.

As a counter-example to mediawiki-config, I think folding apache-config into puppet was in fact a positive move that has worked out well, as it's better integrated with the rest of our configuration and the system itself.

So all in all, I think the best would be to evaluate on a case-by-case basis and apply common sense rather than having strict rules such as "{no,all} services config files in puppet" (e.g. if the services team is frequently blocked by ops to do their work and ops is adding little to no value in reviewing such commits, it's a very good indication that this is an unnecessary hurdle and we should explore alternatives to make the workflow smoother and less obtrusive).

  • services would like to deploy configs along with the code, and there is working code

All that said, ^^^ that is not particularly appealing to me as an argument. You've unilaterally made the decision of switching to a different way of deploying your code (despite explicit objections over it, including an unmerged Gerrit changeset) and now you want to expand its usage to cover other use cases as well, partially based on the argument that it's already "there" and "working" and the fact that it will be "temporary" as @mobrovac said. I'm not very fond of this strategy and frankly, I'm not willing to support it.

In technical terms, I can't see how we can merge https://gerrit.wikimedia.org/r/#/c/229306/ ("This patch disables config.yaml deploys in puppet, so that the RESTBase configuration file can be fully controlled by the deploy system.") when the "deploy system" change referenced there is https://gerrit.wikimedia.org/r/#/c/219253/ and that commit remains unmerged. Those two patches have an implicit dependency and the fact that you've worked around the latter by yourself and despite our objections isn't an excuse that satisfies that dependency.

Faidon, thanks for the clarification on puppet vs. other options.

that is not particularly appealing to me as an argument

It is not intended to be an argument. Please see the task summary for some background on why we need a way to deploy configs. I'd be curious to hear your thoughts on how to address those issues as well.

As you know, we have been looking for a sane deployment system for a long time now (it's been years), and yet all we get to use is trebuchet or shell loops. Given those choices, we found a saner way of doing shell loops with Ansible, and have used that since.

when the "deploy system" change referenced there is https://gerrit.wikimedia.org/r/#/c/219253/ and that commit remains unmerged

That patch only disables trebuchet and avoids confusion from old code being auto-deployed. Originally, trebuchet also started up the old code, but we have since made sure that the code on tin will not start up. Note that both @thcipriani and @mobrovac support merging this.

Given the long wait already, I think it's fair to ask release engineering and ops to come up with a firm deadline for providing a system that satisfies our requirements (see T93428) fairly soon, or stop blocking our use of Ansible until that system materializes.

@GWicke: I think I can speak for Release-Engineering-Team here: We do not have any desire to block ansible, or any other sane solution that works. We did come up with one fairly big concern about ansible during our testing: Essentially ansible requires full root sudo capability on the target machine. There is no way to scope it to only use sudo for a few commands. That prevents us from implementing any kind of "minimum necessary" security for deployers. If it weren't for that, we would probably be able to support ansible without reservations, however, I'm not sure that full root sudo for all deployers is a good idea. It might work for specific servers/services but not in general.

As for commitments, we already committed to getting something working this quarter. We are not there yet, due mostly to taking a lot of time evaluating various options, including ansible. We set a deadline for choosing a solution next Monday and should still be on track for having something working before long.

I totally understand the frustration of being blocked by other teams, and we certainly don't want to put up obstacles to getting things done. Hopefully we can get this figured out sooner rather than later.

Essentially ansible requires full root sudo capability on the target machine.

That depends on how permissions are set up. By default, Ansible does not use or require sudo at all, and it can use sudo per-command if needed, for example to restart a service managed by systemd. If we used a shared deploy user per service that owned the files and had restricted sudo access to restart the service, then Ansible would have no trouble running everything but the service restart as that user.

FWIW, this hasn't matched my experience using the mwdeploy user with the ansible service module on staging-test-tin to restart the service on staging-restbase01 via ansible. Even going so far as to change /etc/sudoers.d/mwdeploy to include:

mwdeploy ALL = (root) NOPASSWD: /usr/sbin/service *
mwdeploy ALL = (root) NOPASSWD: /bin/systemctl *

It seems like the workaround here would be to use the ansible raw module to pass explicit sudo service...

There is an issue on github that tracks this: https://github.com/ansible/ansible/issues/5712
Also, I made a gif to demo the issue in a simplified way: http://tyler.zone/ansible-restart-permissions.gif

@thcipriani: I see, that is indeed a bit annoying. Perhaps not the end of the world for the restart use case, but definitely not as clean as I had expected.

A key quote from the github issue:

"If you run with -vvvv you will see exactly what ansible executes, since it is using temp files it will be pretty hard to come up with an acceptable sudo pattern, most ansible users using sudo, have ALL."

So it appears that this isn't really fixable and unfortunately detracts from ansible's otherwise fairly elegant architecture.

Yeah, it feels a bit dirty. However, I think it's not the end of the world to have the restart job be

- name: restart restbase
  raw: sudo systemctl restart restbase

instead of the current

- name: restart restbase
  service: name=restbase state=restarted
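
Put in context, the workaround could be sketched as a play that runs as an unprivileged deploy user and only escalates for the restart command. The sudoers rule shown in the comment, narrowed to a single command, is an assumption for illustration and is not the rule that was tested above.

```yaml
# Hypothetical sketch; the narrowed sudoers rule is an assumption.
# /etc/sudoers.d/mwdeploy on the target might contain:
#   mwdeploy ALL = (root) NOPASSWD: /bin/systemctl restart restbase
- hosts: restbase
  remote_user: mwdeploy            # unprivileged user that owns the deployed files
  serial: 1
  tasks:
    - name: restart restbase
      # raw sends the command line over SSH unchanged instead of copying and
      # running a module via temp files, so a fixed sudoers pattern can match it
      raw: sudo /bin/systemctl restart restbase
```

The trade-off is that raw gives up the idempotence and change reporting of the service module, which matters little for an unconditional restart.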

Change 229306 abandoned by Faidon Liambotis:
Disable RESTBase config.yaml deploys in puppet

Reason:
Obsolete now.

https://gerrit.wikimedia.org/r/229306

Pchelolo claimed this task.

After moving to scap3 this ticket is obsolete, resolving.