Get ops feedback regarding the use of SSH for deployment system control channel.
Closed, ResolvedPublic

Description

RelEng is working on a new deployment system, initially focusing on RESTBase T102667: Create or improve the RESTBase deploy method. This is a team goal for the next quarter. It will be good to move away from the current split method of using trebuchet + ansible. Additionally, it's a complex deployment that should utilize many of the new features we want to develop.

This task exists to solicit feedback, especially from the Operations team, regarding our plan to use SSH for the control channel. We evaluated a lot of options, and the least controversial choice seems to be SSH. Salt could work, but in its current state we don't feel that it is reliable enough to depend on for this mission-critical system.

SSH is currently used for MediaWiki deploy triggering and remote execution (scap)—the overhead of SSH is not currently a pain-point for MediaWiki deploys.

Using SSH for a RESTBase deploy will likely require some sudoers tweaks.

Before work is started on the new deployment tool, I want to make sure moving away from a salt-backed deploy towards an ssh-backed deploy doesn't interfere with any long-term ops plans.

Event Timeline

thcipriani raised the priority of this task from to Needs Triage.
thcipriani updated the task description. (Show Details)
thcipriani added subscribers: demon, fgiunchedi, dduvall and 6 others.
Restricted Application added a subscriber: Matanya. · View Herald Transcript · Jun 16 2015, 7:51 PM

Is the idea to make a deployment system aimed at RESTBase...or is this a general deployment system of which RESTBase will be the first use case?

Legoktm updated the task description. (Show Details) · Jun 16 2015, 7:55 PM
Legoktm set Security to None.

@chasemp the idea is to build a general deployment system (which is a few blocked tasks away, see T101023). RESTBase seemed like a good first-iteration candidate since they currently have a split process.

Using SSH for a RESTBase deploy will likely require some sudoers tweaks.

We are currently deploying using SSH, and the sudo rules on the hosts are already set up. The main thing missing is the ability to run this from a deploy host, rather than our laptops. For this, we'll need

  • a way to either use private deploy keys per deployer, or deploy keys per group
  • holes in the firewall to allow ssh access from the deploy host

It would also be great if the deploy host was running trusty or jessie (tin is still on precise IIRC).

Joe added a comment. · Jun 17 2015, 2:22 PM

Can I ask everyone to please state the problem and not the solution (which would be, in your idea, SSH)?

This seems to me like a classical XY problem, but maybe I'm wrong.

So, what do we want to be able to do with $deployment_system for RESTBase specifically?

@Joe, see T93428 for our requirements.

Joe added a comment. · Jun 17 2015, 2:28 PM

ok @GWicke, how is "ssh" a solution then?

Joe added a comment. · Jun 17 2015, 2:31 PM

To be even clearer, I'm specifically asking that this ticket be less generic/fuzzy. I think I have an idea of the needs of RESTBase (as they have been stated very clearly in the past).

I think how we connect to the machines to execute commands is a small part of a general-purpose deployment system. Otherwise, we're just building a shell script specific to RESTBase.

If this is the case, please do tell me and I'll unsubscribe from this ticket :)

@Joe, I'll let you have the discussion about the pros & cons of SSH.

I was just providing background on the current status of how we are using SSH, and what we'd like to improve about it.

Joe added a comment. · Jun 17 2015, 2:47 PM

So, of course it's possible to do as scap does; it works well for hundreds of machines, so it should also be good for a fraction of that. The security model for scap is also nice nowadays, with the keyholder mechanism that removed the need for SSH agent forwarding.
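
To make that concrete, here is a purely hypothetical sketch of the idea (the socket path below is made up, not the actual keyholder setup): deployers never hold the deploy key themselves, they just point SSH at a shared agent on the deploy host.

```python
import os
import subprocess

# Hypothetical path to a shared agent socket on the deploy host; deployers in
# the right group can authenticate through the agent without ever reading the
# private key and without forwarding their own personal agent.
SHARED_AGENT_SOCK = '/run/keyholder/agent.sock'


def ssh_as_deployer(host, command):
    """Run a remote command, authenticating via the shared agent."""
    env = dict(os.environ, SSH_AUTH_SOCK=SHARED_AGENT_SOCK)
    subprocess.check_call(['ssh', '-oBatchMode=yes', host, command], env=env)
```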

I don't really see a problem with using a scap-like system, but the features we'd like (rolling restarts with depool, the ability to release to a portion of a cluster, storing the state of the deploy and being able to resume it, maybe even the ability to start the deploy with a simple git-push) are all well beyond this discussion, I guess (and beyond scap's current abilities).

I clearly misinterpreted the ticket as a bigger undertaking than it is.

In my vision, a deploy system has 3 main layers:

  • a coordination layer, which is executed locally on the deployment server. We don't have this working well anywhere.
  • a transfer layer, or how to move the new code to the hosts. That is e.g. rsync for scap. I actually think this is the part scap gets right atm for 99% of possible uses.
  • an execution layer, for executing commands on the hosts before and after the transfer happens. This can surely be done with ssh, or salt (but if we want to use salt, we need to work on it), or any other tool.

To those we could add a "reporting layer" that allows people to follow the state from e.g. a web page.

I was a bit baffled by the fact that we started out thinking about the execution layer, and mistook this ticket for a more general one :)
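
For illustration only, here is a minimal sketch of how those layers could be separated behind interfaces; every name in it is hypothetical and not taken from scap or any existing tool.

```python
from abc import ABC, abstractmethod


class TransferLayer(ABC):
    """Moves the new code to the target hosts (rsync in scap's case)."""

    @abstractmethod
    def push(self, hosts, revision):
        ...


class ExecutionLayer(ABC):
    """Runs commands on the hosts before and after the transfer."""

    @abstractmethod
    def run(self, hosts, command):
        ...


class Coordinator:
    """Coordination layer, executed locally on the deployment server."""

    def __init__(self, transfer, execution, reporter=None):
        self.transfer = transfer
        self.execution = execution
        self.reporter = reporter  # optional "reporting layer"

    def deploy(self, hosts, revision):
        self.execution.run(hosts, 'pre-deploy checks')     # e.g. depool, stop
        self.transfer.push(hosts, revision)
        self.execution.run(hosts, 'post-deploy restart')   # e.g. restart, repool
        if self.reporter is not None:
            self.reporter.record(hosts, revision)          # e.g. update a web UI
```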

@Joe:

All of the things you mentioned about the deployment system are indeed on our agenda; this is just a tiny subtask to get feedback from the operations team about using SSH instead of salt as the command-and-control channel.

Essentially what we plan to build is a hybrid of git-deploy and scap plus additional features like rolling restarts, pybal depooling, etc.

See T102667 and its dependencies for the big picture; this is indeed a tiny subtask.
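
To make "rolling restarts, pybal depooling" a bit more concrete, here is a rough, hypothetical sketch of what that could look like over plain SSH; the host names and the depool/pool commands are placeholders, not the real pybal interface.

```python
import subprocess

HOSTS = ['restbase1001.example.wmnet', 'restbase1002.example.wmnet']  # placeholders


def ssh(host, command):
    """Run one command on a remote host over plain SSH (BatchMode: fail rather than prompt)."""
    subprocess.check_call(['ssh', '-oBatchMode=yes', host, command])


def rolling_restart(hosts):
    """Depool, restart and repool one host at a time so the service stays up."""
    for host in hosts:
        ssh(host, 'depool')                          # placeholder for the real depool step
        ssh(host, 'sudo service restbase restart')
        ssh(host, 'pool')                            # placeholder for the real repool step


if __name__ == '__main__':
    rolling_restart(HOSTS)
```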

@Joe: Speaking to your vision of a deploy system, I think that you have described something very close to what we (Release-Engineering-Team) are working on during the next quarter.

I would personally like it if we end up with:

  • A streamlined and vastly improved version of git-deploy for the coordination layer, focusing on improved usability, better reporting, handling various failure scenarios, and giving the deployer better visibility into the process without requiring a lot of error-prone manual steps.
  • BitTorrent or rsync or maybe even something else for the transfer layer. This should be somewhat modular and interact with the other layers through an abstract interface (see the sketch after this list). The justification is that we are likely to need BitTorrent for MediaWiki deploys in the not-too-distant future, and we might want a different transfer mechanism for services than what we use for MediaWiki. I think this part just needs to be flexible, and the rest of the system shouldn't care how the bits get where they need to go.
  • SSH or mcollective for control, i.e. what you refer to as the execution layer. This part needs to interact with the coordination layer and any reporting tools we build to monitor the process.
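
The sketch referenced above, showing how the transfer layer could stay pluggable behind one small interface; the class names, paths and commands are invented for illustration only.

```python
import subprocess
from abc import ABC, abstractmethod


class Transfer(ABC):
    """Abstract transfer layer: the rest of the tool only ever calls push()."""

    @abstractmethod
    def push(self, hosts):
        ...


class RsyncTransfer(Transfer):
    """Hypothetical backend using rsync from the deploy host."""

    def __init__(self, source_dir):
        self.source_dir = source_dir

    def push(self, hosts):
        for host in hosts:
            subprocess.check_call([
                'rsync', '-a', '--delete', self.source_dir + '/',
                '%s:%s/' % (host, self.source_dir)])


class TorrentTransfer(Transfer):
    """Stub for a BitTorrent-style backend; same interface, different mechanics."""

    def __init__(self, source_dir):
        self.source_dir = source_dir

    def push(self, hosts):
        raise NotImplementedError('sketch only')
```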
mmodell triaged this task as Medium priority. · Jun 18 2015, 1:33 AM
mmodell moved this task from To Triage to Next: Feature on the Deployments board.
mmodell renamed this task from "Use SSH as RESTBase deployment control mechanism" to "Get ops feedback regarding the use of SSH for deployment system control channel." · Jun 18 2015, 10:04 PM
mmodell updated the task description. (Show Details)

I can see SSH working; in that case we'll have to tackle keys: namely how many keys go onto which hosts, how to map human users to keys, and so on, with some spectrum in between (mostly a brain dump to get the ball rolling):

  1. a single key (à la mwdeploy), which needs to be able to operate on potentially all services via sudo; simple, but with a potentially high blast radius
  2. multiple keys (per service?), which limit the blast radius to the service in question, for example via SSH forced commands, but are more complex

SSH seems fine to me as well. How many services are we talking about here, and how volatile is the deployment process? I'd prefer per-service keys with ForceCommand.
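
As a hypothetical illustration of the per-service-key option: each service gets its own key, locked down to a single deploy command via an authorized_keys forced command (a Match block with ForceCommand in sshd_config would achieve a similar restriction). The key material, paths and command names below are made up.

```python
# Hypothetical generator for per-service authorized_keys entries; each key is
# restricted to one forced command, so a leaked key can only trigger that
# service's deploy action.
SERVICES = {
    'restbase': 'ssh-rsa AAAA...restbase...',   # placeholder public keys
    'cxserver': 'ssh-rsa AAAA...cxserver...',
}

OPTIONS = 'no-port-forwarding,no-agent-forwarding,no-X11-forwarding,no-pty'


def authorized_keys_entries(services):
    for name, pubkey in sorted(services.items()):
        command = '/usr/local/bin/%s-deploy' % name   # hypothetical wrapper script
        yield 'command="%s",%s %s %s-deployer' % (command, OPTIONS, pubkey, name)


if __name__ == '__main__':
    for entry in authorized_keys_entries(SERVICES):
        print(entry)
```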

Joe added a comment. · Jun 24 2015, 7:27 AM

I strongly oppose using mcollective, FWIW. I tried it in the past and it's way worse than salt in every conceivable way.

I am ok with using ssh, but if we think salt is not a viable tool, I'd strongly prefer that we ditch it completely.

And by the way, I think we should get rid of trebuchet/git-deploy as it lacks all the basic features I want from a deployment system, and it adds quite a few anti-features.

mobrovac added a comment. (Edited) · Jun 24 2015, 8:38 AM

How many services are we talking about here, and how volatile is the deployment process? I'd prefer per-service keys with ForceCommand.

This would encompass the currently-existing services on SCA (*oids, CXServer) and RESTBase as well as any other new services that might come along. The current target is to have a deployment system for RESTBase and later on adapt it for MW deploys as well.

I am ok with using ssh, but if we think salt is not a viable tool, I'd strongly prefer that we ditch it completely.

Wrt salt, I think we should really start (honestly) talking about whether to ditch it or not. I know that you guys use it extensively (despite the latest hiccups), but from the perspective of a deployer, the lack of debugging possibilities when something goes wrong is not tolerable (since finding anything out requires root on the deployment host). The cause may be the deployment process itself or salt acting up, but either way I'm completely stuck as a deployer when that happens. That should be eliminated for the benefit of us all. Which brings me to two (outrageous) possibilities:

  • give salt* sudo rights to deployers
  • ditch salt

And by the way, I think we should get rid of trebuchet/git-deploy as it lacks all the basic features I want from a deployment system, and it adds quite a few anti-features.

Yup, hence this and the related tickets.

I strongly oppose using mcollective, FWIW. I tried it in the past and it's way worse than salt in every conceivable way.

My experience with mcollective has been fairly positive, but I'm not in love with it. Really, mcollective and salt are both probably overkill for what we're actually trying to do. The simplest solution is probably best.

I am ok with using ssh, but if we think salt is not a viable tool, I'd strongly prefer that we ditch it completely.
And by the way, I think we should get rid of trebuchet/git-deploy as it lacks all the basic features I want from a deployment system, and it adds quite a few anti-features.

I agree completely. I think we should look for a viable replacement for all current uses of salt, carry over any features from git-deploy that we actually want, but ditch the tool itself. If operations is currently using salt simply to automate running arbitrary commands on multiple hosts, and that is the only other use case besides git-deploy, then what about this:

Do you have any experience with Fabric? (http://www.fabfile.org/)

From reading a little about it, it seems like it might be a good fit for managing SSH connections and creating utilities that automate deployments and formalize remote command-and-control connections. It really looks like it could serve the needs of ops as well as the deployment tooling.

Thoughts?
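
For reference, a minimal Fabric 1.x fabfile could look roughly like this; the host names and the restart command are placeholders, not our actual setup.

```python
# fabfile.py -- minimal Fabric 1.x sketch; run with e.g. `fab -P restart_restbase`
from fabric.api import env, sudo, task

env.hosts = [
    'restbase1001.example.wmnet',   # placeholder hostnames
    'restbase1002.example.wmnet',
]


@task
def restart_restbase():
    """Restart the service on every host in env.hosts (in parallel with -P)."""
    sudo('service restbase restart')
```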

GWicke moved this task from Backlog to Blocked / others on the RESTBase board. · Jun 29 2015, 5:23 PM
GWicke moved this task from Blocked / others to Under discussion on the RESTBase board.

I poked at Fabric a bit this morning.

Fabric uses paramiko, which doesn't appear to support a recent enough key exchange method to talk to our servers. Also, it looks like the only symmetric ciphers we agree on are aes128-ctr and aes256-ctr; however, we don't have any overlap in the message authentication (MAC) algorithms:

https://github.com/wikimedia/operations-puppet/blob/production/modules/ssh/templates/sshd_config.erb#L47-L62
vs
https://github.com/paramiko/paramiko/blob/master/paramiko/transport.py#L99-L101

It does look like there's a patch pending for this, FWIW: https://github.com/paramiko/paramiko/pull/356
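
For anyone who wants to reproduce the comparison: paramiko keeps the algorithms it offers in (private, version-dependent) class attributes on Transport, which can be printed and diffed against the sshd_config lists by hand.

```python
import paramiko

# These are private class attributes, so treat this purely as a diagnostic;
# they may change or be renamed between paramiko releases.
print('kex:     %s' % (paramiko.Transport._preferred_kex,))
print('ciphers: %s' % (paramiko.Transport._preferred_ciphers,))
print('macs:    %s' % (paramiko.Transport._preferred_macs,))
```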

mmodell closed this task as Resolved. · Aug 12 2015, 4:24 PM
mmodell claimed this task.
greg moved this task from INBOX to Done on the Release-Engineering-Team board. · Sep 8 2015, 8:44 PM