
Replace salt on integration and deployment-prep projects
Closed, ResolvedPublic

Description

As a side effect of sunsetting salt, Beta-Cluster-Infrastructure and Continuous-Integration-Infrastructure need a replacement, since we don't have access to the WMCS salt master.

For both beta and CI, we use salt for mass command execution: checking the state of instances, manually upgrading packages, mass cleanups, or deploying scripts.

The salt master instances are:

  • deployment-salt02.deployment-prep.eqiad.wmflabs
  • integration-saltmaster.integration.eqiad.wmflabs

Salt is provisioned on the instances through the puppet.git manifests, with a hiera key that points each instance at the relevant master.

Since salt is being removed, we need either:

  • cumin master (if there is a puppet class for standalone cumin master in labs)
  • cluster shell (apparently used by tools labs)

Event Timeline

Beta, CI and other WMCS VPS projects are not environments that either TechOps or WMCS operate, and as such we hadn't incorporated them into our plans for the Salt deprecation (which is also why they are not listed in our goals). To be honest, I wasn't even aware of this use of Salt, but even if I had known about it, I'm not sure how we could have reasonably done anything about it other than give you a heads-up, given our unfamiliarity with this environment. Due to the dependency on Trebuchet, this was a quarterly goal that was planned and coordinated with Release-Engineering-Team, so I don't think this was a surprise to you regardless? I'm being a little defensive because I see that you made this a subtask of T164780, tagged this as Goal and SRE etc., so I guess you disagree and/or this may be all a surprise to you after all? If not, then feel free to ignore this whole paragraph :)

I still don't consider this a blocker to our goal, nor something we (SRE) would need to do. (That said, if you decide to go the way of Cumin, we'd be happy to help of course :). There isn't much time left until the end of the quarter, so if you don't manage to get rid of Salt by then and want to retain the capability of mass execution, you will probably have to cherry-pick reverts in your puppetmaster: the plan is to remove all Salt-related code from operations/puppet and, before that, to push code that will uninstall the minion and clean up everything Salt-related across the fleet. The target date for all this is the end of the quarter (September 30th) and it seems that this is still on track.

I don't really feel like nitpicking over projects or dependencies, but I'm pretty much in agreement with @faidon here and I definitely see our path forward....

This task is correct in that we'll want to replace salt with cumin in these places. At the same time, it clearly isn't going to block ops' goal in production -- that absolutely should move forward as planned. @Volans kindly offered on IRC a bit ago to help us with the migration...after the production work is done :)

https://github.com/bd808/wikimedia-cloud-vps-hostgroup-generator provides an out-of-the-box solution for using clush from any computer to target commands to any Cloud VPS project. Patches to https://github.com/bd808/wikimedia-cloud-vps-hostgroup-generator/blob/master/conf/classifiers.yaml are welcome to provide better targeting for projects that want to use this tool.

For the record, the choice of tags is automatic by the "create subtask" menu option. I don't think @hashar explicitly made those choices.

Thanks @bd808 for the other options.

Leaving ops-software-dev as this is about cumin (though that may not be what we use in the end): is there a better cumin-only project y'all use?

@greg no, that's the right one, cumin is an additional hashtag of this one ;) Thanks

I have filed this task for what it is: replace salt on integration and beta. There is no evilness intended!

Surely I would have appreciated a drop-in replacement for role::salt::masters::labs::project_master since that has been used for ages and is in the puppet.git repo. Ideally that would have been caught up in the migration plan.

Cumin is most probably puppetized in a state-of-the-art way, so I guess it is going to be straightforward to deploy on beta / integration. One just needs time to figure out the classes to apply, the hiera variables to change, and maybe a bit of cruft to set the master up as a master.

Change 380760 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/cumin@master] OpenStack backend: set default query params in config

https://gerrit.wikimedia.org/r/380760

Change 380760 merged by jenkins-bot:
[operations/software/cumin@master] OpenStack backend: set default query params in config

https://gerrit.wikimedia.org/r/380760

Change 380947 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] cumin (WMCS): allow to setup cumin in a project

https://gerrit.wikimedia.org/r/380947

Volans moved this task from In Progress to In Code Review on the SRE-tools board.

Change 380947 merged by Volans:
[operations/puppet@production] cumin (WMCS): allow to setup cumin in a project

https://gerrit.wikimedia.org/r/380947

@hashar want to do the honors of being the first tester? 😉
https://wikitech.wikimedia.org/wiki/Help:Cumin_master

That looks great! I was expecting it to be straightforward (apply puppet class + set hiera), and with that step-by-step tutorial it sounds like it is going to be a walk in the park! :-D

antoine-approve

Change 381073 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] prometheus: force ferm dns resolution to Ipv4

https://gerrit.wikimedia.org/r/381073

It took me 14 minutes from the time I opened https://wikitech.wikimedia.org/wiki/Keyholder until I got some cumin command more or less working. That is quite an achievement, my accolades to anyone who got involved in integrating cumin with OpenStack, and of course huge kudos to @Volans for the wikitech doc.

deployment-prep had some oddities:

  • iptables rules were not being updated, because ferm was still installed while no puppet class was applying it any more. That resulted in obsolete rules. I have purged ferm on those hosts.
  • the ferm rule for prometheus (poke @fgiunchedi) forces AAAA resolution. However, labs does not have any IPv6 support, which caused ferm to fail. Hacked around by switching to A resolution ( https://gerrit.wikimedia.org/r/381073 )
  • deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs is not reachable. Puppet is disabled, I guess @elukey or @Ottomata would know.

Summary is:

$ sudo cumin -o txt '*' 'true' 
1.4% (1/70) of nodes failed to execute command 'true': deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs
_____FORMATTED_OUTPUT_____
deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs: Permission denied (publickey).
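
For anyone wanting to check such a run from a script, here is a minimal Python sketch that pulls the counts out of a failure summary line shaped like the one above and recomputes the failure rate from them (1 of 70 is about 1.4%). The regular expression and the names `FailureSummary` and `parse_failure_summary` are assumptions made up for this illustration, based only on the single line pasted here; they are not part of cumin, and the host list is kept as a raw string rather than expanded.

```python
import re
from typing import NamedTuple, Optional


class FailureSummary(NamedTuple):
    """Counts taken from one 'of nodes failed to execute command' line."""

    failed: int
    total: int
    command: str
    hosts: str  # raw host list as printed, not expanded

    @property
    def percent(self) -> float:
        # Recompute from the counts instead of trusting the printed figure.
        if self.total == 0:
            return 0.0
        return round(100.0 * self.failed / self.total, 1)


# Assumed line shape: "<pct>% (<failed>/<total>) of nodes failed to
# execute command '<command>': <hosts>"
_SUMMARY_RE = re.compile(
    r"\((?P<failed>\d+)/(?P<total>\d+)\) of nodes failed to execute command "
    r"'(?P<command>.*)': (?P<hosts>.+)$"
)


def parse_failure_summary(line: str) -> Optional[FailureSummary]:
    """Return the parsed summary, or None if the line does not match."""
    match = _SUMMARY_RE.search(line.strip())
    if match is None:
        return None
    return FailureSummary(
        failed=int(match.group("failed")),
        total=int(match.group("total")),
        command=match.group("command"),
        hosts=match.group("hosts").strip(),
    )


if __name__ == "__main__":
    sample = (
        "(1/70) of nodes failed to execute command 'true': "
        "deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs"
    )
    summary = parse_failure_summary(sample)
    if summary is not None:
        print(summary.failed, summary.total, summary.percent, summary.hosts)
```

Lines that do not match the assumed shape simply return `None`, so the helper can be run over the whole output without special-casing the other lines.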

So for Beta-Cluster-Infrastructure that is mostly done. Gotta update documentation here and there and write some announcement.

Will tackle Continuous-Integration-Infrastructure next.

NOTE: gotta recreate the beta cluster instance to use the name deployment-cumin (hostnames have to be unique cluster wide)

I have provisioned integration-cumin and deployment-cumin. I have a few scripts to switch over to cumin, then I guess we can drop salt from both projects and announce the change \o/

Change 381129 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] fab: migrate from salt to cumin

https://gerrit.wikimedia.org/r/381129

I created deployment-cumin and integration-cumin. Gotta update some documentation here and there + write an announcement.

I guess we can drop salt tomorrow morning first thing \o/

Mentioned in SAL (#wikimedia-releng) [2017-09-28T08:31:10Z] <hashar> Removing salt configuration from integration and deployment-prep projects. Replaced by cumin. - T176314

Mentioned in SAL (#wikimedia-releng) [2017-09-28T08:39:07Z] <hashar> Deleted integration-saltmaster and deployment-salt02 . Replaced by integration-cumin and deployment-cumin - T176314

@elukey has run puppet on deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs. 69/69 instances are reachable!

Change 381129 merged by jenkins-bot:
[integration/config@master] fab: migrate from salt to cumin

https://gerrit.wikimedia.org/r/381129

Salt has been purged entirely!!

https://wikitech.wikimedia.org/wiki/Help:Cumin_master was instrumental in handling the whole migration in less than half a day. My special kudos to everyone involved and thank you for your patience!

I have made a short announcement on the QA list: https://lists.wikimedia.org/pipermail/qa/2017-September/002658.html

Hello,

On deployment-prep and integration, I have removed Salt entirely in
favor of Cumin.

It is mostly similar, but much better at keeping the inventory and
making sure you reach all the instances.  Cumin comes with an elaborate
query scheme.

One requires sudo access to be able to use it since commands are
executed on instances as the root user.

Instances:

- deployment-cumin.deployment-prep.eqiad.wmflabs
- integration-cumin.integration.eqiad.wmflabs

Examples:

 sudo cumin '*' 'true'
 sudo cumin 'name:deployment-mediawiki' 'true'

The documentation is quite nice, you certainly want to read the part
about selecting hosts:

https://wikitech.wikimedia.org/wiki/Cumin
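
For scripts that shell out to cumin, a small Python sketch of building the same command lines as in the examples above might look like the following. It only assembles the argument list and does not run anything; the helper names `build_cumin_argv` and `to_shell` are hypothetical, and the only option used is `-o`, as seen in the `sudo cumin -o txt '*' 'true'` invocation earlier in this task.

```python
import shlex
from typing import List, Optional


def build_cumin_argv(
    query: str,
    command: str,
    output_format: Optional[str] = None,
    use_sudo: bool = True,
) -> List[str]:
    """Build the argument list for a cumin run (hypothetical helper)."""
    argv = ["sudo"] if use_sudo else []
    argv.append("cumin")
    if output_format is not None:
        # "-o txt" is the only option taken from the task itself.
        argv += ["-o", output_format]
    # The host query and the command are passed as two separate arguments.
    argv += [query, command]
    return argv


def to_shell(argv: List[str]) -> str:
    """Render the argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    print(to_shell(build_cumin_argv("*", "true")))
    print(to_shell(build_cumin_argv("name:deployment-mediawiki", "true")))
    print(to_shell(build_cumin_argv("*", "true", output_format="txt")))
```

Passing the list to something like `subprocess.run(argv)` avoids shell quoting issues with queries such as `'*'`; `to_shell` is only there to print what would be run.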

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

Change 381073 restored by Krinkle:
prometheus: make ferm DNS record type configurable

Reason:
Restoring because a version of this is still live on beta.

https://gerrit.wikimedia.org/r/381073

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

Reason:
That is probably outdated, I lack bandwidth to rebase it properly.

https://gerrit.wikimedia.org/r/381073

Change 381073 restored by Dzahn:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073

Change 381073 abandoned by Hashar:
prometheus: make ferm DNS record type configurable

https://gerrit.wikimedia.org/r/381073