
Move salt master to separate host from puppet master
Closed, Resolved (Public)

Description

Puppet uses a lot of CPU on palladium. The salt master should live on its own host.

Event Timeline

ArielGlenn claimed this task.
ArielGlenn raised the priority of this task from to Medium.
ArielGlenn updated the task description. (Show Details)
ArielGlenn added projects: Salt, acl*sre-team.
ArielGlenn subscribed.

The work plan looks like this:

make sure jessie install looks good
add salt master role, copy over all minion keys
add master manually as secondary to one client, restart its minion, check commands to it
add second master everywhere one cluster at a time, restarting minions and verifying that primary and secondary masters can send commands
do some performance testing on new (secondary) master
swap primary and secondary masters everywhere
restart minions cluster by cluster, testing commands from primary and secondary masters
announce and allow testing for one week before we pull old master entry from configs

Should be able to start on this as soon as Robh hands over the new box (in a day or two).
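
For reference, the second-master step above boils down to getting both masters into the minion config. A minimal sketch of the setting we'd drop into /etc/salt/minion.d/ (emitted here from Python; hostnames and the mechanism are illustrative, in our case presumably templated out by puppet):

```
# Illustrative sketch only: the multimaster 'master' setting that ends up in
# the minion config; hostnames here are placeholders.
import yaml

minion_overrides = {
    # The existing master stays first; the new box is listed second, so
    # minions keep working even if they cannot reach it at startup.
    "master": ["palladium.eqiad.wmnet", "new-salt-master.eqiad.wmnet"],
}

print(yaml.safe_dump(minion_overrides, default_flow_style=False))
```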

Salt master role plus deployment role plus debdeploy role added. The master key is different from the old master's, so I need to look at that. Additionally, I wonder about having two masters both copy pillars over to the minions, as appears to happen for the deployment role.

The new host currently whines during puppet runs because it has no known minions. This is fine; we'll do minion testing a bit later.

Minion and master keys copied over from palladium; a test of one minion completed.

https://gerrit.wikimedia.org/r/#/c/252651/ adds neodymium as a secondary salt master to all hosts; tested on one client, which responds to commands from neodymium.

Where we are:

First off, Brandon added neodymium to the salt master exception in the router configs so that neodymium can communicate with all hosts, e.g. analytics*.

Second, testing is giving mixed results. Doing test.ping without batches (and with -v) lists a lot of hosts (> 400) that don't reply, even with timeouts of 20, 40, or 60 seconds. However, when I check the results of the job via the job id, the hosts have replied. So that's a change from palladium, but I'm still investigating why the responses aren't reported to the salt client properly.
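
For the record, "checking the results via the job id" means asking the master's job cache directly, which salt-run jobs.lookup_jid does from the shell. A minimal sketch of the same thing from Python (the jid below is made up):

```
# Ask the master's job cache for the returns of a past job, bypassing the
# CLI client's event handling entirely. Run this on the salt master.
import salt.config
import salt.runner

opts = salt.config.master_config("/etc/salt/master")
runner = salt.runner.RunnerClient(opts)

# The jid is made up; take the real one from the -v output of the salt command.
returns = runner.cmd("jobs.lookup_jid", ["20151118123456789012"])
print("%d minions actually returned" % len(returns))
```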

Third, I had expected to avoid the need for salt key acceptance on both masters, since neodymium is listed as a secondary master, meaning minions will try to connect to palladium first at startup and not die if they can't reach neodymium. Indeed that's part of the point of the standard multimaster setup in salt. However, for minions that have unaccepted keys on neodymium, it appears they hang waiting for acceptance and do not process salt requests from palladium. I'm investigating this further.

First results of debugging on neodymium: the salt CLI client stops receiving events well before its timeout and sits idle until the timeout is reached, at which point it runs find_job; but by then all minions have already completed the job, so that check comes up empty. Meanwhile I see the missing events streaming in via the event listener script.
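
The event listener script mentioned here is just a small consumer of the master event bus, roughly along these lines (a sketch assuming the default socket directory, not the exact script):

```
# Read the master's event bus and print every event as it arrives, so the
# timing can be compared against what the CLI client reports.
import time
import salt.utils.event

event_bus = salt.utils.event.MasterEvent("/var/run/salt/master")
while True:
    event = event_bus.get_event(wait=5, full=True)
    if event is None:
        continue  # nothing within the wait window, keep polling
    print(time.time(), event["tag"], event["data"])
```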

Well, I was misled by the timestamps in the events read and reported by the debugging script and by the command line client. Here's what happens.

Each minion auths and then returns its value; that's two events logged by the master. The event listener just reads events and spits them out to stdout, as a proper debugging tool would, so it receives each event from the master just about at the time it's sent. The command line client does not: it has a loop which gets events, and if there is no event it sleeps 1/100 of a second, so as not to spin, before trying again. But "no event" includes "there's an event but it's not the kind I want", i.e. auth events. And there are 1180 auth events, so that's already 11 seconds gone from that alone. So when I print the timestamp at which the command line client reads an event it wants (a test.ping return) and compare it to the timestamp in the event itself, I see the two drift steadily apart until the timeout is reached. The client then does a find_job, sees that no minions are still running the job, and gives up.
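
To make the above concrete, the problematic pattern looks roughly like this (an illustrative sketch, not salt's actual code):

```
# Illustrative sketch of the 2014.7-era client loop, not salt's actual code.
import time

def wait_for_returns(get_event, jid, timeout):
    """Collect job returns until the timeout; skipped events still cost a sleep."""
    deadline = time.time() + timeout
    returns = {}
    want = "salt/job/%s/ret" % jid
    while time.time() < deadline:
        event = get_event()  # may be None, an auth event, or a return we want
        if event is None or not event.get("tag", "").startswith(want):
            # "Not the event I want" is handled like "no event at all", so
            # ~1180 auth events mean ~1180 of these 10 ms sleeps: ~11 seconds
            # of the timeout gone before the returns are even looked at.
            time.sleep(0.01)
            continue
        returns[event["data"]["id"]] = event["data"]["return"]
    return returns
```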

This lovely issue is gone in 2015.8 as I look at the code, because the event handling methods have been refactored. I didn't see a bug about this so I'll go ahead and open one upstream for the 2014.7 release and see what they say. In the meantime I'll apply a local patch for testing purposes only on neodymium, so that I can continue performance testing.

After applying a patch locally, the above issue was fixed, but there were still 20 to 40 hosts, besides the network-unreachable ones, that failed to respond according to the salt command line client. The command run was "salt '*' -v -t 20 test.ping". The event listener script read responses from those servers correctly; the CLI client stopped seeing events at some point, then polled and got no events for another 10 seconds.

That smelled to me like some queue limit being set too low, and indeed it was. An undocumented master config setting, "pub_hwm", controls how many unread messages the master keeps on its pub sockets, the default being 1000; after that it drops them. So you can imagine that if a couple thousand auth and test.ping response messages come in within a couple seconds but the client takes a few seconds longer to wade through the first 1k of them, it loses events. I've changed that setting in https://gerrit.wikimedia.org/r/#/c/255526/
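
For anyone unfamiliar with the mechanism: this is standard ZeroMQ behavior, where a PUB socket silently drops messages for a subscriber once the send high-water mark is reached. A tiny pyzmq sketch of the knob in question (socket address and values illustrative):

```
# The master's publisher is a ZeroMQ PUB socket; once a subscriber's queue
# hits the send high-water mark, further messages to it are silently dropped.
import zmq

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, 1000)      # the old default salt used for pub_hwm
pub.bind("tcp://127.0.0.1:5599")      # address is illustrative

# A subscriber that falls more than ~1000 messages behind (say, a client
# wading through a flood of auth events) never sees the overflow; raising
# the HWM gives it headroom, which is what the pub_hwm change does.
```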

I should add that besides the minion key auth events emitted by the master, there are also two events per test.ping with tags in the old and new (new for this version of salt) format, so even more events to pile up in the queue.

Currently looking at why the minion auths on every request. It should not need to request the new master AES key unless the key has been rotated or the master or minion has been restarted. It turns out that the transport channel factory (now that there are RAET and ZMQ transports) instantiates ZeroMQChannel on each command, which sets up salt.crypt.SAuth, and that is the culprit. Looking into how we can work around that.

Fighting with a backport of the conversion of SAuth to a singleton, from fb747fa of the development branch. Tedious.
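
The general shape of what the singleton conversion buys us is something like the following; this is only an illustrative sketch of the idea, not the code from fb747fa or our backport:

```
# Illustrative sketch of the singleton idea only; the real change is fb747fa
# on the salt development branch and our backport of it.
class CachedAuth(object):
    """Keep one authenticated session per master instead of one per command."""

    _instances = {}

    @classmethod
    def instance(cls, opts):
        key = str(opts.get("master_uri") or opts.get("master"))
        if key not in cls._instances:
            cls._instances[key] = cls(opts)
        return cls._instances[key]

    def __init__(self, opts):
        self.opts = opts
        self.credentials = None  # populated by the (omitted) handshake with the master

# Anything that previously built a fresh auth object per command would instead
# call CachedAuth.instance(opts), so a test.ping no longer triggers a re-auth.
```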

Finally have what seems to be a working backport. Now I see no additional auths from the minion when doing a test ping from the master. Yay! Time to build a package for trusty and test it on a few prod hosts manually.

Faidon fixed up the network issue for the labvirt hosts so they are now reachable from neodymium. Requested a new project in gerrit for our salt builds.

The new repo is already available, yay! In the meantime, I looked into the initial delay of minions that don't respond to commands after having been 'idle' for a while; indeed there is the auth delay, probably due to the long maximum value set in the config for the overloaded palladium. Since neodymium does not have that problem, I've reduced the value to something much more suitable: https://gerrit.wikimedia.org/r/#/c/256235/

I was seeing some really troubling behavior: a minion not responding to a test.ping aimed at it alone, which made me think my changes had suddenly gone awry. But today it all looks fine. (Yes, I'm still concerned.) I made one more config change so that reconnections to the master after a lost connection take at most 10 seconds, instead of the much longer value we had for palladium.

I'm watching what happens after puppet runs and the minions are all restarted; if all looks well I'll get the code into the repo so packages for testing can be built.

Well I have just seen this behavior again and I don't know if it's specific to my changes to the minion code or not, so I'm investigating it. Slow going because the multimaster setup also impacts minion connections.

At least one instance of the 'minion doesn't see an incoming test.ping after a lapse of some hours' issue is apparently a problem with the master config option "ping_on_rotate", which looks to me like it's been broken since 2014.7.something, up to the current release. I'm opening a new task for this minion "sleep and wake up" issue because there are several other issues that may also cause it and I need to check them all (minion behavior with the secondary master, zmq connection issues, etc.).

After upgrading all but about 5 production hosts to the new packages with our patches, I now see this interesting but annoying behavior.

When I was about halfway through the upgrade I could salt '*' test.ping and get all 1196 responding hosts back. Now I get fewer responses displayed from the salt script at the command line, and the number is erratic (it might be a little over 1000 or closer to 1100).

The hosts all do actually return values, as I see by checking the results in the master job cache for the specified job. So it's just that the returns don't get back in time for the client to read them. I've tried extending the timeout on the test.ping and that doesn't seem to help.

Interestingly enough, the hosts whose responses don't come in in time are always in esams/codfw/ulsfo, never eqiad. I wonder if there might be some network setting at play, something that could be tuned. Back to scrying on the event bus.

Upped two more network settings, the packet queue length and some memory limits: https://gerrit.wikimedia.org/r/#/c/261195/

We now consistently see returns from salt '*' test.ping on neodymium come in within about 2 seconds when we monitor via the python event listener script. The salt client does not see them all in time, though; either I left something out of our package or there are still some extra sleeps in the event retrieval cycle eating up time until the timeout is reached. Note that the behavior is the same for e.g. -t 15: the command line client still doesn't catch all the returns in time. It gets most of them in the first two seconds and then waits around until the timeout is reached, with no new events retrieved. Checking.

https://gerrit.wikimedia.org/r/261329 fixes the above: the salt command line client was throwing away events in the ZMQ backlog because too many came in at once. With this patch the length of the backlog is configurable and uses the same setting as the master ZMQ backlog, since the maximum backlog for both should be about the same.

Now salt '*' test.ping takes about 2 seconds on neodymium. Consistently. There's still a syn flood warning on master restart so I need to bump that setting up a bit.

Jessie package tested on neodymium and works as advertised. https://gerrit.wikimedia.org/r/261334

The new wm2 packages are now installed on all production hosts except for: mw1041.eqiad.wmnet, technetium.eqiad.wmnet, mw1228.eqiad.wmnet, ms-be1011.eqiad.wmnet. Status of those hosts:

mw1041.eqiad.wmnet gives NXDOMAIN but still has salt keys, removing those

ms-be1011.eqiad.wmnet is listed as down in icinga
technetium.eqiad.wmnet, mw1228.eqiad.wmnet are not reachable by ssh nor by salt

Proceeding to lab instance updates.

Labs salt update

Because (as usual) a pile of instances have issues, I'm doing the old standby of the ssh loop, which will update salt only on hosts that use the labcontrol hosts as their salt master. This is checked by looking at the master key fingerprint in /etc/salt/minion on these instances before going ahead with the upgrade. This process will likely take several hours; I'm not doing anything in parallel or indeed even watching it run.
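
The loop itself is nothing fancy; something along these lines (a sketch only: the instance list, the config value checked, and the upgrade command are all illustrative):

```
#!/usr/bin/env python3
# Serial upgrade loop over ssh; only instances pointing at the labcontrol
# salt master get touched. Everything here is illustrative and depends on
# the environment.
import subprocess

EXPECTED_FINGER = "aa:bb:cc:..."  # placeholder for the labcontrol master key fingerprint

def ssh(host, command):
    return subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", host, command],
        capture_output=True, text=True, timeout=600)

with open("instances.txt") as instances:           # hypothetical instance list
    for host in (line.strip() for line in instances if line.strip()):
        check = ssh(host, "grep master_finger /etc/salt/minion")
        if check.returncode != 0 or EXPECTED_FINGER not in check.stdout:
            print("SKIP", host)
            continue
        result = ssh(host, "apt-get -y install salt-minion && service salt-minion restart")
        print("OK" if result.returncode == 0 else "FAIL", host)
```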

A few hundred instances were updated already via salt.

salt updated and responsive on all lab instances that don't have their own salt master, with the following exceptions:

towtruck.visualeditor.eqiad.wmflabs -- no route to host for ssh
sectools-web1.security-tools.eqiad.wmflabs -- no route to host for ssh
icinga.icinga.eqiad.wmflabs -- upgraded via ssh, no test.ping, was out of space (fixed). ferm??
puppet-testing.chasetest.eqiad.wmflabs -- connection reset by peer for ssh
tools-worker-06.tools.eqiad.wmflabs -- ssh hangs forever
mobile-hierator2.mobile.eqiad.wmflabs -- no route to host for ssh
mc2002.mdc-east.eqiad.wmflabs -- connection timed out for ssh
labs-dnsrecursor2.openstack.eqiad.wmflabs -- pdns-recursor dies, no dns service, can't even look up ldap servers. instance might get deleted.
tools-checker-01.tools.eqiad.wmflabs -- connection reset by peer for ssh
wmt-exec.wmt.eqiad.wmflabs -- connection refused for ssh

icinga is fixed; the remaining failure was due to the wrong key name (I saw a batch of "old-style" key names without the project in them and cleaned them out, but missed this one).

I lie: there are about 35 instances not yet upgraded that are nevertheless happily responsive to salt. I'll have to go in and deal with them by hand.

salt updated on deployment-prep except for deployment-restbase01 which is running sid. I haven't built sid packages and don't plan to.

After the update, one host is no longer responsive to salt: deployment-elastic06. To be investigated.

wikidata-stats.wikidata-dev.eqiad.wmflabs:

Minion did not return. [No response]

tools-worker-1002.tools.eqiad.wmflabs:

Minion did not return. [No response]

These hosts don't respond to test.ping right now, and I can't ssh into them either:
wikidata-stats.wikidata-dev.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs

All other lab instances that have the labcontrol1001 salt master have been updated and respond to ping.

I sent a second reminder mail asking folks to test neodymium as salt master for salt commands (not yet for git deploy). It's a copy of my email from Dec 30 so no new content.

It's working well for me: I ran salt commands for the entire cluster using host-based '*' matching (both in batches of 200 hosts and without batching) and it worked fine.

There's a slight discrepancy in active hosts, though: I'm getting 1200 results from salt, while there are 1208 systems in puppet according to servermon (but some minions might be explicitly shut down).

I can report the same; I just ran some queries that have long been unreliable, and now it's fast and apparently really, really reliable.

Didn't have any problems in the last two days of usage.

There are a few hosts with unaccepted keys, and 4 hosts with no response to test.ping, of which 2 are down in icinga, one isn't even listed in icinga (reinstallation?), and one is behaving badly for ssh as well. We'll always have such issues.

git-deploy moved to neodymium yesterday, and debdeploy was moved by Moritz today. We're giving it a couple of days for any problems to shake out; on Thursday palladium will be removed as salt master.

Please note that we will need to find a way to allow salt key signing during the reimaging/imaging of a server; this isn't a blocker for the decommission of palladium, but I'd keep this ticket open until then.

Neodymium is now the only salt master, and just to be sure, I've removed the salt-master package from palladium as well as the role.

The following hosts do not respond to test.ping on neodymium:

mw1172.eqiad.wmnet:
mw1217.eqiad.wmnet:
mw1228.eqiad.wmnet:
analytics1017.eqiad.wmnet:
mw1257.eqiad.wmnet:
mw2173.codfw.wmnet:
mw1178.eqiad.wmnet:

Of those, all are down (no ping) except analytics1017, which doesn't permit me to ssh in either.

In addition, rhodium has a puppet cert but doesn't respond to ping or ssh, and it has no salt key; I have made sure that all other hosts with puppet certs have salt keys, and that all hosts with salt keys have certs in the puppet signed-cert directory.
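
The cross-check is just a set comparison of the two lists; a sketch, assuming the accepted minion keys in the default /etc/salt/pki/master/minions directory and a local copy of the puppet signed-cert listing (path illustrative):

```
# Cross-check puppet's signed certs against salt's accepted minion keys.
# The salt key directory is the default on the master; the puppet cert
# listing is assumed to have been copied over from the puppetmaster.
import os

puppet_certs = {name[:-len(".pem")]
                for name in os.listdir("puppet-signed-certs")   # hypothetical local copy
                if name.endswith(".pem")}
salt_keys = set(os.listdir("/etc/salt/pki/master/minions"))

print("puppet cert but no salt key:", sorted(puppet_certs - salt_keys))
print("salt key but no puppet cert:", sorted(salt_keys - puppet_certs))
```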

rhodium is waiting to be installed; its puppet cert is dead so I tossed it. Not responding to salt as of today are analytics1017.eqiad.wmnet (still no ssh either) and three hosts that aren't pingable and are listed as down in icinga. Calling this done, except for the wmf-reimage fixes.

And it's done, even with the blocking task about parsoid still listed; they have a workaround they have been using for months. Thanks, Joe, for fixing up wmf-reimage.