
salt minions need 'wake up' test.ping after idle period before they respond properly to commands
Closed, ResolvedPublic

Description

This is likely to be several bugs in one; we'll log them as we find them.

Various people have reported, and I have seen myself (including on the new salt master neodymium), that a test.ping to some group of minions, or even to a single minion, may yield few to no responses if it's been a while (hours? a day?) since the last salt command was run.

All of these cases need to be isolated and fixed.

Event Timeline

ArielGlenn claimed this task.
ArielGlenn raised the priority of this task from to Needs Triage.
ArielGlenn updated the task description.
ArielGlenn added a project: Salt.
ArielGlenn subscribed.

I have located one of these issues. The master rotates its aes key every 24 hours after startup; it also rotates the key after salt-key -d is used from the command line. We have the configuration setting "ping_on_rotate: True" in the salt master config file; this should cause the master to run a test.ping of all minions right after the key is rotated, so that they do not later need a "wake-up".
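
For reference, the relevant master settings look roughly like this (a minimal sketch; the rotation interval shown is the documented default, and the exact values in our config may differ):

    # /etc/salt/master (illustrative)
    publish_session: 86400   # rotate the aes key every 24 hours
    ping_on_rotate: True     # test.ping all minions right after rotation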

This functionality is broken for installations that have the minion data cache on the master disabled. (Disabling it is promoted as one of the things to do when running into resource issues.) In such cases the master generates an empty list of connected minions to ping. When I used an earlier version of the code, which simply pings all minions regardless, it functioned properly. Upstream bug report is here: https://github.com/saltstack/salt/issues/29525
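
The shape of the problem and of the all-minions fallback, as a minimal Python sketch (list_connected_minions is a hypothetical helper; this is not the exact upstream code):

    import salt.client

    def ping_minions_after_rotate(opts):
        client = salt.client.LocalClient()
        if opts.get('minion_data_cache'):
            # Normal path: ping only the minions the cache says are connected.
            tgt, form = list_connected_minions(opts), 'list'  # hypothetical helper
        else:
            # With the cache disabled the connected list comes back empty and
            # nothing gets pinged; falling back to a glob pings everyone.
            tgt, form = '*', 'glob'
        client.cmd(tgt, 'test.ping', expr_form=form)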

I'll be applying a three-line patch for that, and then doing more testing/checks to see whether the issue recurs.

The next category of problem looks like this:

Master is running, minions are running
On neodymium, I stop the new master and restart it
A little while later (10-15 minutes?) I try from neodymium to test.ping the neodymium minion: no response
I try to test.ping the neodymium minion from palladium: no response
The minion is connected on 4505 and 4506 to both masters as shown by netstat.
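
(The netstat check is along these lines; the ports are from the text above, the exact invocation is illustrative:)

    netstat -tn | grep -E ':450[56]'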

I hit this completely by chance, and I can't reproduce the broken behavior in a labs setup: with two masters and one master/minion running my patches under jessie, and with the same master/minion conf settings, I am unable to break the minion after repeated tests.

Strace of the minion when the test.ping command is sent to it from neodymium is just about the same as for test.ping to a working jessie minion, with the exception that the working minion reads its key and does the auth dance with the master, as it does for every command; that's expected behavior.

In particular, both minions, the patched broken one and the unpatched working one, connect back to the master on 4506 and send a packet which I guess contains returns information. (Encrypted payload, so hard to tell.) They both do the proper ZMQ negotiation to set up the channel. In any case I can see that the test.ping arrives at both minions and is processed, and some output is sent back to the master. But the master never logs anything to the job cache from the minion, and never puts anything on the event bus saying that the job's returns (or anything else) have arrived.
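
One way to watch the event bus for those returns while testing, assuming the state.event runner is available in this salt version:

    salt-run state.event pretty=True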

I'm leaving the minion/master on neodymium as they are and continuing testing. PLEASE don't restart them for the next day at least.

Well that was painful.

The SAuth class keeps a cache of keys per master/minion combination (in theory one can run multiple minions on a host, each one talking to a different master, for HA purposes). This class returns a singleton so that the cache is used across all modules and in all minion calls.
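
A minimal sketch of that per-key singleton pattern (names here are illustrative, not the exact salt code):

    class SAuth(object):
        instances = {}  # one cached instance per (master, minion) pair

        def __new__(cls, opts):
            key = (opts['master_uri'], opts['id'])
            if key not in cls.instances:
                new_auth = object.__new__(cls)
                new_auth.singleton_init(opts)  # hypothetical: does the auth handshake
                cls.instances[key] = new_auth
            return cls.instances[key]

The catch, as described below, is that every code path holding one of these cached instances has to refresh or invalidate it when the master rotates its key.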

However, managing this cache and cleaning it up properly everywhere is another thing entirely, and I missed one (maybe two) of those places in the patch backport.

What must have been happening is this: the minion re-authenticates after master aes key rotation, as it is supposed to do, and reads and runs commands, but when it comes to setting up the return channel back to the master with the output of those commands, the setup can fail under the right conditions because the old key is used during channel setup. It fails silently, which is pretty annoying. I've fixed up the patch (from upstream commits) and it's running now on neodymium only.

Dereckson renamed this task from salt minions need 'wake up' test.ping after idle period before they respond properly to comands to salt minions need 'wake up' test.ping after idle period before they respond properly to commands. Dec 14 2015, 4:59 PM
Dereckson set Security to None.

I'll be applying a three-line patch for that, and then doing more testing/checks to see whether the issue recurs.

@ArielGlenn: Could you refer to this ticket in the Gerrit commit, so it appears here and we can read along?

The next category of problem looks like this:

Why "next category"? What exactly is resolved in the "previous category"?

Master is running, minions are running
On neodymium, I stop the new master and restart it
A little while later (10-15 minutes?) I try from neodymium to test.ping the neodymium minion: no response
I try to test.ping the neodymium minion from palladium: no response
The minion is connected on 4505 and 4506 to both masters as shown by netstat.

I hit this completely by chance, and I can't reproduce the broken behavior in a labs setup: with two masters and one master/minion running my patches under jessie, and with the same master/minion conf settings, I am unable to break the minion after repeated tests.

Strace of the minion when the test.ping command is sent to it from neodymium is just about the same as for test.ping to a working jessie minion, with the exception that the working minion reads its key and does the auth dance with the master, as it does for every command; that's expected behavior.

Perhaps you could post the (relevant) strace output here?

In particular, both minions, the patched broken one and the unpatched working one, connect back to the master on 4506 and send a packet which I guess contains returns information. (Encrypted payload, so hard to tell.) They both do the proper ZMQ negotiation to set up the channel. In any case I can see that the test.ping arrives at both minions and is processed, and some output is sent back to the master. But the master never logs anything to the job cache from the minion, and never puts anything on the event bus saying that the job's returns (or anything else) have arrived.

Is the 'broken' minion expected to do the "auth dance" before or after the above?

I'll be applying a three-line patch for that, and then doing more testing/checks to see whether the issue recurs.

@ArielGlenn: Could you refer to this ticket in the Gerrit commit, so it appears here and we can read along?

Ok, I missed that. But here it is: https://gerrit.wikimedia.org/r/#/c/259671/
And tests show this hack works fine. It's not nice enough for upstream but it will do for us.

The next category of problem looks like this:

Why "next category"? What exactly is resolved in the "previous category"?

Next subcategory of "minions are asleep and don't respond to the first command after some idle period".
The previous one, now resolved, was "minions don't respond to the first command after master aes key rotation."

Master is running, minions are running
On neodymium, I stop the new master and restart it
A little while later (10-15 minutes?) I try from neodymium to test.ping the neodymium minion: no response
I try to test.ping the neodymium minion from palladium: no response
The minion is connected on 4505 and 4506 to both masters as shown by netstat.

I hit this completely by chance, and I can't reproduce the broken behavior in a labs setup: with two masters and one master/minion running my patches under jessie, and with the same master/minion conf settings, I am unable to break the minion after repeated tests.

Strace of the minion when the test.ping command is sent to it from neodymium is just about the same as for test.ping to a working jessie minion, with the exception that the working minion reads its key and does the auth dance with the master, as it does for every command; that's expected behavior.

Perhaps you could post the (relevant) strace output here?

I should have, but I managed to close that window. It's identical to the strace from a working minion but without the public key junk in it (as it has cached authentication information to use). If we hit another such issue I'll make sure the strace gets in.

In particular, both minions, the patched broken one and the unpatched working one, connect back to the master on 4506 and send a packet which I guess contains returns information. (Encrypted payload, so hard to tell.) They both do the proper ZMQ negotiation to set up the channel. In any case I can see that the test.ping arrives at both minions and is processed, and some output is sent back to the master. But the master never logs anything to the job cache from the minion, and never puts anything on the event bus saying that the job's returns (or anything else) have arrived.

Is the 'broken' minion expected to do the "auth dance" before or after the above?

Auth should be done once by the minion upon receipt of the test.ping and failure to decrypt the packet with the command in it. After that it should send back returns using the new key.
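
A rough sketch of that minion-side flow (Python-flavored pseudocode; all of these names are illustrative):

    def handle_publish(payload, auth):
        try:
            job = auth.decrypt(payload)
        except AuthenticationError:
            auth.authenticate()         # the one-time re-auth after key rotation
            job = auth.decrypt(payload)
        result = run_job(job)
        # The bug fixed above: this step could pick up a stale cached
        # credential instead of the freshly negotiated key.
        send_return(result, auth.creds)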

Note that unlike master key rotation, there is no 'ping_on_startup' option for the master, though it has been discussed upstream; while this sounds attractive, there would be no way to be sure all minions had reconnected to the master before the ping was sent out.

So the master and minion on neodymium have been behaving well for the past few days, and pings across all minions are reasonable given that the rest of the minions are still authing on every command. I'm going to roll up test packages for jessie from here: https://gerrit.wikimedia.org/r/#/c/259674/ and put them out on some of the jessie minions to see how they behave.

Bleah, I had broken packages and I'm not sure how I managed it, but piles of code from the singleton auth change were missing from them. Sorted; jessie packages deployed manually on restbase1001 and they seem ok. Trusty and precise packages built, not yet tested.

Trusty and precise packages now deployed manually on analytics1001 and dataset1001 after labs testing. They look good. The next move will be to extend testing to clusters (all restbase, all analytics, all .. something for precise ;-) ), and after that all of prod (which can be done without touching labs instances). If no new issues are encountered with scaling, I'll send email with a call for folks to use neodymium and report any problems.

precise running on ms-{bf}e* in esams; trusty running on analytics103* in eqiad; jessie running on restbase2* in codfw. jessie required an ssh loop to get the installs to complete; it seems they are interrupted in the middle as salt is replaced. I recall this from the previous update, though we had only a few jessie instances then. Need to find a workaround for that.

precise running on ms-{bf}e* in esams; trusty running on analytics103* in eqiad; jessie running on restbase2* in codfw. jessie required an ssh loop to get the installs to complete; it seems they are interrupted in the middle as salt is replaced. I recall this from the previous update, though we had only a few jessie instances then. Need to find a workaround for that.

I have seen the same problem with debdeploy: The debdeploy-minion package had a postinst which restarted the salt-minion (and that trashed itself, since the restart of the minion made the postinst fail, resulting in a broken dpkg package status). My fix was to no longer run the restart in the postinst, but to decouple it by running the restarts after installation.

...

I have seen the same problem with debdeploy: The debdeploy-minion package had a postinst which restarted the salt-minion (and that trashed itself, since the restart of the minion made the postinst fail, resulting in a broken dpkg package status). My fix was to no longer run the restart in the postinst, but to decouple it by running the restarts after installation.

In this case the installation itself is interrupted, because of the issue described at https://github.com/saltstack/salt/issues/7997#issuecomment-160913751

I've got a workaround going with 'at now'.
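
A sketch of the idea, assuming the host list and the package name (the real loop may have differed):

    # 'at now' detaches the install from the ssh session and from salt itself,
    # so replacing salt-minion can't kill the process driving the install.
    for host in $(cat jessie_hosts); do
        ssh "$host" "echo 'apt-get -y install salt-minion' | at now"
    done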

For the record, also updated on snapshot1001 and snapshot1002 and on restbase1003 while testing the installation process.

New packages are in the repo and installed on all production hosts except neodymium (still running with local patches), the labcontrol hosts (which will get done when labs gets converted over), mw1041 (broken), and mw1228 (broken).

There may still be a small delay in minion response right after salt key deletion or after a salt master restart, but that's to be expected; at those times authentication must happen and all minions will need to go through the process at the same time.
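
If that thundering-herd re-auth ever becomes a problem, there are minion-side settings to stagger it; a sketch, assuming random_reauth_delay behaves as documented (the value is illustrative):

    # /etc/salt/minion
    random_reauth_delay: 60   # spread re-auth over up to 60 seconds after a key change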

Closing finally (whew!)