
[Trebuchet] Salt times out on parsoid restarts
Closed, Declined · Public

Description

[13:29] <gwicke> we got an early timeout from salt during the parsoid deploy, which seems to be a salt misconfiguration
[13:31] <gwicke> I see cmd = ("sudo salt-call -l quiet --out json publish.runner "
[13:31] <gwicke> "deploy.restart '{0}','{1}'")
[13:32] <gwicke> in /usr/local/bin/service-restart
[13:32] <gwicke> on tin
[13:34] <gwicke> from http://docs.saltstack.com/ref/cli/salt.html: -t TIMEOUT, default 5
[13:35] <gwicke> I think I got the timeout around 5 seconds
[13:36] <gwicke> parsoid needs up to about 90 seconds for a graceful restart


Version: wmf-deployment
Severity: normal

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:52 AM
bzimport set Reference to bz61882.

From Ryan's mail:
TL;DR:

Ensure batching doesn't occur by setting the batch value to 100%:

service-restart --repo 'parsoid/deploy' --batch='100%'

Long version:

I spent some time tracking this down tonight. From the master the following commands work reliably:

1: salt -b '10%' -G 'deployment_target:parsoid' service.restart parsoid
2: salt -b '10%' -G 'deployment_target:parsoid' deploy.restart 'parsoid/deploy'
3: salt-run deploy.restart 'parsoid/deploy' '10%'

#2 is a wrapper function for service.restart that maps a repo to a service, then restarts it (it has other future uses that make it necessary as well). #3 is a runner that can be easily referenced from peers like tin to make handling security easier; it basically just calls #2.
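For illustration only, a minimal sketch of the rough shape of #2 as described above, assuming a simple repo-to-service mapping; the names and the mapping are hypothetical and the real trebuchet module differs:

# Illustrative sketch, not the real deploy.restart execution module.
REPO_TO_SERVICE = {'parsoid/deploy': 'parsoid'}   # assumed mapping, for illustration

def restart(repo):
    service = REPO_TO_SERVICE[repo]
    # a real salt execution module would call __salt__['service.restart'](service) here
    return {'repo': repo, 'service': service}

print(restart('parsoid/deploy'))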

From tin we call:

sudo salt-call -l quiet --out json publish.runner deploy.restart 'parsoid/deploy','10%'

This calls #3 on the master via salt's publication system.

The problem here is that publish.runner has a default timeout. When that timeout is reached, the runner stops executing. Part of the issue is that it times out at all. The other part of this issue is that this command has no timeout argument: http://docs.saltstack.com/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner.

Since the command is being run as a batch, it splits the list of minions into chunks and calls salt deploy.restart on each chunk. If the sum of the time of all restarts is greater than the timeout of publish.runner, any minions that haven't been called yet never are. Since we currently have a small number of minions, this usually results in a single minion being left out. If we increased the number of minions, many more would be left out.
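A toy calculation of that failure mode (all numbers are illustrative, except that a graceful parsoid restart can take about 90 seconds per the description): if the total restart time exceeds the runner's timeout, the trailing batches are simply never called.

minions = 8
batch_size = max(1, int(minions * 0.10))          # '10%' batches
restart_seconds = 90                              # graceful parsoid restart can take this long
runner_timeout_seconds = 300                      # whatever timeout publish.runner ends up using

total_batches = -(-minions // batch_size)         # ceiling division
batches_before_timeout = runner_timeout_seconds // restart_seconds
left_out = max(0, (total_batches - batches_before_timeout) * batch_size)
print(f"{left_out} of {minions} minions never get their restart called")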

I've opened a bug for this:

https://github.com/saltstack/salt/issues/10814

I've also added a pull request to allow a timeout argument for publish.runner:

https://github.com/saltstack/salt/pull/10815

In the meantime it's possible to workaround this issue by simply ensuring batching doesn't occur by setting the batch value to 100%:

service-restart --repo 'parsoid/deploy' --batch='100%'

I'm assuming that since parsoid is doing graceful restarts, this shouldn't be a problem.

My reply:

Without batching this is taking down the Parsoid cluster for at least a few seconds:

a) Parsoid stops accepting new requests while restarting. This typically
takes a few seconds, but can take up to 60 seconds if all workers are busy.

b) After the restart the first request per worker will be slightly
slower as the JIT is doing its thing.

With Varnish backend retries, some of this might not result in
client-visible failures, although I am not sure that Varnish retries
non-idempotent POSTs. If it doesn't, page saves would be failing.

I think I'll rather (ask a root to) do a

dsh -g parsoid service parsoid restart

instead until the salt bug is fixed.

A root could also run this on the salt master:

salt-run deploy.restart 'parsoid/deploy' '10%'

(In reply to Ryan Lane from comment #3)

A root could also run this on the salt master:

salt-run deploy.restart 'parsoid/deploy' '10%'

Will that avoid the timeout?

Yes. The bug is in calling the runner from a peer. If it's called on the master directly, it works as expected.

Any of the three commands listed in comment 1 will work if run from the master.

(In reply to Ryan Lane from comment #3)

A root could also run this on the salt master:

salt-run deploy.restart 'parsoid/deploy' '10%'

To confirm: I ran that yesterday as root, got a "completed" after a little while, and per Gabriel the result was good.

21:07 mutante: restarting parsoid with salt-run deploy.restart 'parsoid/deploy' '10%'

Ryan Lane wrote:

The problem here is that publish.runner has a default timeout. When that
timeout is reached, the runner stops executing. Part of the issue is that it
times out at all. The other part of this issue is that this command has no
timeout argument:
http://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner.

<snip>

I've also added a pull request to allow a timeout argument for
publish.runner:

https://github.com/saltstack/salt/pull/10815

Ryan's patch to add a timeout to salt.modules.publish.runner has been merged (as commit 72dfd80e in https://github.com/saltstack/salt/ ). It is included in:

$ git tag --contains 72dfd80e
v2014.7
v2014.7.0rc1
v2014.7.0rc2
$

@greg can you tell me is this really high priority with no assignee?

greg lowered the priority of this task from High to Medium. Jan 6 2015, 11:45 PM

@greg can you tell me is this really high priority with no assignee?

nope.

So I guess the 'fix' is to upgrade to a newer version of salt?

So I guess the 'fix' is to upgrade to a newer version of salt?

As per my earlier comment, yes. The fix is included in v2014.7.

Now that salt is upgraded, we need to fix up /usr/local/bin/service-restart so that it takes a timeout, and update the format of the args it passes to publish.runner.
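As a rough sketch of what the updated command construction in service-restart might look like, assuming the newer publish.runner accepts a list-style arg and a timeout keyword; the exact format and values here are assumptions, not the merged change:

# Hypothetical sketch only; compare with the original cmd quoted in the description.
cmd = ("sudo salt-call -l quiet --out json publish.runner "
       "deploy.restart \"['{0}','{1}']\" timeout={2}")
print(cmd.format('parsoid/deploy', '100%', 300))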

Change 210877 had a related patch set uploaded (by Greg Grossmeier):
service-restart: fix up for new salt version

https://gerrit.wikimedia.org/r/210877

Change 210877 merged by ArielGlenn:
service-restart: fix up for new salt version

https://gerrit.wikimedia.org/r/210877

I chatted with akosiaris about this. I had not been able to get anything useful after digging around, but I will have another go at it this week.

I've looked at this some over the weekend. Some notes:

It looks like the timeout is not passed through from service-restart all the way to the low-level function that needs it, but is overridden somewhere in the middle; I will have to double-check this.

In the meantime the batch queries appear not to be properly staggered, judging from the check-in times from the minions to redis. What we want is for one batch not to start until all the restarts in the previous batch are done, and that means checking that the restarts actually happened. Salt, being async, does no such thing of course, so it would either need to be built into the service-restart script, or a wait of a set number of seconds, configurable per repo, would have to be added as an option.
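As a rough sketch of that second option (fire a batch, then wait a configurable number of seconds before the next one); the function names and the waiting strategy are assumptions, not the actual service-restart code:

import time

def restart_in_batches(minions, batch_size, fire_restart, wait_seconds=90):
    # Minimal sketch: salt gives no confirmation, so just sleep between batches.
    for i in range(0, len(minions), batch_size):
        for minion in minions[i:i + batch_size]:
            fire_restart(minion)        # async salt call; completion is not checked
        time.sleep(wait_seconds)        # crude stand-in for waiting until the batch is done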

And speaking of redis returners, it looks like the patch allowing returners to be specified to publish.runner is in the salt version we use, but I don't see returns getting into redis from the runs. Either way this should get fixed, whether I backport a patch or it's already there. Almost certainly having returns come back to the master is contributing to cscott's problem (T102039). And not having the returns stored someplace, with salt being async and the returns perhaps coming in after the script has completed, is just wrong.

And finally, the output displayed by the git deploy reporting mechanism is only good if all has gone well and the run completes on the same day it started. If something goes awry it will blithely report the difference in minutes, ignoring the fact that the old completed run might have finished days or months ago. (Saw this in git-info for the restart job for parsoid/deploy.)
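To illustrate that reporting pitfall with made-up dates: diffing only the time-of-day silently drops whole days, while diffing full timestamps does not.

from datetime import datetime

start = datetime(2015, 11, 2, 21, 7)
end = datetime(2016, 2, 8, 21, 12)           # a return that actually arrived months later

same_day_minutes = (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)
real_minutes = (end - start).total_seconds() / 60
print(same_day_minutes, real_minutes)        # 5 vs 141125.0 minutes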

I'll be slugging away at this some more over the next couple of days.

(12:32:36 AM) subbu: ssastry@mira:/srv/deployment/parsoid/deploy$ git deploy service restart
(12:32:36 AM) subbu: Error received from salt; raw output:
(12:32:36 AM) subbu: 'deploy.restart' runner publish timed out
(12:32:37 AM) subbu: ssastry@mira:/srv/deployment/parsoid/deploy$

so this is still the main issue, need to poke at it some in the next few days.

OK, tracked it down to runners/publish.py, in runner(): the send call needs to have the timeout keyword passed to it, and then no SaltReqTimeoutError is raised until the user-specified timeout is reached. I.e.

return sreq.send(load, timeout=timeout)

At this point it would then be up to the user to decide how long a timeout to use (e.g. 300 seconds, if that's long enough for parsoid to restart gracefully).
How long does parsoid wait before killing processes with prejudice?
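As a self-contained illustration of the shape of that bug and fix (send() here is just a stand-in for sreq.send; none of this is the actual salt source):

def send(load, timeout=5):
    # stand-in for sreq.send, which enforces `timeout` and raises SaltReqTimeoutError in real salt
    return {'load': load, 'timeout_used': timeout}

def runner_before_fix(fun, arg, timeout=300):
    return send({'fun': fun, 'arg': arg})                   # timeout silently dropped; default 5s applies

def runner_after_fix(fun, arg, timeout=300):
    return send({'fun': fun, 'arg': arg}, timeout=timeout)  # the caller's timeout actually reaches the send

print(runner_after_fix('deploy.restart', ['parsoid/deploy', '10%'], timeout=300))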

On Monday, if there is another parsoid deploy, I can live-hack in that one-line change for testing purposes, to verify it for git deploy restart. (I've been testing locally against a custom "slow runner" instead.)

Hm, testing failed for want of an argument to the git deploy restart code (trigger), and service-restart no longer gets deployed. We shouldn't really require people to run the salt-call command, so I'll look into this and fix it up in the next day or so.

https://gerrit.wikimedia.org/r/#/c/269450/ fixes the runner so it has a timeout option. The other two places that need fixes are in trigger: one in shell.py and one in the actual driver (local.py).

Change 269465 had a related patch set uploaded (by ArielGlenn):
adapt trebuchet-trigger for timeout to restart function

https://gerrit.wikimedia.org/r/269465

The next round of salt packages will include the changes to the runner publish function. There is still an issue in trebuchet-trigger with the proper calculation of the timeout.

I have built packages that contain a few other changes as well, see https://gerrit.wikimedia.org/r/#/c/276191/

This patch may or may not get the timeout right for one batch. We'll see.
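One plausible way to derive an overall publish timeout from the batch settings, as an illustration; this is an assumed formula, not the trebuchet-trigger calculation under review:

import math

def overall_timeout(num_minions, batch_pct, per_restart_timeout, margin=30):
    # assumed formula: one batch restarts at a time, each allowed per_restart_timeout seconds
    batch_size = max(1, math.floor(num_minions * batch_pct / 100))
    num_batches = math.ceil(num_minions / batch_size)
    return num_batches * per_restart_timeout + margin

print(overall_timeout(num_minions=24, batch_pct=10, per_restart_timeout=90))   # 1110 seconds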

ArielGlenn moved this task from active to testing needed on the Salt board.
demon subscribed.

Nobody cares about Trebuchet anymore.

Change 269465 abandoned by ArielGlenn:
adapt trebuchet-trigger for timeout to restart function

Reason:
salt is gone.

https://gerrit.wikimedia.org/r/269465