
[Trebuchet] Salt times out on parsoid restarts
Closed, Declined · Public

Description

[13:29] <gwicke> we got an early timeout from salt during the parsoid deploy, which seems to be a salt misconfiguration
[13:31] <gwicke> I see cmd = ("sudo salt-call -l quiet --out json publish.runner "
[13:31] <gwicke> "deploy.restart '{0}','{1}'")
[13:32] <gwicke> in /usr/local/bin/service-restart
[13:32] <gwicke> on tin
[13:34] <gwicke> from http://docs.saltstack.com/ref/cli/salt.html: -t TIMEOUT, default 5
[13:35] <gwicke> I think I got the timeout around 5 seconds
[13:36] <gwicke> parsoid needs up to about 90 seconds for a graceful restart


Version: wmf-deployment
Severity: normal

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:52 AM
bzimport set Reference to bz61882.

From Ryan's mail:
TL;DR:

Ensure batching doesn't occur by setting the batch value to 100%:

service-restart --repo 'parsoid/deploy' --batch='100%'

Long version:

I spent some time tracking this down tonight. From the master the following commands work reliably:

1: salt -b '10%' -G 'deployment_target:parsoid' service.restart parsoid
2: salt -b '10%' -G 'deployment_target:parsoid' deploy.restart 'parsoid/deploy'
3: salt-run deploy.restart 'parsoid/deploy' '10%'

#2 is a wrapper function for service.restart that maps a repo to a service, then restarts it (it has other future uses that make it necessary as well). #3 is a runner that can be easily referenced from peers like tin to make handling security easier; it basically just calls #2.
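For illustration only, a minimal sketch of the rough shape of #2 as described above, assuming a simple repo-to-service mapping; the names and the mapping are hypothetical and the real trebuchet module differs:

# Illustrative sketch, not the real deploy.restart execution module.
REPO_TO_SERVICE = {'parsoid/deploy': 'parsoid'}   # assumed mapping, for illustration

def restart(repo):
    service = REPO_TO_SERVICE[repo]
    # a real salt execution module would call __salt__['service.restart'](service) here
    return {'repo': repo, 'service': service}

print(restart('parsoid/deploy'))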

From tin we call:

sudo salt-call -l quiet --out json publish.runner deploy.restart 'parsoid/deploy','10%'

This calls #3 on the master via salt's publication system.

The problem here is that publish.runner has a default timeout. When that timeout is reached, the runner stops executing. Part of the issue is that it times out at all. The other part of this issue is that this command has no timeout argument: http://docs.saltstack.com/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner.

Since the command is being run as a batch, it splits the list of minions into chunks and calls salt deploy.restart on each chunk. If the sum of the time of all restarts is greater than the timeout of publish.runner, any minions that haven't been called yet never are. Since we currently have a small number of minions, this usually results in a single minion being left out. If we increased the number of minions, many more would be left out.
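A toy calculation of that failure mode (all numbers are illustrative, except that a graceful parsoid restart can take about 90 seconds per the description): if the total restart time exceeds the runner's timeout, the trailing batches are simply never called.

minions = 8
batch_size = max(1, int(minions * 0.10))          # '10%' batches
restart_seconds = 90                              # graceful parsoid restart can take this long
runner_timeout_seconds = 300                      # whatever timeout publish.runner ends up using

total_batches = -(-minions // batch_size)         # ceiling division
batches_before_timeout = runner_timeout_seconds // restart_seconds
left_out = max(0, (total_batches - batches_before_timeout) * batch_size)
print(f"{left_out} of {minions} minions never get their restart called")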

I've opened a bug for this:

https://github.com/saltstack/salt/issues/10814

I've also added a pull request to allow a timeout argument for publish.runner:

https://github.com/saltstack/salt/pull/10815

In the meantime it's possible to workaround this issue by simply ensuring batching doesn't occur by setting the batch value to 100%:

service-restart --repo 'parsoid/deploy' --batch='100%'

I'm assuming that since parsoid is doing graceful restarts, this shouldn't be a problem.

My reply:

Without batching this is taking down the Parsoid cluster for at least a few seconds:

a) Parsoid stops accepting new requests while restarting. This typically
takes a few seconds, but can take up to 60 seconds if all workers are busy.

b) After the restart the first request per worker will be slightly
slower as the JIT is doing its thing.

With Varnish backend retries, some of this might not result in
client-visible failures, although I am not sure that Varnish retries
non-idempotent POSTs. If it doesn't, page saves would be failing.

I think I'll rather (ask a root to) do a

dsh -g parsoid service parsoid restart

instead until the salt bug is fixed.

A root could also run this on the salt master:

salt-run deploy.restart 'parsoid/deploy' '10%'

(In reply to Ryan Lane from comment #3)

A root could also run this on the salt master:

salt-run deploy.restart 'parsoid/deploy' '10%'

Will that avoid the timeout?

Yes. The bug is in calling the runner from a peer. If it's called on the master directly, it works as expected.

Any of the three commands listed in comment 1 will work if run from the master.

(In reply to Ryan Lane from comment #3)

A root could also run this on the salt master:

salt-run deploy.restart 'parsoid/deploy' '10%'

To confirm: I ran that yesterday as root, got a "completed" after a little while, and per Gabriel the result was good.

21:07 mutante: restarting parsoid with salt-run deploy.restart 'parsoid/deploy' '10%'

Ryan Lane wrote:

The problem here is that publish.runner has a default timeout. When that
timeout is reached, the runner stops executing. Part of the issue is that it
times out at all. The other part of this issue is that this command has no
timeout argument:
http://docs.saltstack.com/en/latest/ref/modules/all/salt.modules.publish.html#salt.modules.publish.runner.

<snip>

I've also added a pull request to allow a timeout argument for
publish.runner:

https://github.com/saltstack/salt/pull/10815

Ryan's patch to add a timeout to salt.modules.publish.runner has been merged (as commit 72dfd80e in https://github.com/saltstack/salt/ ). It is included in:

$ git tag --contains 72dfd80e
v2014.7
v2014.7.0rc1
v2014.7.0rc2
$

@greg can you tell me is this really high priority with no assignee?

greg lowered the priority of this task from High to Medium. Jan 6 2015, 11:45 PM

@greg can you tell me is this really high priority with no assignee?

nope.

So I guess the 'fix' is to upgrade to a newer version of salt?

So I guess the 'fix' is to upgrade to a newer version of salt?

As per my earlier comment, yes. The fix is included in v2014.7.

Now that salt is upgraded, we need to fix up /usr/local/bin/service-restart so that it takes a timeout, and update the format of the args it passes to publish.runner.
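As a rough sketch of what the updated command construction in service-restart might look like, assuming the newer publish.runner accepts a list-style arg and a timeout keyword; the exact format and values here are assumptions, not the merged change:

# Hypothetical sketch only; compare with the original cmd quoted in the description.
cmd = ("sudo salt-call -l quiet --out json publish.runner "
       "deploy.restart \"['{0}','{1}']\" timeout={2}")
print(cmd.format('parsoid/deploy', '100%', 300))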

Change 210877 had a related patch set uploaded (by Greg Grossmeier):
service-restart: fix up for new salt version

https://gerrit.wikimedia.org/r/210877

Change 210877 merged by ArielGlenn:
service-restart: fix up for new salt version

https://gerrit.wikimedia.org/r/210877

I chatted with akosiaris about this. I had not been able to get anything useful after digging around, but I will have another go at it this week.

I've looked at this some over the weekend. Some notes:

It looks like the timeout is not passed through from service-restart all the way to the low-level function that needs it, but is overridden somewhere in the middle; I will have to double-check this.

In the meantime the batch queries appear not to be properly staggered, judging from the check-in times from the minions to redis. What we want is for one batch not to start until all the restarts in the previous batch are done, and that means checking that the restarts actually happened. Salt, being async, does no such thing of course, so it would either need to be built into the service-restart script, or a wait of a set number of seconds, configurable per repo, would have to be added as an option.
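As a rough sketch of that second option (fire a batch, then wait a configurable number of seconds before the next one); the function names and the waiting strategy are assumptions, not the actual service-restart code:

import time

def restart_in_batches(minions, batch_size, fire_restart, wait_seconds=90):
    # Minimal sketch: salt gives no confirmation, so just sleep between batches.
    for i in range(0, len(minions), batch_size):
        for minion in minions[i:i + batch_size]:
            fire_restart(minion)        # async salt call; completion is not checked
        time.sleep(wait_seconds)        # crude stand-in for waiting until the batch is done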

And speaking of redis returners, it looks like the patch allowing returners to be specified to publish.runner is in the salt version we use, but I don't see returns getting into redis from the runs. Either way this should get fixed, whether I backport a patch or it's already there. Almost certainly having returns come back to the master is contributing to cscott's problem (T102039). And not having the returns stored someplace, with salt being async and the returns perhaps coming in after the script has completed, is just wrong.

And finally, the output displayed by the git deploy reporting mechanism is only good if all has gone well and the run completes on the same day it started. If something goes awry it will blithely report the difference in minutes, ignoring the fact that the old completed run might have finished days or months ago. (Saw this in git-info for the restart job for parsoid/deploy.)
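To illustrate that reporting pitfall with made-up dates: diffing only the time-of-day silently drops whole days, while diffing full timestamps does not.

from datetime import datetime

start = datetime(2015, 11, 2, 21, 7)
end = datetime(2016, 2, 8, 21, 12)           # a return that actually arrived months later

same_day_minutes = (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)
real_minutes = (end - start).total_seconds() / 60
print(same_day_minutes, real_minutes)        # 5 vs 141125.0 minutes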

I'll be slugging away at this some more over the next couple of days.

(12:32:36 AM) subbu: ssastry@mira:/srv/deployment/parsoid/deploy$ git deploy service restart
(12:32:36 AM) subbu: Error received from salt; raw output:
(12:32:36 AM) subbu: 'deploy.restart' runner publish timed out
(12:32:37 AM) subbu: ssastry@mira:/srv/deployment/parsoid/deploy$

so this is still the main issue, need to poke at it some in the next few days.

OK, tracked it down to runners/publish.py, in runner(): the send call needs to have the timeout keyword passed to it, and then no SaltReqTimeoutError is raised until the user-specified timeout is reached. I.e.

return sreq.send(load, timeout=timeout)

At this point it would then be up to the user to decide how long a timeout to use (e.g. 300 seconds, if that's long enough for parsoid to restart gracefully).
How long does parsoid wait before killing processes with prejudice?
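As a self-contained illustration of the shape of that bug and fix (send() here is just a stand-in for sreq.send; none of this is the actual salt source):

def send(load, timeout=5):
    # stand-in for sreq.send, which enforces `timeout` and raises SaltReqTimeoutError in real salt
    return {'load': load, 'timeout_used': timeout}

def runner_before_fix(fun, arg, timeout=300):
    return send({'fun': fun, 'arg': arg})                   # timeout silently dropped; default 5s applies

def runner_after_fix(fun, arg, timeout=300):
    return send({'fun': fun, 'arg': arg}, timeout=timeout)  # the caller's timeout actually reaches the send

print(runner_after_fix('deploy.restart', ['parsoid/deploy', '10%'], timeout=300))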

On Monday, if there is another parsoid deploy, I can live-hack in that one-line change for testing purposes, to verify it for git deploy restart. (I've been testing locally against a custom "slow runner" instead.)

Hm, testing failed for want of an argument to the git deploy restart code (trigger), and service-restart no longer gets deployed. We shouldn't really require people to run the salt-call command, so I'll look into this and fix it up in the next day or so.

https://gerrit.wikimedia.org/r/#/c/269450/ fixes the runner so it has a timeout option. The other two places that need fixes are in trigger: one in shell.py and one in the actual driver (local.py).

Change 269465 had a related patch set uploaded (by ArielGlenn):
adapt trebuchet-trigger for timeout to restart function

https://gerrit.wikimedia.org/r/269465

The next round of salt packages will include the changes to the runner publish function. There is still an issue in trebuchet-trigger with the proper calculation of the timeout.

I have built packages that contain a few other changes as well, see https://gerrit.wikimedia.org/r/#/c/276191/

This patch may or may not get the timeout right for one batch. We'll see.
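One plausible way to derive an overall publish timeout from the batch settings, as an illustration; this is an assumed formula, not the trebuchet-trigger calculation under review:

import math

def overall_timeout(num_minions, batch_pct, per_restart_timeout, margin=30):
    # assumed formula: one batch restarts at a time, each allowed per_restart_timeout seconds
    batch_size = max(1, math.floor(num_minions * batch_pct / 100))
    num_batches = math.ceil(num_minions / batch_size)
    return num_batches * per_restart_timeout + margin

print(overall_timeout(num_minions=24, batch_pct=10, per_restart_timeout=90))   # 1110 seconds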

ArielGlenn moved this task from active to testing needed on the Salt board.
demon subscribed.

Nobody cares about Trebuchet anymore.

Change 269465 abandoned by ArielGlenn:
adapt trebuchet-trigger for timeout to restart function

Reason:
salt is gone.

https://gerrit.wikimedia.org/r/269465