
Depool / repool scripts execute successfully even when the host has not been (r|d)epooled
Closed, ResolvedPublic

Description

The depool / pool conftool scripts are used during Scap3 service deploys to depool and repool the target hosts before and after a service restart, respectively. However, on certain occasions the hosts are not actually (r|d)epooled, even though the scripts exit with exit code 0. This effect has been observed during Parsoid deploys. Specifically, some hosts were not repooled by Pybal, and when a second depool command was issued, the script reported that the host's state had changed from pooled=no to pooled=yes. The specifics of why this is occurring need to be investigated and fixed.

In the short term, we need to find a way to reliably check if a host has been (r|d)epooled and act accordingly (wait, force the command one more time, etc).
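As a short-term stopgap, a wrapper along these lines could issue the command and then verify the observed state, re-issuing the write if it did not take. This is a hedged sketch: `set_pooled` and `get_pooled` are hypothetical stand-ins for the actual confctl calls, and the timings are illustrative.

```python
import time

def change_and_verify(set_pooled, get_pooled, want, attempts=3, delay=2.0):
    """Set the pooled state, then poll until the observed state matches.

    set_pooled(want) issues the (de)pool command; get_pooled() reads the
    current state back. Returns True only once the state is confirmed,
    False if it never converges within the given attempts.
    """
    set_pooled(want)
    for _ in range(attempts):
        if get_pooled() == want:
            return True
        time.sleep(delay)
        # The write may have been lost: force the command one more time.
        set_pooled(want)
    return False
```

The key point is that the caller's exit code depends on the verified state, not on whether the command merely ran.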

Event Timeline

mobrovac created this task. · Sep 13 2016, 2:48 PM
Restricted Application added a subscriber: Aklapper. · Sep 13 2016, 2:48 PM
mobrovac updated the task description. · Sep 13 2016, 2:52 PM

Change 310454 had a related patch set uploaded (by Mobrovac):
Conftool: Create script that checks the state after (de)pooling

https://gerrit.wikimedia.org/r/310454

Dzahn triaged this task as High priority. · Sep 22 2016, 2:32 AM
Joe added a comment. · Oct 6 2016, 3:35 PM

Once T147480 is resolved, this ticket will be partially solved: at the very least, pool/depool will exit with a non-zero exit code if writing to etcd fails for any reason.
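With that change in place, callers can stop assuming success and abort the deploy on a non-zero exit. A minimal sketch, assuming only that the (de)pool command reports failure via its exit code; the wrapper itself is illustrative, not the actual scap integration:

```python
import subprocess
import sys

def run_or_abort(cmd):
    """Run a (de)pool command and abort the deploy if it exits non-zero."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print("command %r failed with exit code %d, aborting deploy"
              % (cmd, result.returncode), file=sys.stderr)
        sys.exit(result.returncode)
    return result.returncode
```

Note this only catches failures the tool itself detects (such as a failed etcd write); it does not verify that pybal actually acted on the change.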

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2016-10-06T15:41:21Z] <_joe_> upgrading conftool to 0.3.1 on all mw*, wtp* servers, T147480 T145518

Joe moved this task from Backlog to Blocking others on the User-Joe board.
Joe moved this task from Blocking others to Doing on the User-Joe board. · Oct 7 2016, 10:05 AM
Joe added a comment. · Oct 7 2016, 10:13 AM

In the larger context of safely restarting a service behind LVS without relying on pybal being very quick to depool servers, we will need to do something along the following lines:

  1. Check the health of the pool from both load balancers
  2. If the status is consistent and health is OK, depool the server from LVS
  3. If the command was successful, check the status of the server on both LVS hosts; otherwise fail (and possibly try to repool?)
  4. If the status is not pooled, restart the service
  5. Once the service is restarted, and after a configurable amount of time, repool the server
  6. If successful, check the status of the server on both LVS hosts
  7. Retry for N attempts every M seconds (both configurable) to check that it is consistently pooled
  8. If not, fail

Please note that every one of the steps above can be refined in many ways, but this is roughly what I'd say is needed for a "safe depool" restart.
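The steps above can be sketched as one procedure. Everything here is hedged: the callables (`lvs_status`, `depool`, `repool`, `restart_service`) are hypothetical placeholders for the real LVS/conftool interactions, not an actual API, and the default timings are illustrative.

```python
import time

def safe_restart(lvs_status, depool, repool, restart_service,
                 grace=30.0, attempts=5, interval=10.0):
    """Restart a service behind LVS with depool/repool verification.

    lvs_status() returns the pooled state as seen by each load balancer,
    e.g. {"lvs1": "yes", "lvs2": "yes"}; depool()/repool() return True on
    success. grace, attempts and interval are the configurable timings.
    """
    # Steps 1-2: pool status must be consistent and healthy on both LBs.
    status = lvs_status()
    if len(set(status.values())) != 1 or "yes" not in status.values():
        raise RuntimeError("inconsistent or unhealthy pool: %r" % status)
    # Steps 2-3: depool, then verify on both LVS hosts.
    if not depool():
        repool()  # best effort: try to restore the previous state
        raise RuntimeError("depool command failed")
    if any(v != "no" for v in lvs_status().values()):
        raise RuntimeError("server still pooled after depool")
    # Steps 4-5: restart, wait a configurable grace period, then repool.
    restart_service()
    time.sleep(grace)
    if not repool():
        raise RuntimeError("repool command failed")
    # Steps 6-8: retry N times every M seconds until consistently pooled.
    for _ in range(attempts):
        if all(v == "yes" for v in lvs_status().values()):
            return True
        time.sleep(interval)
    raise RuntimeError("server not consistently pooled after repool")
```

Injecting the commands as callables keeps the control flow (the part this ticket is about) testable independently of the actual LVS plumbing.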

Joe moved this task from Doing to Blocking others on the User-Joe board. · Oct 11 2016, 4:33 PM
Joe moved this task from Blocking others to Doing on the User-Joe board. · Oct 12 2016, 2:54 PM

Change 310454 merged by Giuseppe Lavagetto:
Conftool: Create script that checks the state after (de)pooling

https://gerrit.wikimedia.org/r/310454

Joe closed this task as Resolved. · Oct 14 2016, 9:41 AM

Change 315936 had a related patch set uploaded (by Mobrovac):
Parsoid: Install the conftool service scripts

https://gerrit.wikimedia.org/r/315936

Change 315936 merged by Giuseppe Lavagetto:
Parsoid: Install the conftool service scripts

https://gerrit.wikimedia.org/r/315936

However, on certain occasions the hosts are not actually (r|d)epooled, even though the scripts exit with exit code 0. This effect has been observed during Parsoid deploys. ... The specifics of why this is occurring need to be investigated and fixed.

From T149115,

Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...
ERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out
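A kvstore timeout like the one above is transient, which suggests retrying the write before giving up. A hedged sketch of such a retry wrapper, with the operation injected as a callable (names and timings are illustrative, not conftool's actual behavior):

```python
import time

def retry(op, attempts=3, delay=1.0):
    """Call op() until it succeeds, retrying transient failures.

    op() should raise on failure (e.g. an etcd request timeout); the
    last exception is re-raised once all attempts are exhausted.
    """
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```

Combined with post-write state verification, this would cover both failure modes seen here: writes that error out and writes that appear to succeed but do not take effect.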