
Depool / repool scripts execute successfully even when the host has not been (r|d)epooled
Closed, ResolvedPublic

Description

The depool / pool conftool scripts are used during Scap3 service deploys to depool and repool the target hosts before and after a service restart, respectively. However, on certain occasions the hosts are not actually (r|d)epooled, even though the scripts exit with exit code 0. This effect has been observed during Parsoid deploys. Specifically, some hosts were not repooled by Pybal, and when a second depool command was issued, the script reported that the host's state had changed from pooled=no to pooled=yes. The specifics of why this is occurring need to be investigated and fixed.

In the short term, we need to find a way to reliably check if a host has been (r|d)epooled and act accordingly (wait, force the command one more time, etc).
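As a short-term stopgap, a wrapper along these lines could issue the command and then verify the observed state, re-issuing the write if it did not take. This is a hedged sketch: `set_pooled` and `get_pooled` are hypothetical stand-ins for the actual confctl calls, and the timings are illustrative.

```python
import time

def change_and_verify(set_pooled, get_pooled, want, attempts=3, delay=2.0):
    """Set the pooled state, then poll until the observed state matches.

    set_pooled(want) issues the (de)pool command; get_pooled() reads the
    current state back. Returns True only once the state is confirmed,
    False if it never converges within the given attempts.
    """
    set_pooled(want)
    for _ in range(attempts):
        if get_pooled() == want:
            return True
        time.sleep(delay)
        # The write may have been lost: force the command one more time.
        set_pooled(want)
    return False
```

The key point is that the caller's exit code depends on the verified state, not on whether the command merely ran.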

Event Timeline

mobrovac created this task. · Sep 13 2016, 2:48 PM
Restricted Application added a subscriber: Aklapper. · Sep 13 2016, 2:48 PM
mobrovac updated the task description. · Sep 13 2016, 2:52 PM

Change 310454 had a related patch set uploaded (by Mobrovac):
Conftool: Create script that checks the state after (de)pooling

https://gerrit.wikimedia.org/r/310454

Dzahn triaged this task as High priority. · Sep 22 2016, 2:32 AM
Joe added a comment. · Oct 6 2016, 3:35 PM

Once T147480 is resolved, this ticket will be partially solved: at the very least, pool/depool will exit with a non-zero exit code if writing to etcd fails for any reason.
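With that change in place, callers can stop assuming success and abort the deploy on a non-zero exit. A minimal sketch, assuming only that the (de)pool command reports failure via its exit code; the wrapper itself is illustrative, not the actual scap integration:

```python
import subprocess
import sys

def run_or_abort(cmd):
    """Run a (de)pool command and abort the deploy if it exits non-zero."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print("command %r failed with exit code %d, aborting deploy"
              % (cmd, result.returncode), file=sys.stderr)
        sys.exit(result.returncode)
    return result.returncode
```

Note this only catches failures the tool itself detects (such as a failed etcd write); it does not verify that pybal actually acted on the change.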

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2016-10-06T15:41:21Z] <_joe_> upgrading conftool to 0.3.1 on all mw*, wtp* servers, T147480 T145518

Joe moved this task from Backlog to Blocking others on the User-Joe board.
Joe moved this task from Blocking others to Doing on the User-Joe board. · Oct 7 2016, 10:05 AM
Joe added a comment. · Oct 7 2016, 10:13 AM

In the larger context of safely restarting a service behind LVS without relying on pybal being very quick to depool servers, we will need to do something along the following lines:

  1. Check the health of the pool from both load balancers
  2. If the status is consistent and health is OK, depool the server from LVS
  3. If the command was successful, check the status of the server on both LVS hosts; otherwise fail (and possibly try to repool?)
  4. If the status is not pooled, restart the service
  5. Once the service is restarted, and after a configurable amount of time, repool the server
  6. If successful, check the status of the server on both LVS hosts
  7. Retry for N attempts every M seconds (both configurable) to check that it is consistently pooled
  8. If not, fail

Please note that every one of the steps above can be refined in many ways, but this is roughly what I'd say is needed for a "safe depool" restart.
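The steps above can be sketched as one procedure. Everything here is hedged: the callables (`lvs_status`, `depool`, `repool`, `restart_service`) are hypothetical placeholders for the real LVS/conftool interactions, not an actual API, and the default timings are illustrative.

```python
import time

def safe_restart(lvs_status, depool, repool, restart_service,
                 grace=30.0, attempts=5, interval=10.0):
    """Restart a service behind LVS with depool/repool verification.

    lvs_status() returns the pooled state as seen by each load balancer,
    e.g. {"lvs1": "yes", "lvs2": "yes"}; depool()/repool() return True on
    success. grace, attempts and interval are the configurable timings.
    """
    # Steps 1-2: pool status must be consistent and healthy on both LBs.
    status = lvs_status()
    if len(set(status.values())) != 1 or "yes" not in status.values():
        raise RuntimeError("inconsistent or unhealthy pool: %r" % status)
    # Steps 2-3: depool, then verify on both LVS hosts.
    if not depool():
        repool()  # best effort: try to restore the previous state
        raise RuntimeError("depool command failed")
    if any(v != "no" for v in lvs_status().values()):
        raise RuntimeError("server still pooled after depool")
    # Steps 4-5: restart, wait a configurable grace period, then repool.
    restart_service()
    time.sleep(grace)
    if not repool():
        raise RuntimeError("repool command failed")
    # Steps 6-8: retry N times every M seconds until consistently pooled.
    for _ in range(attempts):
        if all(v == "yes" for v in lvs_status().values()):
            return True
        time.sleep(interval)
    raise RuntimeError("server not consistently pooled after repool")
```

Injecting the commands as callables keeps the control flow (the part this ticket is about) testable independently of the actual LVS plumbing.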

Joe moved this task from Doing to Blocking others on the User-Joe board. · Oct 11 2016, 4:33 PM
Joe moved this task from Blocking others to Doing on the User-Joe board. · Oct 12 2016, 2:54 PM

Change 310454 merged by Giuseppe Lavagetto:
Conftool: Create script that checks the state after (de)pooling

https://gerrit.wikimedia.org/r/310454

Joe closed this task as Resolved. · Oct 14 2016, 9:41 AM

Change 315936 had a related patch set uploaded (by Mobrovac):
Parsoid: Install the conftool service scripts

https://gerrit.wikimedia.org/r/315936

Change 315936 merged by Giuseppe Lavagetto:
Parsoid: Install the conftool service scripts

https://gerrit.wikimedia.org/r/315936

However, on certain occasions the hosts are not actually (r|d)epooled, even though the scripts exit with exit code 0. This effect has been observed during Parsoid deploys. ... The specifics of why this is occurring need to be investigated and fixed.

From T149115,

Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...
ERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out
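A kvstore timeout like the one above is transient, which suggests retrying the write before giving up. A hedged sketch of such a retry wrapper, with the operation injected as a callable (names and timings are illustrative, not conftool's actual behavior):

```python
import time

def retry(op, attempts=3, delay=1.0):
    """Call op() until it succeeds, retrying transient failures.

    op() should raise on failure (e.g. an etcd request timeout); the
    last exception is re-raised once all attempts are exhausted.
    """
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```

Combined with post-write state verification, this would cover both failure modes seen here: writes that error out and writes that appear to succeed but do not take effect.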