
Depool / repool scripts execute successfully even when the host has not been (r|d)epooled
Closed, Resolved, Public

Description

The depool / pool conftool scripts are used during Scap3 service deploys to depool and repool the target hosts before and after a service restart, respectively. However, on some occasions the hosts are not actually (r|d)epooled, even though the scripts exit with code 0. This has been observed during Parsoid deploys. Specifically, some hosts were not repooled by PyBal, and when a second repool command was issued, the script reported that the host's state had changed from pooled=no to pooled=yes. Why this happens needs to be investigated and fixed.

In the short term, we need a way to reliably check whether a host has actually been (r|d)epooled and act accordingly (wait, force the command one more time, etc.).
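
As an illustration of the kind of check this needs (not the actual script from the patch mentioned below), here is a minimal Python sketch that runs the pool/depool command and then polls conftool until the desired state is visible. It assumes `confctl select ... get` prints one JSON object per matched node containing a "pooled" field, and uses a hypothetical Parsoid selector; verify both against the deployed conftool version.

```
import json
import subprocess
import sys
import time


def current_state(selector):
    """Return the 'pooled' value conftool reports for `selector`."""
    out = subprocess.check_output(
        ['confctl', 'select', selector, 'get'], text=True)
    # Each output line is assumed to be a JSON object keyed by node name
    # (plus a "tags" entry); we only look at the "pooled" field.
    for line in out.splitlines():
        obj = json.loads(line)
        for value in obj.values():
            if isinstance(value, dict) and 'pooled' in value:
                return value['pooled']
    raise RuntimeError('no pooled state found for %s' % selector)


def set_and_verify(action, selector, wanted, attempts=5, delay=3):
    """Run `pool`/`depool`, then poll until etcd reflects the change."""
    for attempt in range(1, attempts + 1):
        subprocess.check_call([action])   # raises if the script exits non-zero
        time.sleep(delay)                 # give the write time to propagate
        if current_state(selector) == wanted:
            return
        print('attempt %d/%d: state is not yet pooled=%s, retrying'
              % (attempt, attempts, wanted), file=sys.stderr)
    sys.exit('%s never converged to pooled=%s' % (action, wanted))


if __name__ == '__main__':
    # Hypothetical example: depool the local Parsoid backend and verify it.
    set_and_verify('depool', 'name=wtp1001.eqiad.wmnet,service=parsoid', 'no')
```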

Event Timeline

Change 310454 had a related patch set uploaded (by Mobrovac):
Conftool: Create script that checks the state after (de)pooling

https://gerrit.wikimedia.org/r/310454

Dzahn triaged this task as High priority. Sep 22 2016, 2:32 AM

Once T147480 is resolved, this ticket will be partly addressed: at a minimum, pool/depool will exit with a non-zero exit code if writing to etcd fails for any reason.
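
Once that lands, a deploy step only needs to honour the exit code instead of assuming success. A minimal sketch, assuming the `depool` wrapper is on PATH:

```
import subprocess
import sys

# Once depool/pool exit non-zero when the etcd write fails (T147480),
# propagating that exit code is enough to abort the deploy step instead
# of silently continuing with a host that was never depooled.
if subprocess.run(['depool']).returncode != 0:
    sys.exit('depool failed (etcd write error?); aborting this deploy step')
```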

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2016-10-06T15:41:21Z] <_joe_> upgrading conftool to 0.3.1 on all mw*, wtp* servers, T147480 T145518

Joe moved this task from Backlog to Blocking others on the User-Joe board.

In the larger context of safely restarting a service behind LVS without relying on PyBal being very quick at depooling servers, we will need to do something along the following lines:

  1. Check the health of the pool from both load balancers
  2. If the status is consistent and health is OK, depool the server from LVS
  3. If the command was successful, check the status of the server from both LVS; else fail (possibly try to repool?)
  4. If the status is not pooled, restart the service
  5. Once the service is restarted and after a configurable amount of time, repool the server
  6. If successful, check the status of the server on both LVS
  7. Retry for N attempts every M seconds (configurable) to check that it is consistently pooled
  8. If not, fail

Please note that every one of the steps above can be refined in many ways, but this is what I'd say is needed to have a "safe depool" restart; a rough sketch of the flow follows below.
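
For illustration only, here is a rough Python sketch of that loop, with many simplifications. The PyBal instrumentation URL and its output format (lines like "<host>: enabled/up/pooled" on port 9090), the load balancer hostnames, and the service/server names are assumptions, not something specified in this task:

```
import subprocess
import sys
import time
import urllib.request

LVS_HOSTS = ['lvs1001.example.wmnet', 'lvs1002.example.wmnet']  # hypothetical
SERVICE = 'parsoid'
SERVER = 'wtp1001.eqiad.wmnet'                                  # hypothetical


def lvs_pooled(lvs, service, server):
    """True if `server` shows up as pooled on this load balancer."""
    url = 'http://%s:9090/pools/%s' % (lvs, service)
    with urllib.request.urlopen(url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(server):
                return 'pooled' in line and 'not pooled' not in line
    return False


def state_everywhere(expected):
    """True if every load balancer agrees the server is (not) pooled."""
    return all(lvs_pooled(lvs, SERVICE, SERVER) == expected
               for lvs in LVS_HOSTS)


def wait_for(expected, attempts=10, delay=6):
    """Steps 3 and 6-8: poll both load balancers until they agree, or give up."""
    for _ in range(attempts):
        if state_everywhere(expected):
            return True
        time.sleep(delay)
    return False


def safe_restart(unit, grace=30):
    # Steps 1-2: only proceed if both load balancers currently see the
    # server as pooled (a simplification of the "health is OK" check).
    if not state_everywhere(True):
        sys.exit('inconsistent pool state, refusing to depool')
    # Steps 2-3: depool, then confirm the change took effect everywhere.
    subprocess.check_call(['depool'])
    if not wait_for(False):
        subprocess.call(['pool'])            # best-effort rollback
        sys.exit('server never left the pool, aborting')
    # Step 4: restart the service while no traffic is routed to it.
    subprocess.check_call(['systemctl', 'restart', unit])
    # Step 5: give the service a configurable warm-up period.
    time.sleep(grace)
    # Steps 5-8: repool and verify, retrying for a configurable window.
    subprocess.check_call(['pool'])
    if not wait_for(True):
        sys.exit('server did not come back into the pool')


if __name__ == '__main__':
    safe_restart('parsoid')
```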

Change 310454 merged by Giuseppe Lavagetto:
Conftool: Create script that checks the state after (de)pooling

https://gerrit.wikimedia.org/r/310454

Change 315936 had a related patch set uploaded (by Mobrovac):
Parsoid: Install the conftool service scripts

https://gerrit.wikimedia.org/r/315936

Change 315936 merged by Giuseppe Lavagetto:
Parsoid: Install the conftool service scripts

https://gerrit.wikimedia.org/r/315936

> However, on some occasions the hosts are not actually (r|d)epooled, even though the scripts exit with code 0. This has been observed during Parsoid deploys. ... Why this happens needs to be investigated and fixed.

From T149115:

Check 'repool' failed: Pooling wtp2017.codfw.wmnet from service=parsoid...
ERROR:conftool:Error when trying to set/pooled=yes on service=parsoid,name=wtp2017.codfw.wmnet
ERROR:conftool:Failure writing to the kvstore: Backend error: Raft Internal Error : etcdserver: request timed out
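
The error above is a transient etcd write timeout, so one possible short-term mitigation is to retry the (re)pool command with a backoff before failing the check. A minimal sketch, assuming pool/depool exit non-zero on write failures (T147480); the function name and retry parameters are placeholders:

```
import subprocess
import sys
import time


def pool_with_retries(command='pool', attempts=3, backoff=5):
    """Retry a pool/depool command on transient etcd write failures."""
    for attempt in range(1, attempts + 1):
        if subprocess.call([command]) == 0:
            return
        print('%s failed (attempt %d/%d), retrying in %ds'
              % (command, attempt, attempts, backoff * attempt),
              file=sys.stderr)
        time.sleep(backoff * attempt)   # simple linear backoff
    sys.exit('%s kept failing; etcd may be unavailable' % command)


if __name__ == '__main__':
    pool_with_retries()
```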