
pybal's "can-depool" logic only takes downServers into account
Open, Stalled, High, Public

Description

When a server is detected as down, PyBal tries to depool it. Before doing so however, PyBal checks whether the operation can be performed. Every service has a configurable depool-threshold limiting the number of servers that can be depooled. The current "can-depool" logic looks like this:

return len(self.servers) - len(downServers) >= len(self.servers) * self.lvsservice.getDepoolThreshold()

If the above condition is true, the server can be depooled. However, the logic currently only takes servers marked as "down" into account, failing to consider servers administratively disabled.

Let's consider the following scenario:

depool-threshold: 0.5
total-servers: 4
down-servers: 0
disabled-servers: 2

In the given situation, canDepool currently returns true (4 - 0 >= 4 * 0.5), allowing the depool operation to proceed and leaving the service with only one pooled server. In this scenario we instead want a minimum of 2 servers to always be enabled and pooled. To enforce such a constraint, the logic needs to be changed as follows:

return len(self.servers) - len(downServers) - len(disabledServers) >= len(self.servers) * self.lvsservice.getDepoolThreshold()
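
As an illustration, here is a minimal sketch of the old and proposed checks applied to the scenario above (the function and argument names are placeholders, not pybal's actual code; it assumes the server being depooled is already counted among the down servers when the check runs):

# Minimal sketch; names are placeholders, not pybal's actual attributes.
def can_depool(n_servers, n_down, n_disabled, threshold, count_disabled):
    available = n_servers - n_down - (n_disabled if count_disabled else 0)
    return available >= n_servers * threshold

# 4 servers, 2 administratively disabled, threshold 0.5, one server just went down.
# Old logic ignores the disabled servers and still allows the depool:
can_depool(4, n_down=1, n_disabled=2, threshold=0.5, count_disabled=False)  # True: 3 >= 2
# Proposed logic also subtracts the disabled servers and refuses it,
# keeping at least 2 servers enabled and pooled:
can_depool(4, n_down=1, n_disabled=2, threshold=0.5, count_disabled=True)   # False: 1 < 2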

Event Timeline

ema triaged this task as High priority. Jan 11 2018, 1:19 PM
ema added a subscriber: fgiunchedi.

Change 403677 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] Use pooled-and-up servers in can-depool logic

https://gerrit.wikimedia.org/r/403677

I don't think this is entirely correct.

Pybal has 3 states:

  1. enabled/disabled: the logical state in etcd, so pooled=yes or pooled=no; servers with pooled=inactive are removed from pybal completely
  2. up/down: the state according to pybal's monitoring
  3. pooled/not pooled: the status of the server in the ipvs pool

So the problem is that what we want to achieve is "have at least N% of servers in the pool serving traffic", which means we're interested in the pooled/depooled state, not in the up/down state. When we want to depool a server, whether logically or because of monitoring, we should therefore first check that we have enough pooled servers to be able to depool the current machine.

The problem is tricky to tackle, though, in the case where we're logically depooling a server in a pool and another server that went down prevents it from being depooled: will pybal later recheck and depool it? I'm not sure.
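
A rough sketch of the kind of check described above, based on the pooled count rather than the up/down state (a hypothetical helper, not pybal's actual API, under one possible reading of "enough pooled servers"):

# Hypothetical pooled-count check, not pybal's code: allow a depool only if,
# after removing this server, the ipvs pool still meets the threshold.
def can_depool_from_pool(pooled, total, threshold):
    return len(pooled) - 1 >= len(total) * threshold

# 4 configured servers, only 2 currently pooled, threshold 0.5: depooling one
# more would leave a single pooled server, so it is refused regardless of
# whether the others are up or down.
can_depool_from_pool(pooled={"a", "b"}, total={"a", "b", "c", "d"}, threshold=0.5)  # False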

pooled/not pooled: the status of the server in the ipvs pool

As we've found out today, pooled is actually defined as enabled and up. This should translate to "pooled in ipvs", but there's no guarantee about that.

Change 403677 merged by Ema:
[operations/debs/pybal@master] Use up-and-enabled servers in can-depool logic

https://gerrit.wikimedia.org/r/403677

Change 404680 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] Use up-and-enabled servers in can-depool logic

https://gerrit.wikimedia.org/r/404680

Change 404680 merged by Ema:
[operations/debs/pybal@1.14] Use up-and-enabled servers in can-depool logic

https://gerrit.wikimedia.org/r/404680

Mentioned in SAL (#wikimedia-operations) [2018-01-17T16:52:43Z] <ema> upgrade secondary LVSs to pybal 1.13.4 T184715, T184721

Mentioned in SAL (#wikimedia-operations) [2018-01-17T17:06:10Z] <ema> upgrade pybal on primary LVSs to 1.14.3 T184715, T184721

ema claimed this task.

The incident described in https://wikitech.wikimedia.org/wiki/Incident_documentation/20180626-LoadBalancers is not related to a bug in the canDepool logic itself, but to the fact that canDepool() is not checked when a new configuration is received through EtcdConfigurationObserver.

These are the steps performed when a new configuration is received by EtcdConfigurationObserver:

  1. EtcdConfigurationObserver calls coordinator.onConfigUpdate() and hands the newConfig to the coordinator.
  2. The coordinator merges the config for each existing server, adds new servers and removes the ones that are not present in the newly received configuration. For the existing servers, coordinator.refreshModifiedServers refreshes the server.pool value, using server.pool = server.enabled and server.up (see the sketch after this list).
  3. After any potential new servers are initialized, coordinator._serverInitDone is called.
  4. coordinator._serverInitDone calls coordinator.assignServers to send the new config to IPVS.
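
As a self-contained sketch of step 2 (illustrative only; FakeServer and the config layout are made up and not pybal's classes):

# Illustrative only: step 2 recomputes the pooled flag from the merged
# configuration without consulting canDepool().
class FakeServer(object):
    def __init__(self, host, enabled, up):
        self.host, self.enabled, self.up = host, enabled, up
        self.pool = enabled and up

servers = {h: FakeServer(h, enabled=True, up=True) for h in ("a", "b", "c", "d")}
new_config = {"a": {"enabled": False}, "b": {"enabled": False},
              "c": {"enabled": True}, "d": {"enabled": True}}

for host, server in servers.items():
    server.enabled = new_config[host]["enabled"]    # merge the value received from etcd
    server.pool = server.enabled and server.up      # pooled means "enabled and up"

print(sorted(h for h, s in servers.items() if s.pool))  # ['c', 'd']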

The bug is inside coordinator.assignServers:

coordinator.assignServers
# Hand over enabled servers to LVSService
self.lvsservice.assignServers(
    set([server for server in self.servers.itervalues() if server.pool]))

assignServers only hands LVSService.assignServers the servers that are marked as pooled (enabled and up) in the new configuration. So the depooled servers are going to get removed from IPVS in LVSService.assignServers:

LVSService.assignServers
def assignServers(self, newServers):
    cmdList = (
        # add servers that are newly pooled
        [self.ipvsManager.commandAddServer(self.service(), server)
         for server in newServers - self.servers] +
        # edit servers that remain pooled
        [self.ipvsManager.commandEditServer(self.service(), server)
         for server in newServers & self.servers] +
        # remove servers that are no longer in newServers
        [self.ipvsManager.commandRemoveServer(self.service(), server)
         for server in self.servers - newServers]
    )

The culprit is the third command, which removes the servers that aren't present in the newServers parameter. That effectively depools all servers marked as enabled = False in etcd without honoring the canDepool logic.
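
For illustration only (this is not the code of the change uploaded below), a guard honoring the threshold at this point could look roughly like the following:

# Purely illustrative: filter the servers scheduled for removal through a
# depool-threshold check so the ipvs pool never shrinks below the configured
# fraction of the total.
def allowed_removals(pooled, to_remove, total, threshold):
    remaining = len(pooled)
    allowed = set()
    for server in to_remove & pooled:
        # Only remove this server if enough servers stay pooled afterwards.
        if remaining - 1 >= total * threshold:
            allowed.add(server)
            remaining -= 1
    return allowed

# 4 pooled servers out of 4 total, threshold 0.5; a new config asks to remove
# 3 of them, but only 2 removals pass the check and 2 servers stay pooled.
print(len(allowed_removals({"a", "b", "c", "d"}, {"a", "b", "c"}, 4, 0.5)))  # 2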

Change 443967 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Ensure that depool threshold is being honored on new/updated configs

https://gerrit.wikimedia.org/r/443967

Nice work! :)

We should however think through the semantics of these boolean variables. I believe your current patchset violates some of the invariants set in server.py:

  1. P0: pool => enabled /\ ready
  2. P1: up => pool \/ !enabled \/ !ready
  3. P2: pool => up \/ !canDepool

Specifically, once the number of "up" servers does not meet the threshold, pybal will pick arbitrary servers to pool that are not both enabled and up, violating P0.
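
For concreteness, the three invariants can be written down as assertions (a sketch; Flags and check_invariants are hypothetical stand-ins, not pybal's Server class):

from collections import namedtuple

# Hypothetical stand-in for the relevant Server flags; not pybal's real class.
Flags = namedtuple("Flags", "pool enabled ready up")

def check_invariants(s, can_depool):
    # An implication a => b is written as `not a or b`.
    assert not s.pool or (s.enabled and s.ready)                  # P0
    assert not s.up or (s.pool or not s.enabled or not s.ready)   # P1
    assert not s.pool or (s.up or not can_depool)                 # P2

# The situation described above: a server that is not enabled gets pooled
# anyway, which breaks P0.
check_invariants(Flags(pool=True, enabled=False, ready=True, up=True),
                 can_depool=False)  # raises AssertionError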

The intention of this ticket seems to be to make sure Pybal always keeps enough servers pooled, whether there are enough enabled servers or not. However, the intention of that feature was originally to guard against having too few up servers, while disabled servers would never be pooled under any circumstances. (Rumor has it that it did happen in the past, but I haven't seen evidence of it yet, and if it did, that may have been a bug.) We may of course decide to change that, but this may cause issues in other places if we don't think it through.

We had a long and interesting discussion about this on IRC.

Summary:

  1. Pybal until now only attempts to guard against its own monitoring decisions, but does not attempt to guard against bad external input (i.e. setting servers as disabled in etcd, in file-based config server lists or k8s).
  2. While we could have Pybal enforce external/etcd disabled thresholds as well, that by itself will not influence behavior of other (deployment) tools like scap without additional modifications there.
  3. Pybal currently has no way to reject a server depooling, other than rather passively choosing not to act on it when certain thresholds are exceeded (e.g. verbosely rejecting the latest etcd config) and/or emitting warnings through monitoring. External tools like conftool could however monitor Pybal's behavior using Pybal's API.
  4. This is somewhat complicated by the fact that we have multiple Pybal instances running, and determining the currently active one is not trivial. Alternatively, all known (configured) Pybal instances could be tracked.

Various mid-to-long-term plans for Pybal and environment were also discussed on IRC.

Change 445207 had a related patch set uploaded (by Mark Bergsma; owner: Mark Bergsma):
[operations/debs/pybal@master] Test Server invariants

https://gerrit.wikimedia.org/r/445207

Change 447769 had a related patch set uploaded (by Mark Bergsma; owner: Mark Bergsma):
[operations/debs/pybal@master] Don't depool pooledDownServers in refreshPreexistingServers

https://gerrit.wikimedia.org/r/447769

Change 445207 merged by jenkins-bot:
[operations/debs/pybal@master] Test Server invariants

https://gerrit.wikimedia.org/r/445207

Change 447769 merged by jenkins-bot:
[operations/debs/pybal@master] Don't depool pooledDownServers in refreshPreexistingServers

https://gerrit.wikimedia.org/r/447769

Change 443967 merged by jenkins-bot:
[operations/debs/pybal@master] Ensure that depool threshold is being honored on new/updated configs

https://gerrit.wikimedia.org/r/443967

@ema: All related patches in Gerrit have been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

Vgutierrez changed the task status from Open to Stalled. Oct 16 2020, 10:41 AM

This hasn't been backported to the 1.15 branch, so it has never been deployed in production; I'd keep the task open.

BBlack added a subscriber: BBlack.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

Resetting inactive assignee