
pybal's "can-depool" logic only takes downServers into account
Open, High, Public

Description

When a server is detected as down, PyBal tries to depool it. Before doing so however, PyBal checks whether the operation can be performed. Every service has a configurable depool-threshold limiting the number of servers that can be depooled. The current "can-depool" logic looks like this:

return len(self.servers) - len(downServers) >= len(self.servers) * self.lvsservice.getDepoolThreshold()

If the above condition is true, the server can be depooled. However, the logic only takes servers marked as "down" into account and fails to consider administratively disabled servers.

Let's consider the following scenario:

depool-threshold: 0.5
total-servers:    4
down-servers:     0
disabled-servers: 2

In the given situation, canDepool currently returns true (4 - 0 >= 4 * 0.5), allowing the depool operation to proceed and leaving the service with only one pooled server. In this scenario we instead want a minimum of 2 servers to always be enabled and pooled. To enforce such a constraint, the logic needs to be changed as follows:

return len(self.servers) - len(downServers) - len(disabledServers) >= len(self.servers) * self.lvsservice.getDepoolThreshold()
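
To make the arithmetic concrete, here is a minimal standalone sketch (plain Python, not PyBal code) of both conditions. It assumes, purely for illustration, that the server being depooled has already been counted among the down servers at the moment the check runs:

def can_depool(total, down, disabled, threshold, count_disabled):
    # Number of servers still available to serve traffic.
    available = total - down
    if count_disabled:
        available -= disabled
    # The pool may never drop below total * threshold servers.
    return available >= total * threshold

# Scenario from the table above, at the moment one of the two remaining
# enabled-and-up servers has just been marked down (assumption for this example):
print(can_depool(4, 1, 2, 0.5, count_disabled=False))  # True:  4 - 1 = 3 >= 2
print(can_depool(4, 1, 2, 0.5, count_disabled=True))   # False: 4 - 1 - 2 = 1 < 2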

Event Timeline

ema created this task. · Jan 11 2018, 1:18 PM
Restricted Application added a subscriber: Aklapper. · Jan 11 2018, 1:18 PM
ema triaged this task as High priority. · Jan 11 2018, 1:19 PM
ema added a subscriber: fgiunchedi.
ema edited projects, added Pybal; removed Puppet. · Jan 11 2018, 2:03 PM
ema moved this task from Backlog to In Progress on the Pybal board. · Jan 11 2018, 3:11 PM

Change 403677 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@master] Use pooled-and-up servers in can-depool logic

https://gerrit.wikimedia.org/r/403677

Joe added a subscriber: Joe. · Jan 12 2018, 10:03 AM

I don't think this is entirely correct.

Pybal has 3 states:

  1. enabled/disabled: the logical state in etcd, so pooled=yes or pooled=no; servers with pooled=inactive are removed from pybal completely
  2. up/down: the state according to pybal's monitoring
  3. pooled/not pooled: the status of the server in the ipvs pool

The problem is that what we want to achieve is "have at least N% of servers in the pool serving traffic", so we're interested in the pooled/depooled state, not in the up/down state. When we want to depool a server, either logically or because of monitoring, we should first check that enough servers remain pooled to allow depooling the current machine.

The problem is tricky to tackle, though, in the case where we logically depool a server and another server that went down prevents it from being depooled: will pybal later recheck and depool it? I'm not sure.

ema added a comment. · Jan 12 2018, 11:18 AM

> pooled/not pooled: the status of the server in the ipvs pool

As we've found out today, pooled is actually defined as enabled and up. This should translate to "pooled in ipvs", but there's no guarantee about that.
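
Put differently, the relationship between the three states looks roughly like this (illustration only, using the attribute names from this task, not pybal's actual Server class):

class ServerState(object):
    """Sketch of the three per-server states discussed above."""
    def __init__(self, enabled, up):
        self.enabled = enabled      # 1. logical state from etcd (pooled=yes/no)
        self.up = up                # 2. state according to pybal's monitoring
        # 3. what pybal wants in the ipvs pool; whether ipvs actually
        #    reflects this at any given moment is not guaranteed.
        self.pool = enabled and up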

ema moved this task from Triage to LoadBalancer on the Traffic board. · Jan 15 2018, 10:16 AM

Change 403677 merged by Ema:
[operations/debs/pybal@master] Use up-and-enabled servers in can-depool logic

https://gerrit.wikimedia.org/r/403677

Change 404680 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/pybal@1.14] Use up-and-enabled servers in can-depool logic

https://gerrit.wikimedia.org/r/404680

Change 404680 merged by Ema:
[operations/debs/pybal@1.14] Use up-and-enabled servers in can-depool logic

https://gerrit.wikimedia.org/r/404680

Mentioned in SAL (#wikimedia-operations) [2018-01-17T16:52:43Z] <ema> upgrade secondary LVSs to pybal 1.13.4 T184715, T184721

Mentioned in SAL (#wikimedia-operations) [2018-01-17T17:06:10Z] <ema> upgrade pybal on primary LVSs to 1.14.3 T184715, T184721

ema closed this task as Resolved. · Jan 17 2018, 5:20 PM
ema claimed this task.
Aklapper assigned this task to ema. · Jun 15 2018, 1:07 PM
Joe reopened this task as Open. · Jun 27 2018, 1:57 PM
mark added a subscriber: mark. · Jun 27 2018, 1:58 PM
Vgutierrez added a comment (edited). · Jul 4 2018, 4:43 PM

The incident described in https://wikitech.wikimedia.org/wiki/Incident_documentation/20180626-LoadBalancers is not related to a bug in the canDepool logic itself, but to the fact that canDepool() is not checked when a new configuration is received through EtcdConfigurationObserver.

These are the steps taken when a new configuration is received by EtcdConfigurationObserver (sketched in code after the list):

  1. EtcdConfigurationObserver calls coordinator.onConfigUpdate() and hands the newConfig to the coordinator
  2. The coordinator merges the config for each existing server, adds new servers and removes the ones that are not present in the newly received configuration. For the existing servers, coordinator.refreshModifiedServers refreshes the server.pool value using server.pool = server.enabled and server.up
  3. After any potential new servers are initialized, coordinator._serverInitDone is called.
  4. coordinator._serverInitDone calls coordinator.assignServers to send the new config to IPVS
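
Roughly, the four steps above can be paraphrased in code as follows (a sketch only: method signatures and bodies are simplified, and the merging details are omitted; this is not pybal's actual source):

class Coordinator(object):
    def onConfigUpdate(self, newConfig):
        # 1-2. Merge the new config into existing servers, add new ones,
        #      drop servers missing from newConfig (details omitted).
        self.refreshModifiedServers(newConfig)
        # 3. Once any new servers are initialized:
        self._serverInitDone()

    def refreshModifiedServers(self, newConfig):
        for server in self.servers.itervalues():
            # ... merge the server's new settings (details omitted) ...
            server.pool = server.enabled and server.up   # note: no canDepool check here

    def _serverInitDone(self):
        # 4. Push the resulting server set to IPVS.
        self.assignServers()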

The bug is inside coordinator.assignServers:

coordinator.assignServers
# Hand over enabled servers to LVSService
self.lvsservice.assignServers(
    set([server for server in self.servers.itervalues() if server.pool]))

coordinator.assignServers only hands the servers that are marked as pooled (enabled and up) in the new configuration over to LVSService.assignServers, so the depooled servers get removed from IPVS in LVSService.assignServers:

LVSService.assignServers
def assignServers(self, newServers):
    cmdList = (
        # add servers that are newly pooled
        [self.ipvsManager.commandAddServer(self.service(), server)
         for server in newServers - self.servers] +
        # edit servers that stay pooled
        [self.ipvsManager.commandEditServer(self.service(), server)
         for server in newServers & self.servers] +
        # remove every server absent from newServers, unconditionally
        [self.ipvsManager.commandRemoveServer(self.service(), server)
         for server in self.servers - newServers]
    )

The culprit is the third command list, which removes every server that is not present in the newServers parameter. That effectively depools all servers marked as enabled = False in etcd without honoring the canDepool logic.
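
One naive way to honor the threshold at config-application time would be to re-add some of the to-be-removed servers until the minimum is met again. A hypothetical sketch of that idea (illustration only, not the actual change below, and with the caveats discussed later in this task):

def serversToAssign(allServers, lvsservice):
    """Hypothetical helper: pick the server set to hand to IPVS while
    honoring the depool threshold (sketch, not pybal code)."""
    pooled = set(s for s in allServers if s.pool)
    depooled = set(s for s in allServers if not s.pool)
    minimum = len(allServers) * lvsservice.getDepoolThreshold()
    # Instead of removing every depooled server unconditionally, keep
    # (arbitrary) ones until the threshold is satisfied again.
    while len(pooled) < minimum and depooled:
        pooled.add(depooled.pop())
    return pooled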

Change 443967 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/pybal@master] Ensure that depool threshold is being honored on new/updated configs

https://gerrit.wikimedia.org/r/443967

mark added a comment. · Jul 5 2018, 4:40 PM

Nice work! :)

We should, however, think through the semantics of these boolean variables. I believe your current patchset violates some of the invariants set in server.py:

  1. P0: pool => enabled /\ ready
  2. P1: up => pool \/ !enabled \/ !ready
  3. P2: pool => up \/ !canDepool

Specifically, once the number of "up" servers does not meet the threshold, pybal will pick arbitrary servers to pool that are not both enabled and up, violating P0.

The intention of this ticket seems to be to make sure Pybal always keeps enough servers pooled, whether or not there are enough enabled servers. However, the feature was originally intended to guard against there not being enough up servers, while disabled servers would never be pooled under any circumstances. (Rumor has it that they were pooled in the past, but I haven't seen evidence of it yet, and if they were, that may have been a bug.) We may of course decide to change that, but it may cause issues in other places if we don't think it through.
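
For reference, the three invariants translate into the following assertions (a plain-Python sketch assuming the boolean attributes used above, with canDepool passed in separately; this is not the test code added below):

def checkInvariants(server, canDepool):
    # P0: pool => enabled /\ ready
    assert not server.pool or (server.enabled and server.ready)
    # P1: up => pool \/ !enabled \/ !ready
    assert not server.up or server.pool or not server.enabled or not server.ready
    # P2: pool => up \/ !canDepool
    assert not server.pool or server.up or not canDepool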

mark added a comment. · Jul 11 2018, 12:51 PM

We had a long and interesting discussion about this on IRC.

Summary:

  1. Pybal has so far only attempted to guard against its own monitoring decisions; it does not attempt to guard against bad external input (e.g. servers set as disabled in etcd, in file-based config server lists, or in k8s).
  2. While we could have Pybal enforce thresholds for externally/etcd-disabled servers as well, that by itself will not influence the behavior of other (deployment) tools like scap without additional modifications there.
  3. Pybal currently has no way to reject a server depooling, other than rather passively choosing not to act on it when certain thresholds are exceeded (e.g. verbosely rejecting the latest etcd config) and/or emitting warnings through monitoring. External tools like conftool could, however, monitor Pybal's behavior using Pybal's API.
  4. This is somewhat complicated by the fact that we have multiple Pybal instances running, and determining the currently active one is not trivial. Alternatively, all known (configured) Pybal instances could be tracked.

Various mid-to-long-term plans for Pybal and environment were also discussed on IRC.

Change 445207 had a related patch set uploaded (by Mark Bergsma; owner: Mark Bergsma):
[operations/debs/pybal@master] Test Server invariants

https://gerrit.wikimedia.org/r/445207

Change 447769 had a related patch set uploaded (by Mark Bergsma; owner: Mark Bergsma):
[operations/debs/pybal@master] Don't depool pooledDownServers in refreshPreexistingServers

https://gerrit.wikimedia.org/r/447769

Change 445207 merged by jenkins-bot:
[operations/debs/pybal@master] Test Server invariants

https://gerrit.wikimedia.org/r/445207

Change 447769 merged by jenkins-bot:
[operations/debs/pybal@master] Don't depool pooledDownServers in refreshPreexistingServers

https://gerrit.wikimedia.org/r/447769

Change 443967 merged by jenkins-bot:
[operations/debs/pybal@master] Ensure that depool threshold is being honored on new/updated configs

https://gerrit.wikimedia.org/r/443967