PyBal Feature: progressive depooling strategy for monitored failures
Open, Medium, Public


PyBal's health monitoring is intentionally aggressive and sensitive: it tries to detect and respond to failures as quickly as possible, to reduce the window of impact. Today, any monitor failure immediately deletes the backend. We'd like to tone that down a little by adding support for weight=0 -style depooling as a first step, and deferring backend deletion until we have more confirmation of failure. There are also some nits to think through about multiple concurrent monitors of different types on the same service. Perhaps a strategy something like:

  1. When a single monitor fails once, set weight=0 (the backend recovers to full weight if the next check succeeds)
  2. When the same single monitor fails a second time, delete the backend (or, debatably, wait for a 3rd failure? It depends on timings, too).
  3. In the case of multiple monitors:
    1. So long as any one monitor's most-recent check was healthy, follow the rules above per-monitor (we'd have to see 2+ consecutive failures of at least one of the monitors before deletion)
    2. If all monitors' most-recent checks of this service have failed, switch to deletion early (so if there are 3 distinct monitors and they all trip in a short time window before the first of them has a chance to check a second time, we consider that just as significant as a single monitor failing more than once in a row).
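The rules above can be sketched as a small policy class. This is only an illustrative sketch, not PyBal's actual monitor/coordinator API: the class name, `on_success`/`on_failure` methods, and the `delete_after` threshold are all hypothetical.

```python
from enum import Enum

class Action(Enum):
    NONE = "none"                   # healthy: keep (or restore) full weight
    DEPOOL = "weight=0"             # first failure: depool via weight=0
    DELETE = "delete"               # confirmed failure: delete the backend

class DepoolPolicy:
    """Hypothetical per-backend policy tracking consecutive failures per monitor."""

    def __init__(self, monitors, delete_after=2):
        # delete_after: consecutive failures of one monitor before deletion
        # (the "2nd vs 3rd failure" debate above is just this knob)
        self.delete_after = delete_after
        self.failures = {m: 0 for m in monitors}

    def on_success(self, monitor):
        # Rule 1 (recovery): a healthy check resets this monitor's streak.
        self.failures[monitor] = 0
        return Action.NONE

    def on_failure(self, monitor):
        self.failures[monitor] += 1
        # Rule 2: the same monitor failed delete_after times in a row.
        if self.failures[monitor] >= self.delete_after:
            return Action.DELETE
        # Rule 3.2: every monitor's most recent check has failed.
        if len(self.failures) > 1 and all(c > 0 for c in self.failures.values()):
            return Action.DELETE
        # Rules 1 / 3.1: otherwise just depool with weight=0.
        return Action.DEPOOL
```

For example, with three monitors that all trip once in quick succession, the third failure already returns `DELETE`, matching rule 3.2, while a lone monitor's first failure only depools.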

Event Timeline

It's also interesting to consider progressively scaling the weight. For example, you could make the strategy configurable such that the first failure sets weight=configured_weight*0.5, the next sets weight=0, and the next deletes. However, the way weighting is handled in the sh scheduler for the public services is not ideal (excess churn due to the lack of true consistent hashing), so it's probably best to avoid staging through smaller weight shifts until some future time when we have a proper consistent-hashing IPVS scheduler.
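That configurable ladder could look something like the sketch below. The step multipliers and function name are assumptions for illustration, not an existing PyBal config option.

```python
# Hypothetical ladder of weight multipliers applied on consecutive failures;
# once the ladder is exhausted, the backend is deleted instead.
WEIGHT_STEPS = (0.5, 0.0)

def weight_after_failures(configured_weight, consecutive_failures,
                          steps=WEIGHT_STEPS):
    """Return the new integer weight, or None to signal backend deletion."""
    if consecutive_failures <= 0:
        return configured_weight            # healthy: full configured weight
    if consecutive_failures <= len(steps):
        # Step down the ladder: 1st failure -> 0.5x, 2nd failure -> weight=0.
        return int(configured_weight * steps[consecutive_failures - 1])
    return None                             # 3rd failure: delete the backend
```

So with configured_weight=10 the sequence of failures yields 5, then 0, then deletion; the per-step multipliers (and the ladder length) would be the configurable part.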

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!