Pybal's health monitoring is intentionally aggressive and sensitive: it tries to detect and respond to failures as quickly as possible, in the name of reducing the window of impact. Today any monitor failure immediately deletes the backend. We'd like to tone that down a little by adding support for weight=0-style depooling as a first step, and deferring backend deletion until we have more confirmation of failure. There are also some nits to think through about multiple concurrent monitors of different types on the same service. Perhaps a strategy something like:
- When a single monitor fails once, set weight=0 (recover to full weight if the next check succeeds)
- When that same monitor fails a second time in a row, delete the backend (or, up for debate, wait for a 3rd failure? It kind of depends on timings, too).
- In the case of multiple monitors:
  - So long as any one monitor's most-recent check was healthy, follow the rules above per-monitor (we'll have to reach 2+ failures of at least one of the monitors before deletion)
  - If all monitors' most-recent checks of this service have failed, switch to deletion early (so if there are 3 distinct monitors and they all trip in a short time-window before the first of them has a chance to check a second time, we consider that just as significant as a single monitor failing more than once in a row).
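
To make the proposed rules concrete, here is a rough sketch of the decision logic in Python. This is not existing PyBal code: `Action`, `decide`, and `delete_threshold` are hypothetical names, and the sketch assumes each monitor tracks how many checks in a row have failed (0 meaning its most recent check was healthy).

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes for a backend; not PyBal's real API."""
    POOLED = "pooled"            # full weight
    WEIGHT_ZERO = "weight=0"     # soft depool, backend stays configured
    DELETE = "delete"            # backend removed, as happens today


def decide(consecutive_failures, delete_threshold=2):
    """Pick an action from per-monitor consecutive failure counts.

    consecutive_failures maps a monitor name to the number of checks that
    have failed in a row (0 means the most recent check was healthy).
    delete_threshold is an assumed tunable: 2 matches the proposal,
    3 matches the "wait for a 3rd failure" alternative.
    """
    counts = list(consecutive_failures.values())

    # No monitors, or every monitor's latest check passed: full weight.
    if not counts or all(c == 0 for c in counts):
        return Action.POOLED

    # Any single monitor failing repeatedly is enough confirmation.
    if any(c >= delete_threshold for c in counts):
        return Action.DELETE

    # Several distinct monitors all failing their latest check counts as
    # confirmation too, even if none has had time to check twice.
    if len(counts) > 1 and all(c >= 1 for c in counts):
        return Action.DELETE

    # Otherwise at least one monitor failed once: soft depool only.
    return Action.WEIGHT_ZERO
```

For example, `decide({"http": 1})` gives `Action.WEIGHT_ZERO`, while `decide({"http": 1, "idle": 1, "ssh": 1})` gives `Action.DELETE` because every monitor's latest check failed. Recovery falls out of the same function: once the failing monitor's count resets to 0, the result goes back to `Action.POOLED`.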