
Sanitize varnish director-level retries
Closed, Resolved (Public)

Description

Some notes from early April on how .retries works, and what the values are in practice for various directors post-templating on our clusters (stolen from P485):

for all directors used below, the default "retries" is num_backends

the purpose of director-level "retries" is for skipping over
backends which have failed healthcheck probes; it does *not* apply
to e.g. retrying a 5xx for this single request.

for chash, retries is how many raw slots in the continuum will be
searched for retry, which must take into account chash slot
duplication (which is set by per-backend weight). e.g. if weights
are all 100 and there are 8 backends, a retries value of 8 in the
face of probe failures gets to check 8 adjacent slots in the
continuum, which has a total of 800 slots evenly mixed with the 8
backends...

backend_weight = 100 when used below

bits:
   all:      director retries (hash)   = default=num_backends
text:
   t1 be:    director retries (random) = default=num_backends
   all else: director retries (chash)  = default=num_backends <- probably quite bad
upload:
   t[12] fe: director retries (chash)  = default=num_backends <- probably quite bad
   t1 be:    director retries (random) = 2 <- why so tiny? random retries are cheap...
   t2 be:    director retries (chash)  = backend_weight * 4 <- probably a little bad
mobile:
   t[12] fe: director retries (chash)  = backend_weight * num_backends <- probably sorta-correct
   t1 be:    director retries (random) = 2 <- why so tiny? random retries are cheap...
   t2 be:    director retries (chash)  = backend_weight * 4 <- probably accidentally sorta-correct due to num_backends==4
parsoid: (copypasta from mobile?)
   fe:       director retries (chash)  = backend_weight * num_backends <- probably sorta-correct
   be:       director retries (hash)   = 2

See also code comment added during recent refactoring here: https://github.com/wikimedia/operations-puppet/blob/aaec215168aee691a0773749c454b5639052579c/modules/role/manifests/cache/2layer.pp#L16
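
To make the chash slot-duplication point above concrete, here's a rough simulation (illustrative only; a stand-in shuffled continuum, not Varnish's actual hashing):

```
# Toy model of a chash continuum: 8 backends at weight 100 -> 800 slots,
# pseudorandomly interleaved. "retries" counts adjacent raw slots walked,
# not distinct backends.
import random

num_backends = 8
weight = 100
slots = [b for b in range(num_backends) for _ in range(weight)]
random.shuffle(slots)

def distinct_backends_probed(start, retries):
    """How many distinct backends a walk of `retries` adjacent slots can reach."""
    return len({slots[(start + i) % len(slots)] for i in range(retries)})

start = random.randrange(len(slots))
print(distinct_backends_probed(start, num_backends))           # usually well short of all 8
print(distinct_backends_probed(start, weight * num_backends))  # the whole ring: always all 8
```

In other words, a retries value of num_backends only guarantees examining num_backends raw slots, which under widespread probe failures may never reach a healthy backend even when one exists.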

These should probably be fixed as follows:
random directors -> remove any manual .retries settings and let them default to the backend count, which effectively means that if any backend is healthy, a healthy one will be used
hash directors -> would normally require more thought, as with chash below, but we're only using them in two places (parsoid/bits) and it's all moot there. Either remove the config and let it default, or just leave it alone until these clusters go away.
chash directors -> ideally we should fix this behavior down in the chash code, so that it handles retries in some more elegant manner and lets us treat the retries parameter as relative to backend count like the other directors. In the interim, setting them all to $backend_weight_avg * $num_backends makes approximate sense (see the sketch after this list). Actually setting it to the sum of the weights would make even more sense, if that's relatively easy to do.
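
As a sketch of that policy (pseudo-config logic only, not the actual puppet templating; the helper name is illustrative):

```
# Hypothetical helper mirroring the proposed per-director retries policy;
# not real puppet or VCL.
def director_retries(dtype, weights):
    if dtype == "random":
        return None          # omit .retries; Varnish defaults it to num_backends
    if dtype == "hash":
        return None          # only parsoid/bits; let it default or leave it alone
    if dtype == "chash":
        # Interim fix: cover the whole continuum. sum(weights) is the exact
        # slot count; $backend_weight_avg * $num_backends approximates it.
        return sum(weights)
    raise ValueError(dtype)

print(director_retries("chash", [100] * 8))   # 800
```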

Event Timeline

BBlack claimed this task.
BBlack raised the priority of this task from to Medium.
BBlack updated the task description. (Show Details)
BBlack added projects: acl*sre-team, Traffic.
BBlack subscribed.

I should note there is some risk in upping the retries to $backend_weight_avg * $num_backends in the currently non-conforming chash cases (text/upload chashes): in a scenario where a large number of backends were marked unhealthy from a given consuming-varnish's perspective for a very brief window, many requests could needlessly pass through the wrong backends in chash-binning terms, which would hurt the cache's contents for consumers that aren't having such an issue. It would be nice to limit the fallout from this kind of scenario by having it fail through, say, only 1/4 of the backends rather than all of them. However, given how chash retries and weighting work, we have no reliable way to specify a fraction of the backends.

The existing cases like upload t2-be, where we're using $backend_weight_avg * 4, are probably trying to achieve that sort of effect, but with the intentional randomness of the continuum the effect isn't really predictable: in some cases it might still blow through every backend, and in others you may get fewer than 4 backends.

@Joe came up with some math so that we don't have to specify the whole size of the ring, but still get an X% chance of trying all backends' health before giving up. A patch defaulting all the chash directors to that and stripping non-default retries from the others is here: https://gerrit.wikimedia.org/r/#/c/212543
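
For illustration (not necessarily the exact math in that change, just the general shape of it): model each retried continuum slot as an approximately independent uniform pick among the n backends, and find the smallest retries count that covers all of them with probability at least p.

```
# Illustrative coverage calculation (coupon-collector style); assumes each
# retried slot is an independent uniform pick among n backends, which is
# only an approximation of walking adjacent chash continuum slots.
from math import comb

def coverage_probability(n, k):
    """P(all n backends seen at least once in k uniform picks), by inclusion-exclusion."""
    return sum((-1) ** i * comb(n, i) * ((n - i) / n) ** k for i in range(n + 1))

def retries_for_coverage(n, p=0.99):
    k = n
    while coverage_probability(n, k) < p:
        k += 1
    return k

print(retries_for_coverage(8))   # on the order of a few dozen, vs. 800 for the full ring
```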

^ The above change, plus a couple of follow-on syntax-nit fixups, is deployed. It doesn't seem to have broken anything, and hopefully improves some of the less-ideal situations from before.