Some notes from early April on how .retries works, and what the values are in practice for various directors post-templating on our clusters (stolen from P485):
for all directors used below, the default "retries" is num_backends the purpose of director-level "retries" is for skipping over backends which have failed healthcheck probes; it does *not* apply to e.g. retrying a 5xx for this single request. for chash, retries is how many raw slots in the continuum will be searched for retry, which must take into account chash slot duplication (which is set by per-backend weight). e.g. if weights are all 100 and there are 8 backends, a retries value of 8 in the face of probe failures gets to check 8 adjacent slots in the continuum, which has a total of 800 slots evenly mixed with the 8 backends... backend_weight = 100 when used below bits: all: director retries (hash) = default=num_backends text: t1 be: director retries (random) = default=num_backends all else: director retries (chash) = default=num_backends <- probably quite bad upload: t[12] fe: director retries (chash) = default=num_backends <- probably quite bad t1 be: director retries (random) = 2 <- why so tiny? random retries are cheap... t2 be: director retries (chash) = backend_weight * 4 <- probably a little bad mobile: t[12] fe: director retries (chash) = backend_weight * num_backends <- probably sorta-correct t1 be: director retries (random) = 2 <- why so tiny? random retries are cheap... t2 be: director retries (chash) = backend_weight * 4 <- probably accidentally sorta-correct due to num_backends==4 parsoid: (copypasta from mobile?) fe: director retries (chash) = backend_weight * num_backends <- probably sorta-correct be: director retries (hash) = 2
See also code comment added during recent refactoring here: https://github.com/wikimedia/operations-puppet/blob/aaec215168aee691a0773749c454b5639052579c/modules/role/manifests/cache/2layer.pp#L16
These should probably be fixed as follows:
random directors -> remove any manual .retries settings, use defaulting to backend count, which effectively means if any backend is healthy, that backend will be used
hash directors -> might normally require more thought, as below, but we're only using them in two places (parsoid/bits) and it's all moot there. Either remove the config for it and let it default, or just leave it alone until these clusters go away.
chash directors -> ideally, we need to fix this behavior down in the chash code so that code takes care of it in some more elegant manner and allows us to treat the retries parameter as if it were relative to backend count like the other directors. In the interim, setting them all the $backend_weight_avg * $num_backends makes approximate sense. Actually setting it to the sum of the weights would make even more sense, if that's relatively easy to do.