Unbridled retries are a serious issue (* - see Context below). We're now seeing a case where RESTBase drops the connection from Varnish with no response after a 2-minute timeout, and Varnish then retries these requests, many times in a row and/or multiplicatively, despite no such retry logic existing in our VCL.
The test URL given last night was: https://fr.wikipedia.org/api/rest_v1/page/html/Liste_des_plan%C3%A8tes_mineures_non_num%C3%A9rot%C3%A9es_d%C3%A9couvertes_en_2006/123378288 (appending various fake query parameters yields unique variants, for cache/coalesce purposes, which still fail).
What I've personally observed in testing this so far:
- When querying RB directly (using the mangled internal form of the URL, e.g. http://restbase.svc.eqiad.wmnet/fr.wikipedia.org/v1/page/html/Liste_des_plan%C3%A8tes_mineures_non_num%C3%A9rot%C3%A9es_d%C3%A9couvertes_en_2006/123378288), the most common behavior I've witnessed is that RB stalls for almost exactly 2 full minutes and then abruptly terminates the connection with zero bytes of output (no HTTP-level response).
- When leaving curl open on the public-facing test URL, I've seen it go at least 48 minutes (!!) with curl still waiting on a response, without our Varnishes ever closing the connection or returning anything.
- I did some custom logging while the above 48-minute query was running, using my own unique header fields to separate my queries from everyone else's (see the varnishlog one-liner below). The curl was pointed at the codfw frontends, and I logged only the traffic generated by my request, on the backend side (-b) of the codfw backends, to see what Varnish was sending to the applayer. The first result I got (several minutes in) was P4387, showing Varnish retrying at least 4 times internally without giving up, and without ever invoking VCL (where we wouldn't choose to retry, e.g. in vcl_backend_error) or any max_retries logic. Afterwards (unfortunately uncaptured) there were eventually many repeats of the same sequence in the shm log, spaced several minutes apart. Twice in that extended data there were other results: once it actually generated a 503 (which probably bubbled up to the frontend and allowed our usual single 503 retry, restarting the whole affair...), and another time it appeared to pass on a nonsensical RB-generated 413 response.
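For reference, this kind of filtering is easy with varnishlog's VSL query support; something like the following (the marker header name here is made up for illustration, and the real one I used doesn't matter):

```
# Backend-side (-b) records only, limited to bereqs carrying my marker header:
varnishlog -b -q 'BereqHeader ~ "X-Debug-Marker"'
```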
Some digging into Varnish's code based on the error trace in P4387 leads me to the conclusion that these retries are happening in the extrachance support in Varnish's vbe_dir_gethdrs, and they're potentially unbounded: under the right conditions, Varnish could retry a backend request indefinitely without ever asking VCL about it again. I've got a proposed workaround patch to disable extrachance completely in P4388. It's not a proper answer or a proper patch for upstream - there are better ways to go about fixing this, and perhaps even the notion of extrachance here could be salvaged with a more precise fix. But I think on balance, we're better off allowing intermittent failures without it than looping on uncontrollable bereq retries with it.
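To make the failure mode concrete, here's a minimal, self-contained simulation of the loop's shape - this is NOT the varnishd source, and the function names are hypothetical stand-ins. The point is that extrachance is only ever cleared when the connection drawn was not a recycled one, so if every attempt draws a recycled ("stolen") connection and then fails, nothing bounds the loop:

```
/* Simulation of the unbounded-retry shape (NOT varnishd code). */
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real connection/fetch logic: model the
 * pathological case where every attempt draws a recycled connection and
 * every fetch fails (e.g. the applayer drops it after a long stall). */
static bool conn_was_recycled(void) { return true; }
static bool fetch_failed(void)      { return true; }

int main(void)
{
    int extrachance = 1;  /* the flag P4388 proposes to hardwire to 0 */
    int attempt = 0;

    do {
        attempt++;
        /* The only non-success exit: a fresh (non-recycled) connection.
         * The "extra chance" rationale is that a recycled connection may
         * have been closed by the peer before the bereq arrived, so one
         * retry on a new connection is supposedly harmless. */
        if (!conn_was_recycled())
            extrachance = 0;
        if (!fetch_failed()) {
            printf("success on attempt %d\n", attempt);
            return 0;
        }
        /* Note: no VCL callback and no max_retries check happens here. */
        if (attempt >= 10) {  /* cap for the demo; the real loop has none */
            printf("still retrying after %d attempts...\n", attempt);
            break;
        }
    } while (extrachance);
    return 1;
}
```

With extrachance initialized to 0 instead (the P4388 approach), the do/while body runs exactly once, and a failure is surfaced immediately rather than looped on.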
Way back during T97204 and related work, we basically decided that the Traffic layer, viewed as a whole, should have the following basic retry behaviors on errors:
- At our outermost (closest to the user) edge: if a connection failure to the next immediate layer happens (i.e. no legitimate HTTP response, which we consider a synthetic or implicit 503 state and which would generate a 503 towards the client if left unmitigated), or a legitimate 503 response is received from the backend, retry the entire transaction exactly once (see the VCL sketch after this list).
- In all other cases (non-503 errors at the front edge; all errors in the intermediate/lower-level caches), errors should be forwarded on to the requesting "client" and not retried.
- Rationale: if we retry anything in the intermediate/backend layers, retries on persistent failures tend to multiply. Each layer that retries once doubles the attempt count seen by the layer below it, so even a single-retry policy applied at four layers becomes a 2^4 = 16-attempt policy from the global view of the whole Traffic layer. The singular retry of 503-like conditions at the front edge is the only exception, as it helps to paper over transient/intermittent issues to the users' benefit without causing multiplicative storms of requests at deeper layers of our infrastructure.
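Schematically, the front-edge exception looks something like the following. This is an illustrative sketch, not our actual production VCL (the placeholder backend exists only so the standalone snippet loads):

```
vcl 4.0;

# Placeholder backend so this standalone sketch is loadable.
backend default { .host = "127.0.0.1"; .port = "8080"; }

# Front-edge instances only: a failed fetch surfaces here as a synthetic
# 503, so this single check covers both the implicit and explicit 503 cases.
sub vcl_deliver {
    if (resp.status == 503 && req.restarts == 0) {
        # Retry the whole transaction exactly once, from the top.
        return (restart);
    }
}
```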