
Varnish4 is unexpectedly retrying certain applayer failure cases
Closed, Resolved · Public


Unbridled retries are a serious issue (* see Context below), and we're seeing a case now where RestBase is dropping the connection from Varnish with no response after a 2-minute timeout, and Varnish is then retrying these requests, many times in a row and/or multiplicatively, in spite of no such retry logic existing in our VCL.

The test URL given last night was: (various fake query-params can be applied to get a unique variant for cache/coalesce purposes which still fails).

What I've personally observed in testing this so far:

  1. When querying RB directly (using the mangled internal form of the URL, e.g. http://restbase.svc.eqiad.wmnet/), the most common behavior I've witnessed is that RB stalls for almost exactly 2 full minutes and then abruptly terminates the connection with zero bytes of output (no HTTP-level response).
  2. When leaving curl open on the public-facing test URL, I've seen it go at least 48 minutes (!!) with curl still waiting on a response, without our Varnishes ever closing the connection or returning a response.
  3. I did some custom logging, using my own unique header fields to single out my queries from others' while the above 48-minute query was running. The curl was to the codfw frontends, and I was logging only the traffic generated by my request, on the backend side (-b) of the codfw backends, to see what Varnish was sending to the applayer. The first result I got (several minutes in) was P4387, showing Varnish retrying at least 4 times internally without giving up or even invoking any VCL (where we wouldn't choose to retry, e.g. in vcl_backend_error) or max_retries logic. Afterwards (unfortunately uncaptured) there were eventually many other repeats of the same sequence in the shm log, spaced several minutes apart. Twice in that extended data there were other results: once it actually generated a 503 (which probably bubbled up to the frontend and allowed our usual single 503 retry, restarting the whole affair...), and another time it appeared to pass on a nonsensical RB-generated 413 response.

Some digging into Varnish's code based on the error trace in P4387 leads me to the conclusion that these retries are happening in the extrachance support in Varnish's vbe_dir_gethdrs, and that they're potentially unbounded: under the right conditions, Varnish could retry a backend request indefinitely without ever asking VCL about it again. I've got a proposed workaround patch to disable extrachance completely in P4388. It's not a proper answer or a proper patch for upstream - there are better ways to go about fixing this, and perhaps even the notion of extrachance could be salvaged with a more precise fix. But on balance, I think we're better off allowing intermittent failures without it than looping on uncontrollable bereq retries with it.
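As an illustrative model only (my construction, not Varnish source): the problematic control flow amounts to a retry loop in the header-fetch path whose bound depends on the extrachance condition. If a persistently failing backend keeps looking "retryable", nothing in the loop counts toward an upper bound or consults VCL, so it never terminates; forcing extrachance off yields exactly one attempt per bereq.

```python
def fetch_with_extrachance(attempt_fails, extrachance=1, max_probe=10):
    """Toy model of a gethdrs-style retry loop (NOT Varnish code).
    `attempt_fails(n)` says whether attempt n fails in a way that
    re-arms the extra chance.  `max_probe` stands in for 'unbounded'
    so the demo terminates."""
    attempts = 0
    while True:
        attempts += 1
        if not attempt_fails(attempts):
            return ("ok", attempts)
        if extrachance and attempts < max_probe:
            # failure looked retryable: take another chance.
            # Note: nothing here decrements toward a limit or
            # returns control to VCL / max_retries accounting.
            continue
        return ("fail", attempts)
```

With extrachance disabled, a persistent failure produces a single attempt and the error surfaces immediately, which is the behavior the P4388 workaround aims for.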

* Context

Way back during T97204 and related, we basically decided that the Traffic layers as a whole entity should have the following basic retry behaviors on errors:

  1. At our outermost (closest to the user) edge: if a connection failure to the next immediate layer happens (no legitimate HTTP response, which we consider a synthetic or implicit 503 state and which would generate a 503 towards the client if left unmitigated), or a legitimate 503 response is received from the backend, retry the entire transaction exactly once.
  2. In all other cases (non-503 errors in the front edge, all errors in the intermediate/lower-level caches), errors should be forwarded upstream to the requesting "client" and not retried.
  3. Rationale: if we retry anything in the intermediate/backend layers, retries on persistent failures tend to multiply. Even a single-retry policy, if applied at many layers, quickly becomes e.g. a 16-retry policy from the global view of the whole of the Traffic layer. The singular retry of 503-like conditions at the front edge is the only exception, as it helps to paper over transient/intermittent issues to the users' benefit without causing multiplicative storms of requests at deeper layers of our infrastructure.
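The multiplicative effect in point 3 is easy to check with a toy model (my arithmetic, not production code): each layer that retries once doubles the worst-case request count seen by the layer below it, so N single-retry layers turn one client request into up to 2^N requests at the deepest layer during a persistent failure.

```python
def worst_case_origin_requests(layers, retries_per_layer=1):
    """Worst-case requests reaching the origin when each of `layers`
    stacked caches retries a persistently failing request
    `retries_per_layer` extra times (toy model)."""
    attempts_per_layer = 1 + retries_per_layer  # original try + retries
    return attempts_per_layer ** layers
```

For example, four layers each applying a single-retry policy yield 2**4 == 16 origin requests, matching the "16-retry policy" figure above; retrying only at the front edge keeps the multiplier at the layers below it at 1.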

Event Timeline

Restricted Application added a subscriber: Aklapper.
BBlack renamed this task from Varnish4 is unexpected retrying certain applayer failure cases to Varnish4 is unexpectedly retrying certain applayer failure cases. Nov 8 2016, 11:52 AM
BBlack updated the task description.

Change 320371 had a related patch set uploaded (by Ema):
4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch

Change 320371 merged by Ema:
4.1.3-1wm3: Add 0005-remove_bad_extrachance_code.patch

Mentioned in SAL (#wikimedia-operations) [2016-11-08T12:52:28Z] <ema> upgrading pinkunicorn to varnish 4.1.3-1wm3 T150247

Pasting from IRC after testing the patch on cp1008/pinkunicorn (TL;DR - patch works as expected, bad retries gone):

13:04 < bblack> -   Timestamp      Resp: 1478610240.194538 240.007117 0.000094
13:04 < bblack> ^ works as expected
13:05 < bblack> backend fails with a 503 after 2 minutes.  frontend does the retry-503-once logic and does it all over again, for a total 
                503-to-user response time of 4 minutes.
13:05 < bblack> not ideal, but now we're into the realm of tuning timeouts and/or fixing RB's lack of output, or parsoid's long hang
13:06 < bblack> instead of the realm of "varnish gets stuck internally infinitely retrying this stupid request against the backend and never 
                returning to the client"
13:07 < bblack> the two times the backend process queried RB, the results were:
13:07 < bblack> -   RespStatus     503
13:08 < bblack> -   Timestamp      Resp: 1478610120.190677 120.002858 0.000037
13:08 < bblack> interestingly, that's on the -c side of the backend process.  the -b side of the backend process never logged anything heh
13:09 < bblack> on the -b side of the frontend, we see 2x requests as well with the expected intentional retry:
13:09 < bblack> -   Timestamp      Bereq: 1478610000.187740 0.000154 0.000154
13:09 < bblack> -   Timestamp      Beresp: 1478610120.190730 120.003144 120.002990
13:09 < bblack> -   BerespProtocol HTTP/1.1
13:09 < bblack> -   BerespStatus   503
13:09 < bblack> -   VCL_call       BACKEND_RESPONSE
13:09 < bblack> -   VCL_return     retry
13:09 < bblack> -   BackendClose   72 74975fde-eb2f-45b2-9391-cc7731567691.be_cp1008_wikimedia_org
13:09 < bblack> -   Timestamp      Retry: 1478610120.190779 120.003193 0.000049
13:09 < bblack> -   Link           bereq 551044 retry
13:09 < bblack> *   << BeReq    >> 551044    
13:09 < bblack> -   Begin          bereq 686688 retry
13:09 < bblack> -   Timestamp      Bereq: 1478610120.190888 120.003302 0.000109
13:09 < bblack> -   Timestamp      Beresp: 1478610240.194227 240.006641 120.003339
13:09 < bblack> -   BerespProtocol HTTP/1.1
13:09 < bblack> -   BerespStatus   503
13:09 < bblack> -   VCL_call       BACKEND_RESPONSE
13:09 < bblack> -   TTL            VCL -1 3600 0 1478610240
13:09 < bblack> -   VCL_return     deliver
13:10 < bblack> -   Timestamp      BerespBody: 1478610240.194471 240.006885 0.000244
13:10 < bblack> the -c side of frontend only logged once, after the whole 240 seconds was up, indicating 503-to-user
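The timestamps in the paste above fit a simple model (my arithmetic, using the timeout values visible in the paste): each backend attempt runs into RB's 2-minute timeout, and the frontend's intentional retry-503-once logic runs the whole thing a second time, so the client sees the final 503 at the 240-second mark.

```python
BACKEND_TIMEOUT_S = 120   # RB drops the connection after 2 minutes
FRONTEND_503_RETRIES = 1  # the intentional retry-503-once policy

# Total time until the client finally receives the 503:
# one original attempt plus one retry, each eating the full timeout.
total = (1 + FRONTEND_503_RETRIES) * BACKEND_TIMEOUT_S
```

This matches the Beresp timestamps of ~120s and ~240s and the final Resp timestamp of 240.007117 seconds.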

Mentioned in SAL (#wikimedia-operations) [2016-11-08T16:18:04Z] <ema> upgrading cache_text codfw to varnish 4.1.3-1wm3 T150247

Mentioned in SAL (#wikimedia-operations) [2016-11-08T17:19:19Z] <ema> upgrade finished -> cache_text codfw to varnish 4.1.3-1wm3 T150247

I've confirmed that with codfw upgraded (the DC that's currently facing RestBase directly), this is fixed even from ulsfo clients' perspective. This specific issue is basically fixed now, although we have some followup to do in two directions before closing this:

  1. File a bug upstream with Varnish and see what direction they want to go in here and how that affects us.
  2. Check in with RB that the current behavior is what they expect and that it has ended their request-storm behavior. If they don't already have an open bug about the RB 2-minute timeout->disconnect (with no response), file one: while the Varnish layers do need to handle that case, it should ideally be a rare result of bugs rather than expected normal behavior.

FYI, we are also bumping up the HTTP socket timeout in hyperswitch from 2 to 6 minutes.

This timeout was responsible for the occasional prematurely closed connections. The socket timeout would trigger when no other requests were sent on the same socket in the meantime.
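A minimal model of the behavior described above (my simplification, not hyperswitch code): the socket timeout is an idle timeout, so a request whose response takes longer than the timeout gets its connection dropped with no response, unless other requests on the same socket reset the idle clock or the timeout is raised past the processing time.

```python
def request_outcome(response_time_s, socket_timeout_s, other_traffic=False):
    """Idle-socket-timeout model: if nothing moves on the socket for
    longer than the timeout, the connection is closed mid-request
    with no HTTP response (the behavior observed against RB)."""
    if other_traffic or response_time_s <= socket_timeout_s:
        return "response delivered"
    return "connection dropped, no response"
```

Under this model, a request that needs 4 minutes of processing is dropped by a 2-minute socket timeout (the observed zero-byte disconnect) but survives the new 6-minute timeout.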

Mentioned in SAL (#wikimedia-operations) [2016-11-09T10:04:52Z] <ema> upgrading cache_text ulsfo to varnish 4.1.3-1wm3 T150247

Mentioned in SAL (#wikimedia-operations) [2016-11-09T11:09:44Z] <ema> finished upgrading cache_text ulsfo to varnish 4.1.3-1wm3 T150247

We've proposed a patch introducing a varnishd parameter limiting the number of extrachance retries:

Change 324229 had a related patch set uploaded (by Ema):
varnish: disable extrachance

Mentioned in SAL (#wikimedia-operations) [2016-11-29T16:49:49Z] <ema> setting gethdr_extrachance=0 on all cp* hosts T150247

Change 324229 merged by Ema:
varnish: disable extrachance

ema claimed this task.
ema added a subscriber: mobrovac.

Fix confirmed by @mobrovac, closing.

Change 517359 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop passing gethdr_extrachance to varnish

Change 517359 merged by Ema:
[operations/puppet@production] cache: stop passing gethdr_extrachance to varnish