
RFC: Re-evaluate varnish-level request-restart behavior on 5xx
Closed, Resolved · Public

Description

Our Varnish config currently retries 503 responses a fixed number of times, ranging from 1 (text) to 4 (Parsoid, bits).

To avoid exacerbating overload situations, we should only retry 503s if a Retry-After header signals that this is permissible.

See also: T97204: RFC: Request timeouts and retries

Event Timeline

GWicke raised the priority of this task to High.
GWicke updated the task description.
GWicke added subscribers: mobrovac, daniel, faidon and 5 others.
Restricted Application added a subscriber: Aklapper. Apr 25 2015, 2:06 AM
GWicke updated the task description. Apr 25 2015, 2:12 AM
GWicke set Security to None.
GWicke updated the task description. Apr 25 2015, 9:12 PM
BBlack moved this task from Triage to Up Next on the Traffic board. Apr 26 2015, 2:35 PM
BBlack claimed this task. Edited Apr 26 2015, 2:46 PM
BBlack added a subscriber: Arlolra.

Re: Retry-After: I don't think we can (or want to) do anything with a backend's timeout value here in varnish. We can't actually pause a request and hang around for e.g. 5 seconds at this level. Either we retry immediately (to a different backend from a pool), or we bail out with an error to the user. We could use it as a boolean signal, but then we're completely ignoring any time advice there. We could potentially use the cache obj/ttl infrastructure to try to make Retry-After work for further requests per-resource, but again (as I said elsewhere) I don't think Retry-After has per-resource semantics anyways. Retry-After is probably something we only want to set (if anywhere) at the very outer layer heading back to an outside-of-WMF consumer.

However, this whole varnish request-restart thing could use some clarity and re-thinking in general (I think I might retitle this to be a bit more generic).


Relatedly, since undoubtedly it's the parsoidcache case that triggered you looking at this: we weren't setting retry503 for the parsoidcache backend until very recently. @Arlolra asked for it and submitted a patch for it back on Mar 30, which I merged on Apr 17th:

https://gerrit.wikimedia.org/r/#/c/200732/
T96239
T94139

I don't think I ever actually got confirmation about whether merging this change had the intended effect in practice, and I now suspect it's a no-op in the presence of retry5xx as well, given my current understanding (see below).


The actual meanings of our current retry5xx and retry503 parameters in the code/manifests are not easy to grok at first glance, and neither are their implications or the reasons for using them. The below is my best attempt to explain the current behavior and the possible reasoning behind it, not necessarily to justify retaining it (you wouldn't believe how many times I had to question myself and research/edit the text below again...):

Our retry5xx parameter is a boolean, which controls (at the vcl_fetch level) whether we do a restart for any code in the range 500 - 504 received from a backend. If that boolean is true, then we retry the backend up to 4 total times (on the 4th attempt, req.restarts is 3), then give up and deliver the backend's error message to the client.

Our retry503 parameter is a number, and it controls how many times we restart a 503 at the vcl_error level before allowing vcl_error to deliver an error (commonly, a synthetic one we've created at the varnish layer). Note that I don't think this code ever gets a chance to run if retry5xx is set, making the clusters with both of them set kind of odd...

Keeping in mind that the restart count is global throughout varnish for one request: for a normal backend-generated 5xx, the flow would be to go through the vcl_fetch logic first, and if that doesn't restart the error(s), we'd start down the vcl_error logic, with max_restarts (4) being a global limit from varnish itself regardless of which of these is triggering them.
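As a hedged, testable sketch of the worst-case attempt counts described above (an illustrative Python model, not the actual VCL; since the interaction when both parameters are set is unclear, this assumes the retry5xx path wins, as suggested above):

```python
MAX_RESTARTS = 4  # varnish's own global per-request restart limit

def worst_case_attempts(status, retry5xx=False, retry503=0):
    """Worst-case backend attempts within a single varnish for a
    persistently failing response. Illustrative model only."""
    if retry5xx and 500 <= status <= 504:
        # vcl_fetch path: up to 4 total attempts (req.restarts 0..3)
        return 4
    if status == 503 and retry503 > 0:
        # vcl_error path: up to retry503 restarts, bounded by the
        # global restart limit
        return 1 + min(retry503, MAX_RESTARTS)
    return 1

print(worst_case_attempts(502, retry5xx=True))  # 4
print(worst_case_attempts(503, retry503=1))     # 2
```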

The current settings per cluster/layer (none make a tier distinction) are:

cluster/layer    retry5xx    retry503
text fe/be          0           1
upload fe/be        0           0
mobile fe           0           0
mobile be           1           4
parsoid fe          0           0
parsoid be          1           4
bits/misc           1           4

Whether or not it makes sense to do any of these retries is complicated. In some cases I think they're justified (especially in the cases where the backend-in-question is behind LVS and will likely get to try a completely-different backend that may not be so broken), but I need to do some more research to discover some of the subtler behaviors going on here at various tiers/layers...

BBlack renamed this task from "RFC: Don't retry 503 unless allowed by Retry-After in Varnish" to "RFC: Re-evaluate varnish-level request-restart behavior on 5xx". Apr 26 2015, 2:48 PM

@BBlack, will Varnish consider Varnish-side backend request timeouts as equivalent to 5xx?

daniel moved this task from Inbox to Under discussion on the TechCom-RFC board. Apr 29 2015, 8:15 PM

I don't think I ever actually got confirmation about whether merging this change had the intended effect in practice

@BBlack Sorry for the slow reply. Unfortunately, no, it did not have the intended effect. Varnish is still returning its own 503 page, rather than delivering the backend's error message to the client.

You can verify that by pasting the content at https://etherpad.wikimedia.org/p/parsoid-crash into http://parsoid-lb.eqiad.wikimedia.org/_html/ and submitting.

The "fix" I provided was just to match the other backend configs, which I assumed were working. I suspect that isn't the case.

Just to touch base on this issue, in the RFC meeting the basic recap on varnish-layer retry behavior was:

21:27:23 <TimStarling> I think I am leaning towards: retry on connection refused, retry on connection timeout, don't retry any other time
21:27:44 <mark> so don't retry when any work has been done
21:27:47 <gwicke> retry on connection timeout is tricky
21:27:50 <TimStarling> don't retry on server-side abort after request sent (the segfault case)

Although after the bell had rung, we also commented:

21:30:55 <paravoid> (fwiw, I don't agree with the "don't retry any other time" above)
21:31:32 <bblack> (re: paravoid: I tend to agree, we could still do a singular retry of 5xx at the outermost layer, and save a fair chunk of user-visible failings)

What we have today in practice is we already lack 5xx retries for the upload cluster, but we have various sorts of 5xx retries detailed earlier in this ticket for the other clusters, and in general those tend to cascade multiplicatively. For example, the text cluster does a single retry on 503 at every layer. So if there was a persistent 503 result from the applayer for a given request, a single request coming in through esams would re-hit the applayer for a 503 result 8 times (2 attempts from esams-fe -> 4 attempts from esams-be -> 8 attempts from eqiad-be -> app).
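The multiplication can be sketched as follows (illustrative Python, assuming each cache layer's retries compound independently on a persistent failure):

```python
def edge_to_app_attempts(retries_per_layer):
    """Worst-case applayer hits for one inbound request when every
    cache layer retries a persistent 503. Illustrative model only."""
    attempts = 1
    for retries in retries_per_layer:
        attempts *= 1 + retries
    return attempts

# esams-fe -> esams-be -> eqiad-be, each with a single 503 retry:
print(edge_to_app_attempts([1, 1, 1]))  # 8
```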

While there's some disagreement re: limited/single-retry vs no-retry in these 5xx sorts of cases, I think we're all on the same page that this cascading, multiplying retry is probably undesirable, and shifting from the current practice to a single-retry from the outermost (fe) layer would be a good gradual step towards eventual no-retry regardless, so that we can feel out the pragmatic effects of retry reduction. There's still a little research to do in general on how to clean up our current retry503/retry5xx feature overlap and such first, and verify how timeout/refuse of initial connection is handled and/or can be tuned.

BBlack added a comment. Edited May 22 2015, 3:47 PM

Not directly as part of sorting out this whole affair, but more as a pragmatic matter, I've turned on retry503 behavior in the upload-fe case (non-cascading single retry at the outermost varnish level). The upload cluster was the only one that consistently caused a stats 503 spike any time we rebooted a machine without depooling first, and that single retry in the frontend eliminated those spikes. The change was in https://gerrit.wikimedia.org/r/#/c/212788/ . I think in the broader sense this does speak to the utility of having varnish restart varnish<->varnish 503's at least once, but the harder part is figuring out how to segregate our behavior based on whether it's an internally-generated 503 from connectfail/timeout, internally-generated 503 from an abort mid-request, or external 503 handed to us by the upstream.

Change 215946 had a related patch set uploaded (by BBlack):
mobile: retry503x1 in fe, no other retries

https://gerrit.wikimedia.org/r/215946

Change 215946 merged by BBlack:
mobile: retry503x1 in fe, no other retries

https://gerrit.wikimedia.org/r/215946

BBlack added a comment. Jun 4 2015, 6:07 PM

The current state of affairs is now this updated table:

cluster/layer    retry5xx    retry503
text fe             0           1
text be             0           0
upload fe           0           1
upload be           0           0
mobile fe           0           1
mobile be           0           0
parsoid fe          0           0
parsoid be          1           4
bits/misc           1           4

Basically, I've left parsoid/bits/misc alone for now. For the common 2-layer caches (upload/text/mobile), they're all now set to a single retry of 503s only, at the outermost (frontend) layer; there are no 503 retries at deeper layers (which could cascade and multiply underneath those frontend retries), and no retries of other 5xx codes at all.

Change 216438 had a related patch set uploaded (by BBlack):
turn text/mobile-be retry503 behavior back on T97206

https://gerrit.wikimedia.org/r/216438

Change 216438 merged by BBlack:
turn text/mobile-be retry503 behavior back on T97206

https://gerrit.wikimedia.org/r/216438

BBlack added a comment. Jun 6 2015, 7:53 PM

The graphs since turning off retry503 on the text/mobile backends have had a lot of odd spikes. I suspect that while the cascading nature of those retries is something we want to avoid, they may be all that's preventing those spikes reaching the users from something going on at a deeper level. The commit above puts them back in place to confirm whether the spikes were related or not.

Change 219096 had a related patch set uploaded (by BBlack):
Disable backend-layer retry503 for text and mobile

https://gerrit.wikimedia.org/r/219096

^ New patch above goes back to the "single 503 retry" behavior from Jun 4 and aligns with the RFC action item in: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-06-17-21.01.html

Change 219096 merged by BBlack:
Disable backend-layer retry503 for text and mobile

https://gerrit.wikimedia.org/r/219096

GWicke moved this task from Backlog to Under discussion on the RESTBase board. Jun 29 2015, 5:27 PM
Restricted Application added a subscriber: Matanya. Jun 29 2015, 5:27 PM

TechCom basically approves, details will be worked out in gerrit review. @tstarling will write down the recommended approach to this and T97204: RFC: Request timeouts and retries on-wiki in Category:Development_guidelines.

BBlack moved this task from Up Next to Done on the Traffic board. Sep 22 2015, 2:00 PM
GWicke added a comment. Edited Oct 8 2015, 8:46 PM

We just found an instance where a known external client retrying up to five times caused 72 failing parsoid requests, mostly by successive layers retrying. Most of those retries seem to come from the Varnish layers retrying on top of each other. RESTBase for example saw 36 requests, an amplification of almost 8 from the client's five retries. The client is in Europe, so this sounds consistent with each of three Varnish layers retrying once.

@BBlack, is there a way to limit retries to one Varnish layer only?

(As an aside, I still think that ideally we wouldn't have any Varnish-level retries on any 5xx other than a 503 with an explicit Retry-After, but I understand that this isn't possible with the current state of MW.)

GWicke added a comment. Edited Oct 8 2015, 11:04 PM

I just did an experiment with a page that reliably fails in Parsoid:

curl https://ia.wikibooks.org/api/rest_v1/page/html/Interlexico_c/4697

This timed out after four (possibly two) minutes returning a 504 (as returned by RB).

The weird thing is that RESTBase continues to receive retries for this request, despite it having timed out already. Each of those curl requests (one each from labs and the office) has produced 16 requests in the RESTBase logs. Each of those is handled at a different RESTBase host and has a different request ID, so this retrying is happening somewhere between the curl client and RESTBase.

RESTBase will time out each of those requests after four minutes, after having tried twice to get a response from Parsoid with a timeout of two minutes. The retries in the logs are spaced out about once every two minutes, so it seems likely that higher layers have a shorter timeout than the four minutes in RESTBase.

Result of searching for Interlexico_c in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase:

BBlack added a comment. Oct 9 2015, 1:22 PM

Unless we have a bug, we already have the behavior you're asking for, on the primary clusters. Varnish only retries 503 (not other 5xx), and only does so once at a single layer (the frontend layer). However, we never made those changes to the cache_parsoid cluster, as it's fragile and different with lookups that loop on itself, and was expected to be deprecated or removed by now.

Are you sure that the cache_parsoid varnish cluster is not (perhaps indirectly) involved in the behavior you're seeing?

GWicke added a comment. Edited Oct 9 2015, 3:32 PM

Are you sure that the cache_parsoid varnish cluster is not (perhaps indirectly) involved in the behavior you're seeing?

Yes, as indicated by the domain used, and reflected in the XFF headers. These requests are hitting the text varnishes, which in turn are hitting restbase.svc.eqiad directly.

I am pretty sure the retry behavior is a bug. It is possible that it is a bug that's specific to the way the RESTBase route or backend is configured in VCL.

Krinkle updated the task description. Oct 28 2015, 8:29 PM
BBlack closed this task as Resolved. Apr 20 2016, 1:34 PM

We haven't seen any movement on this ticket in months, and AFAIK the varnishes are doing the best pragmatic thing we could figure out when we last discussed this: the entire Traffic infrastructure as a whole (all of the nginx/varnish instances involved in any request) does at most 1 global retry (two total requests) for any single inbound request that reaches the front edge of our infrastructure. There is no multiplication of retries through our traffic layers anymore.

In pseudo-code terms, the very front edge varnish basically does:

if (response.code == 503 && request.restarts == 0) {
    request.restarts++;
    request.restart();
}

No other layers or varnishes ever restart/retry requests. If there's ever any evidence we're doing differently, please open a separate bug, as that's just a bug rather than a design discussion.