
RFC: Request timeouts and retries
Closed, Resolved · Public

Description

To avoid retries amplifying overload situations, we should adhere to the following rules in a client-server pair:

  1. Server request timeouts are set (slightly) shorter than client timeouts.
  2. When reaching the request timeout in a server, all request-associated resources are released and a response with a 503 status code is sent. If a retry is permissible, the retry delay is specified with a Retry-After header, like this: Retry-After: 120.
  3. Clients follow HTTP semantics when receiving a response with status 503: It is only legal to retry if Retry-After is specified, respecting the delay.

With multiple layered services, this works out to a staggering of timeouts, with the lowest level using the shortest possible timeouts. By waiting for the server response, clients can check the status for a 503 response, and avoid retrying altogether.
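
As an illustrative sketch of how a client could implement rules 2 and 3 (using the standard fetch API available in modern Node and browsers; the function name, the single-retry policy, and the delta-seconds parsing are assumptions, not an existing implementation):

```
// Sketch only: treat 503 as non-retriable unless the server supplies a
// Retry-After delay, and respect that delay before a single retry.
// Assumes the delta-seconds form of Retry-After (e.g. "Retry-After: 120").
async function fetchRespectingRetryAfter(url: string, clientTimeoutMs: number): Promise<Response> {
  const res = await fetch(url, { signal: AbortSignal.timeout(clientTimeoutMs) });
  if (res.status !== 503) {
    return res;                                  // success or a non-503 error: pass it up unchanged
  }
  const retryAfter = res.headers.get('Retry-After');
  if (retryAfter === null) {
    return res;                                  // no Retry-After: retrying is not legal, surface the 503
  }
  const delayMs = Number(retryAfter) * 1000;
  await new Promise(resolve => setTimeout(resolve, delayMs));
  return fetch(url, { signal: AbortSignal.timeout(clientTimeoutMs) });  // one retry after the advertised delay
}
```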

We should also aim to not expose any API end points with timeouts longer than 60 seconds. This won't be possible immediately, but we should eliminate exceptions step by step. Most end points should have timeouts significantly below 60s, with large tasks performed with paging or other client-side iteration.

With tight backend timeouts, clients can detect hanging backends by the absence of a timely 503 response before reaching the client timeout. In most situations the percentage of unhealthy nodes is low, and a retried request has a high chance of being routed to a healthy backend and processed successfully. Retries after client timeouts should:

  1. be limited to 2 retries,
  2. use increasing timeouts with a fuzz factor.

Assuming a low percentage of hanging backends, two retries have a very high probability of routing one request to a healthy backend. Increasing timeouts cater to overload situations in the backend, which might cause timeouts to not trigger in a timely manner. They also delay retries beyond a point where temporary backend issues might be resolved. By adding some randomness to the timeout, a 'thundering herd' scenario can be avoided. An additional delay before retrying can further help, although given the limited time available for processing the request overall it might be preferable to work with timeouts only.
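
A hedged sketch of such a retry policy (the base timeout, growth factor, and fuzz range are placeholder numbers to be tuned per service; the 503/Retry-After handling sketched above would apply within each attempt):

```
// Sketch: at most 2 retries after client timeouts, each attempt using a longer,
// slightly randomized timeout so retries don't line up into a thundering herd.
// All numbers below are assumptions, not recommendations.
async function fetchWithStaggeredRetries(url: string): Promise<Response> {
  const baseTimeoutMs = 2_000;
  const growth = 2;
  const maxRetries = 2;
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const fuzz = 0.8 + Math.random() * 0.4;                          // +/- 20% randomness
    const timeoutMs = baseTimeoutMs * Math.pow(growth, attempt) * fuzz;
    try {
      return await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    } catch (err) {
      lastError = err;                                               // timed out or failed to connect; try once more
    }
  }
  throw lastError;
}
```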

Event Timeline

GWicke raised the priority of this task from to High.
GWicke updated the task description. (Show Details)
GWicke added subscribers: GWicke, ori.
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)
GWicke renamed this task from "Make sure timeouts are staggered, and there are no retries on timeout from a lower level" to "RFC: Request timeouts and retries". Apr 25 2015, 1:52 AM
GWicke added a project: TechCom-RFC.
GWicke edited subscribers, added: tstarling, BBlack, Joe and 3 others; removed: Aklapper.

I don't actually think Retry-After has the right semantics here. I think it applies globally to the whole service, and thus if a server sent a client (both being services within our architecture) a value of 120 here it would mean "this whole service shouldn't be used for 120s", not just the one request/URL in question. Even if we defined a different header of our own to have the internal semantics that the retry delay applied to a specific resource only, would the consuming service really want to maintain a table of "if more requests come in that require fetching resource X from service Y, we can't do that until T"?

I think a better overall pattern (at least, as a starting point for thinking these things through, with possible exceptions) would be that if the request to some backend server times out or returns 5xx, the consuming client should immediately return 5xx as well.

@BBlack, I haven't seen anything explicitly stating that Retry-After is global to a service. The only thing hinting at that is afaik 503 meaning Service unavailable.

I think from a practical perspective a per-request interpretation makes a lot more sense. It is simpler to implement, and does not require guessing what constitutes a 'service'. I'm pretty sure that in practice basically all implementations of Retry-After follow the per-request model.

I think a better overall pattern (at least, as a starting point for thinking these things through, with possible exceptions) would be that if the request to some backend server times out or returns 5xx, the consuming client should immediately return 5xx as well.

Agreed. This is in fact already the case, with the only exception being Varnish. The issue, though, is that client timeouts are commonly shorter than backend timeouts, so those 5xx responses are not necessarily received.

@BBlack, I haven't seen anything explicitly stating that Retry-After is global to a service. The only thing hinting at that is afaik 503 meaning Service unavailable.

I think from a practical perspective a per-request interpretation makes a lot more sense. It is simpler to implement, and does not require guessing what constitutes a 'service'. I'm pretty sure that in practice basically all implementations of Retry-After follow the per-request model.

All the complications of defining a virtual "service" as an entity orthogonal to the hierarchy of HTTP server names and such aside, I think the wording of the relevant RFCs (Older, Newer) is pretty clear that 503 and Retry-After apply to the entire "server" from an HTTP perspective, not to a specific URL, as it talks about "server overload" and such. If we want something else with per-resource semantics, we can define our own internal headers for that.

Honestly, I don't think it makes sense to try to define a Retry-After-like timeout at all for these kinds of failures, especially in the context of service<->service within our architecture. It's not like the failing server ever has any realistic idea of when, in the future, the condition causing the error will be fixed. Ideally we just don't retry things at any deeper level within our architecture at all, and then *maybe* retry things and/or set an outer-layer Retry-After at the entry point to our whole stack (up in varnish, facing the user or outside consuming service), but cautiously (which we can discuss further down in that ticket). It's not like we can rely on anyone actually obeying a Retry-After to the outside world in the general case, though.

This is in fact already the case

^ It certainly wasn't up until recently. It might be now with some recent changes?

Perhaps the confusion on that last point above is about timeouts vs. 503s. I think reactions to hard timeouts and 503s need to have the same behaviors from a consuming service's perspective.

All the complications of defining a virtual "service" as an entity orthogonal to the hierarchy of HTTP server names and such aside, I think the wording of the relevant RFCs (Older, Newer) is pretty clear that 503 and Retry-After apply to the entire "server" from an HTTP perspective, not to a specific URL, as it talks about "server overload" and such. If we want something else with per-resource semantics, we can define our own internal headers for that.

Yeah, but 'server' was probably a simpler concept back then ;) With LVS, backend services, etc., we could even be obeying the RFC's SHOULD despite not having to.

Honestly, I don't think it makes sense to try to define a Retry-After-like timeout at all for these kinds of failures,

Yes, can't think of many use cases either, except maybe during transient events like a service restart.

This is in fact already the case

^ It certainly wasn't up until recently. It might be now with some recent changes?

No, this has always been the case in Parsoid and RESTBase. That case didn't trigger, as the client timed out before any server (5xx) response was received.

I think reactions to hard timeouts and 503s need to have the same behaviors from a consuming service's perspective.

This would mean that we'd never retry at all, even if we timed out already while getting a TCP connection, or waited past the server's timeout.

The usual policy for internal network services, e.g. MW contacting MySQL, search, Redis, Swift, etc. has been to not retry at all. I think this is a good policy for cases when you can pass the error upstream and have no special availability requirements.

Apparently the reason Parsoid needs to retry is because it does not have the ability to fail gracefully when the API fails. According to Subbu on https://gerrit.wikimedia.org/r/#/c/206362/ , permanent API failures will "leave holes (templates, extensions, citations) in HTML", which I gather means that it will permanently pollute varnish cache and RESTBase with corrupted output. I think that's an issue that needs to be addressed in Parsoid.

When running from the job queue, Parsoid is a special case in that it has relaxed latency requirements, and so it could trade latency for availability. Parsoid workers could sleep indefinitely while they wait for API downtime to be rectified. But it would still need to have code to handle permanent failures gracefully since some Parsoid requests have upstream clients waiting for them and need to be handled promptly.

Server request timeouts are set (slightly) shorter than client timeouts.

That is probably a good idea, although with the following caveats:

  • We don't always have the ability to set a server-side timeout; it depends on the platform. Also, on some platforms, a server timeout is implemented by setting a flag which is checked later, sometimes seconds later. That is the case with HHVM.
  • We sometimes set very long timeouts on write operations, to protect database consistency. These timeouts can be longer than the client timeout. And we don't always have the ability to distinguish between write requests and read requests.

When reaching the request timeout in a server, all request-associated resources are released and a response with a 503 status code is sent. If a retry is permissible, the retry delay is specified with a Retry-After header, like this: Retry-After: 120.

Yes, I suppose it makes sense to send Retry-After in an overload condition. I don't think you can send a Retry-After header unconditionally on an HHVM timeout, since they are usually permanent.

Note that MW already sends a Retry-After header to support the maxlag feature.

We should also aim to not expose any API end points with timeouts longer than 60 seconds. This won't be possible immediately, but we should eliminate exceptions step by step. Most end points should have timeouts significantly below 60s, with large tasks performed with paging or other client-side iteration.

Hitting a server timeout is inefficient, since the work is wasted. Even if you propagate the error all the way back to the end user, there's a chance they will hit the refresh button, and then that work will be wasted all over again. In an overload, efficiency can fall to zero since all CPU time is used to perform work that will be discarded. So if demand is 120% of capacity, you end up serving 0%. If you have an infinite timeout but with a concurrency limit, 120% demand would give 100% valid output and 20% error messages. So there is a strong case for tuning server-side timeouts so that they are almost never hit.

Why have a server-side timeout at all? Well, a server-side timeout allows you to limit the worst-case performance per request, which has some benefit for stability in the case of a low request rate of extremely expensive requests. You can look at this in terms of allocation of CPU resources among competing clients. If one set of clients demands 1% of CPU, and another demands ∞% of CPU, it makes sense to prioritise the cheap clients over the expensive clients. A timeout is a very crude way to achieve this prioritisation. Say you have two clients sending requests at an equal rate, but responses to client A take 1s and responses to client B take as long as the server-side timeout, say 99s. Then you have allocated 1% of CPU to client A and 99% of CPU to client B. With an infinite timeout, you have allocated 0% to A and 100% to B. So it is a little bit better with a finite timeout.

PoolCounter is an attempt at less crude prioritisation of competing clients. It limits concurrency instead of execution time, thus avoiding wasted work, and it identifies sets of competing clients in a flexible, application-dependent way. It uses a fixed concurrency limit per application-defined pool, which effectively means a fixed proportion of server resources per application-defined pool.

When the concurrency limit is exceeded, you have detected an overload, and so you could unconditionally send a Retry-After header in response.
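
A hedged sketch of that overload response, with an in-process counter standing in for PoolCounter (the limit, the 30-second delay, the port, and the work function are all made-up values):

```
import { createServer, IncomingMessage, ServerResponse } from 'http';

// Sketch: a crude in-process stand-in for a PoolCounter-style limit. Exceeding
// the concurrency limit is treated as overload, so the server sheds the request
// with 503 and an unconditional Retry-After.
const POOL_LIMIT = 50;
let active = 0;

async function doExpensiveWork(req: IncomingMessage): Promise<string> {
  return 'result';                                   // placeholder for the real request handling
}

createServer(async (req: IncomingMessage, res: ServerResponse) => {
  if (active >= POOL_LIMIT) {
    res.writeHead(503, { 'Retry-After': '30' });     // overload detected: tell the client when to come back
    res.end('overloaded');
    return;
  }
  active++;
  try {
    const body = await doExpensiveWork(req);
    res.writeHead(200);
    res.end(body);
  } finally {
    active--;                                        // release the slot even if the work failed
  }
}).listen(8080);
```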

Another reason to have a finite server-side timeout is psychology. After waiting for a minute, the human wants an explanation. Ideally they want a progress bar, but failing that, we can at least send an apologetic error message. This does not create a requirement to abort backend processing. If, after client connection shutdown, caches can still be populated, or writes can still be done, it may make sense to continue the request. If aborting the client request necessarily requires throwing work away, then it makes sense to abort backend requests or to arrange for them to time out before we send our apology.
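
And a sketch of the "apologise but keep going" variant, assuming the work can still usefully populate a cache after the client has been answered (the cache, the work function, and the 60-second deadline are hypothetical; error handling for a failing work function is left to the caller):

```
import { ServerResponse } from 'http';

// Sketch: apologise to the waiting client after a deadline, but let the
// expensive work run to completion so it can still populate a cache.
async function respondOrApologise(
  res: ServerResponse,
  doWork: () => Promise<string>,
  cacheKey: string,
  cache: Map<string, string>,
  deadlineMs = 60_000,
): Promise<void> {
  const work = doWork();                             // keep a handle on the in-flight work
  const outcome = await Promise.race([
    work.then(() => 'done' as const),
    new Promise<'timeout'>(resolve => setTimeout(() => resolve('timeout'), deadlineMs)),
  ]);
  if (outcome === 'timeout') {
    res.writeHead(503, { 'Content-Type': 'text/plain' });
    res.end('Sorry, this is taking longer than expected.');           // the apologetic error message
    work.then(result => cache.set(cacheKey, result)).catch(() => {}); // still fill the cache when it finishes
  } else {
    res.writeHead(200);
    res.end(await work);
  }
}
```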

With tight backend timeouts, clients can detect hanging backends by the absence of a timely 503 response before reaching the client timeout. In most situations the percentage of unhealthy nodes is low, and a retried request has a high chance of being routed to a healthy backend and processed successfully.

Do you have a model for a "hanging backend"?

If the server is simply overloaded, then LVS will stop sending new requests to it when the concurrency is elevated above the cluster mean, which will be a small number if the cluster is not overloaded. If you have a server-side timeout, then as each request times out, LVS will send the server another request. So the latency burden is spread out over more requests, but it is increased due to the overhead of wasted work. An infinite timeout minimises the total latency impact.

Apparently the reason Parsoid needs to retry is because it does not have the ability to fail gracefully when the API fails. According to Subbu on https://gerrit.wikimedia.org/r/#/c/206362/ , permanent API failures will "leave holes (templates, extensions, citations) in HTML", which I gather means that it will permanently pollute varnish cache and RESTBase with corrupted output. I think that's an issue that needs to be addressed in Parsoid.

This is primarily an issue for editing, which has more stringent requirements so that we don't corrupt wikitext on save. The biggest problem here is unbalanced templates: if a template's expansion includes an opening or closing tag, then a failure to expand it changes the DOM structure, and hence the template encapsulation, and it is unclear (in general) what edits will do to the serializability of those structures. There is nothing intelligent Parsoid can do to fix that in all scenarios. The safest thing to do is to disable editing on such pages. We could do progressively smarter things for failures in different scenarios, but I am not convinced it is worth the complexity beyond disabling editing on those pages. That is something we could consider.

Do you have a model for a "hanging backend"?

I don't have a comprehensive model, but can give you examples we encountered in the past.

One would be a DNS change not being picked up by pybal, which caused LVS to continue sending requests to an IP that wasn't there any more. Some percentage of requests would fail to even set up a TCP connection, but then succeed when retried on reaching the timeout. This incident prompted us to set up a separate TCP connect timeout for a timely retry when no connection can be established at all.

Another is individual HHVM nodes exhausting their threads, or being close to OOM. This was fairly common in the early days of HHVM, but still occurs occasionally. Load balancing is not perfect, and the PHP API exposes fairly expensive end points that can tie up threads for a long time.
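
The first example is what motivated the separate connect timeout; a sketch of that split using Node's http client (the one-second connect timeout and 30-second inactivity timeout are illustrative values, not our configuration):

```
import { request } from 'http';

// Sketch: a short, separate TCP connect timeout so that "nobody is listening on
// that IP" is detected (and can be retried) quickly, independently of the much
// longer timeout for the rest of the request.
function getWithConnectTimeout(url: string, connectTimeoutMs = 1_000, idleTimeoutMs = 30_000): Promise<string> {
  return new Promise((resolve, reject) => {
    const req = request(url, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(body));
    });
    const connectTimer = setTimeout(() => req.destroy(new Error('connect timeout')), connectTimeoutMs);
    req.on('socket', socket => {
      if (!socket.connecting) {
        clearTimeout(connectTimer);                            // reused keep-alive socket, already connected
      } else {
        socket.once('connect', () => clearTimeout(connectTimer));
      }
    });
    req.setTimeout(idleTimeoutMs, () => req.destroy(new Error('request timeout')));  // socket inactivity timeout
    req.on('error', reject);
    req.end();
  });
}
```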

Hitting a server timeout is inefficient, since the work is wasted.

It is true that work is wasted, but it is easy to waste a lot more time without timeouts. Clients don't wait for servers indefinitely. Browsers typically time out the connection after 300 seconds, and users will often move on before then. A request that times out after 12 minutes (old Zend timeout) thus ends up wasting 12 minutes of CPU time without making any progress, in addition to inviting a DOS. An operation limited to 60 seconds would have to be retried 12 times to waste the same amount of CPU time, and such requests are easier to limit based on rates.

A lack of tight timeouts also makes it hard for clients to provide reasonable latency guarantees. Clients are basically forced to set shorter timeouts themselves, which again wastes more server-side resources than necessary, as the request continues to be processed after the client is already gone.

Tight timeouts also encourage us to bound processing times in the implementation by setting reasonable limits, designing for more iterative processing, or optimizing. We already have logs for 5xx responses, and can use this to systematically track down end points that take too long.

Apparently the reason Parsoid needs to retry is because it does not have the ability to fail gracefully when the API fails.

There are basically two options:

  1. fail the entire request, or
  2. don't fail the entire request, but represent the failed content in unexpanded state, which round-trips it but does not look as expected.

Currently Parsoid follows 2) with some amount of retrying (1 retry at present). It would be fairly straightforward to switch to 1) and fail the entire request on any error. The disadvantage, though, is that failed API requests are most likely to occur on huge pages with > 100 transclusions. If a single transient request failure causes the full parse to be retried in the client or outer API layer, then all those other requests will be retried too. This can make the problem significantly worse in overload situations, when individual requests are more likely to fail.

Also, on some platforms, a server timeout is implemented by setting a flag which is checked later, sometimes seconds later. That is the case with HHVM.

We could account for that by allowing sufficient margin between the server-side timeout and the client timeout.
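
For example (a sketch with made-up numbers; the actual slack would need to be measured per platform):

```
// Sketch: derive the client timeout by staggering it above the server's nominal
// timeout, leaving slack for coarse-grained server-side timeout checks (the HHVM
// flag case) and for the 503 to travel back. All numbers are assumptions.
const serverTimeoutMs = 55_000;     // what the backend is configured to enforce
const timeoutCheckSlackMs = 3_000;  // how late the backend might notice its own timeout
const networkMarginMs = 1_000;      // time for the 503 response to reach the client
const clientTimeoutMs = serverTimeoutMs + timeoutCheckSlackMs + networkMarginMs;  // 59s, still under the 60s goal
```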

We sometimes set very long timeouts on write operations, to protect database consistency.

There are certainly cases where it makes sense to only time out the client connection, while still finishing an operation server-side. IIRC there is support for that in PHP as well.

Apparently the reason Parsoid needs to retry is because it does not have the ability to fail gracefully when the API fails.

There are basically two options:

  1. fail the entire request, or
  2. don't fail the entire request, but represent the failed content in unexpanded state, which round-trips it but does not look as expected.

We can do roundtripping, but it is unclear how editing will be affected since the DOM structure can change. For many templates, we can just alienate the individual transclusion, but in the general case of unbalanced templates, those effects will percolate elsewhere. So, the simplest solution is to disable editing, which may not be a bad solution given that these scenarios are expected to be rare, perhaps combined with a mechanism to trigger regeneration at a later time.

I might have understated the ability of our current template encapsulation to handle this ... because of the changed DOM structure, the edit experience will degrade and the WYSIWYG experience will be broken on these pages, but selser might still handle it properly as long as we are able to preserve the right source offsets and add special markup on the failed encapsulation. Moving further discussion to T97649.

Do you have a model for a "hanging backend"?

I don't have a comprehensive model, but can give you examples we encountered in the past.

One would be a DNS change not being picked up by pybal, which caused LVS to continue sending requests to an IP that wasn't there any more. Some percentage of requests would fail to even set up a TCP connection, but then succeed when retried on reaching the timeout. This incident prompted us to set up a separate TCP connect timeout for a timely retry when no connection can be established at all.

Yes, fine, a client can retry immediately after a connect timeout, and connect timeouts can be short.

Another is individual HHVM nodes exhausting their threads, or being close to OOM. This was fairly common in the early days of HHVM, but still occurs occasionally.

That just sounds like an overload. I think my previous arguments apply.

Load balancing is not perfect, and the PHP API exposes fairly expensive end points that can tie up threads for a long time.

LVS deprioritises a node before it even sends the HTTP request, so it shouldn't be possible for a node to have a high priority while it is handling more than its fair share of concurrent HTTP requests. If there is a long-running API request, then the server will be deprioritised for a long time. If you assume that unexpectedly high latency is the result of CPU time sharing, and that each active request uses a fair share of CPU (e.g. a single thread), it should be near enough to perfect.

Server-wide issues such as swapping and broken disk drives, and load generated outside of LVS, such as via puppet, could have the effect of a node getting more traffic than is desirable. But the client has a very limited view of the state of the cluster, and can easily make things worse by wasting work. You could abort and retry a request in the event of swapping, but what happens when all nodes are swapping? Pybal, on the other hand, can monitor nodes for unusual slowness and can depool them individually; it knows that it is not meant to depool the whole cluster in the event of a cluster-wide overload.

It is true that work is wasted, but it is easy to waste a lot more time without timeouts. Clients don't wait for servers indefinitely. Browsers typically time out the connection after 300 seconds, and users will often move on before then. A request that times out after 12 minutes (old Zend timeout) thus ends up wasting 12 minutes of CPU time without making any progress, in addition to inviting a DOS. An operation limited to 60 seconds would have to be retried 12 times to waste the same amount of CPU time, and such requests are easier to limit based on rates.

Like I said, server timeouts are a very crude approach to DoS prevention. They are better than nothing -- instead of allocating 100% of CPU to the DoS, maybe you can allocate 99%. Per-client concurrency control is a better method, but it is not always feasible.

Tight timeouts also encourage us to bound processing times in the implementation by setting reasonable limits, designing for more iterative processing, or optimizing. We already have logs for 5xx responses, and can use this to systematically track down end points that take too long.

It seems irrational. Surely we can set performance goals without deliberately hurting ourselves. We can (and in several cases do) log requests that take more than a specified amount of time, e.g. fluorine:/a/mw-log/slow-parse.log

I'm not arguing for infinite timeouts, I'm just saying that I don't agree with all the reasons you set out here for reducing them, so we would probably each decide on different numbers. Also, I don't think the case for client retries (apart from connect retries) is general, I think it is specific to Parsoid.

Interesting discussion. Thank you everyone. I think there is a useful distinction that has emerged here: internal services like MySQL and Redis, for which response times are "somewhat predictable", versus internal services like the MW API (with endpoints like parse and expandtemplates), where response times are "not predictable".

By that, I mean that I can construct queries to the mediawiki API which take longer than any timeout value you can set. However, this is not true for MySQL, for example: with a properly tuned database, the queries that the application makes to the db are expected to fit a performance profile, and if a query falls outside that profile, you actively tune the db, modify the query, whatever. This is not true for the mediawiki API. When a query takes time beyond a timeout value, that is not necessarily a problem with the API; it is just the nature of what the API exposes.

For any internal users of the mediawiki API (like Parsoid), it is clear that if you want to guarantee successful parsing, you need to be able to negotiate a timeout value beyond whatever value has been set. As I argued in the other email thread, and as Tim is arguing here as well (and I think Gabriel agrees), with the reasonable assumption that transient failures of the mediawiki API are going to be rare, you can pick an initial timeout value that is high enough that most API requests will not need a retry. But, no matter what, you still need the ability to retry -- that is unavoidable. Plus, you cannot simply hardcode a cap on the # of retries. In ideal scenarios, 1 (or at most 2) retries are sufficient. You effectively cap the # of retries because (a) Parsoid kills requests taking longer than time T, (b) exponential (or whatever other mechanism) backoff for retry timeouts ensures that you are more likely to succeed, and beyond a point, this timeout value will exceed T, and (c) by picking appropriate values for T, the initial timeout value, and the exponential backoff mechanism, you can effectively cap retries at 1, 2, or 3 or whatever you want to enforce (and also tune it without hardcoding a specific value).

So, I think Tim is right that, as of now, this need for retries is limited to Parsoid because of its unique requirements and because of its reliance on the mediawiki API without a way to guarantee specific performance goals.

One way to guarantee that is to cap the complexity of pages, templates, and extensions (for whatever complexity measure we can come up with), which implicitly happens anyway because of timeout values. So, if the mediawiki API sets a CPU timeout value of T_API, and Parsoid sets a timeout value bigger than T_API, it is clear that the only reason to retry is transient errors => Parsoid can only handle those kinds of pathological pages where all the mediawiki API requests it makes complete within time T_API.

The other observation here is that on the Parsoid end, there is no way to distinguish between transient failures and failures due to API congestion (hopefully due to transient loads). Insofar as Parsoid needs to retry, it will retry, which could make matters worse in the latter case. But because of the exponential backoff timeout values, we effectively do enforce a retry cap. Is that sufficient to guarantee that runaway load can be avoided because of retries?

So, to summarize, here are some observations and questions that come up:

  • It is important to pick a suitable timeout value T_API for the mediawiki API that seems reasonable (right now, this is 290 seconds)
  • It is important for Parsoid also to pick a suitable initial timeout value T_Parsoid that guarantees most requests don't require retries (right now this is 30/40 secs)
  • If T_Parsoid > T_API, retries are not going to be helpful. To deal with transient errors, 1 retry is sufficient.
  • If T_Parsoid < T_API, Parsoid will benefit from retries as long as the ratcheting timeout values remain < T_API. The # of retries is effectively capped by this constraint and the max-request-time in Parsoid (right now this is 5 mins; the exponential backoff timeout is 30*3^n where n is the retry count, so we can do 2 retries at most -- see the sketch below this list).
  • Are there other services in the mediawiki ecosystem whose performance characteristics are more like the mediawiki API (i.e. determined by client queries) and less like MySQL?
  • Are there other service pairs, now or in the future, that will be like the mediawiki API (server) and Parsoid?
  • Is this mechanism sufficient to prevent the runaway load scenarios that we encountered?
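
A quick sketch of the retry-cap arithmetic referenced in the T_Parsoid < T_API bullet above, using only the numbers quoted there (30 s base timeout, backoff factor 3, 5-minute max request time):

```
// Sketch: how the 30*3^n backoff and the 5-minute max request time combine
// into an effective retry cap without hardcoding one.
const maxRequestTimeS = 300;                                  // Parsoid kills the overall request after 5 minutes
const retryTimeoutS = (n: number) => 30 * Math.pow(3, n);     // per-retry timeout, n = retry count

// A retry is only worth attempting while its timeout still fits in the budget.
const usefulRetries = [1, 2, 3, 4].filter(n => retryTimeoutS(n) < maxRequestTimeS);
console.log(usefulRetries);   // [1, 2] -> 90s and 270s fit, 810s does not: at most 2 retries
```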

There is one other thing that follows ... If you want to pick low concurrency in a client (Parsoid), you also need to pick a lower initial timeout value, so there is greater opportunity to retry and still have other waiting requests complete successfully. This relationship is not necessarily linear, but the rough observation holds. Barring a performance model that can capture this relationship (which we don't have), picking the right values may be a matter of some empirical testing.

this need for retries is limited to Parsoid

From an overload prevention strategy perspective, retries further down the stack make sense whenever the alternative would be higher-level retries with higher associated overall costs. This is true whenever there is a need for reliability, and the potentially-retried request represents a small portion of the overall request processing, in terms of IO, CPU or service requests.

Overall, what we are talking about here is establishing response time SLAs between services. A service relying on another service that does not keep to an SLA has no choice but to resort to client timeouts in order to meet its own response time target. Establishing the right timeouts and retry policies is tricky if the other end does not make any reasonable response time guarantees.

this need for retries is limited to Parsoid

From an overload prevention strategy perspective, retries further down the stack make sense whenever the alternative would be higher-level retries with higher associated overall costs. This is true whenever there is a need for reliability, and the potentially-retried request represents a small portion of the overall request processing, in terms of IO, CPU or service requests.

My beef with this perspective, though, is that it's not a matter of "the alternative would be higher-level retries". Most likely we're going to still have higher-level retries regardless, to protect the user's experience from random lower-layer intermittent issues. It's not an alternative to them, it's in addition to them, which means that if both are in effect (as would be the case when your lower-level retries still result in an ultimate failure), they're going to multiply with each other and help cause a storm of error activity at various layers of the stack.

Any conclusions to draw from this discussion? Do we need to explicitly invite certain people to weigh in?

@ArielGlenn, we discussed this last week at the IRC meeting. There was no clear consensus yet, but also no concrete proposal. The summary has one action item:

  • @mark to file a bug: let's start implementing HTTP error codes more closely to the RFC

TechCom basically approves; details will come from gerrit review. @tstarling will write down the recommended approach to this and T97206: RFC: Re-evaluate varnish-level request-restart behavior on 5xx on-wiki in Category:Development guidelines.

Change 231197 had a related patch set uploaded (by Ori.livneh):
Set maximum execution time to 60 seconds

https://gerrit.wikimedia.org/r/231197

Change 231197 merged by Ori.livneh:
Set maximum execution time to 60 seconds

https://gerrit.wikimedia.org/r/231197

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).