To avoid retries amplifying overload situations, we should adhere to the following rules in a client-server pair:
- Server request timeouts are set (slightly) shorter than client timeouts.
- When the server reaches the request timeout, it releases all resources associated with the request and responds with status code 503. If a retry is permissible, the response specifies the retry delay in a Retry-After header, for example Retry-After: 120 (a delay of 120 seconds).
- Clients follow HTTP semantics when receiving a response with status 503: a retry is permitted only if Retry-After is specified, and only after the given delay has elapsed.
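
The following is a minimal sketch of both sides of these rules, not a definitive implementation. The names `handle_with_timeout`, `retry_delay`, and the `retry_after` parameter are hypothetical and chosen for illustration. The server side assumes an asyncio-based handler. The client side only understands the delay-in-seconds form of Retry-After and treats the HTTP-date form as "do not retry".

```python
import asyncio
from typing import Awaitable, Callable, Dict, Mapping, Optional, Tuple

Response = Tuple[int, Dict[str, str], bytes]  # (status, headers, body)


async def handle_with_timeout(
    handler: Callable[[], Awaitable[bytes]],
    timeout: float,
    retry_after: Optional[int] = None,
) -> Response:
    """Server side: run `handler`, answering 503 if it exceeds `timeout`."""
    try:
        # wait_for cancels the handler on timeout, so its `finally` blocks
        # and context managers get a chance to release resources.
        body = await asyncio.wait_for(handler(), timeout)
    except asyncio.TimeoutError:
        headers: Dict[str, str] = {}
        if retry_after is not None:
            # Advertise a retry only when one is permissible.
            headers["Retry-After"] = str(retry_after)
        return 503, headers, b""
    return 200, {}, body


def retry_delay(status: int, headers: Mapping[str, str]) -> Optional[float]:
    """Client side: seconds to wait before retrying, or None for no retry."""
    if status != 503:
        return None
    # Header names are case-insensitive.
    value = next(
        (v for k, v in headers.items() if k.lower() == "retry-after"), None
    )
    if value is None:
        return None  # 503 without Retry-After: do not retry
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    return None  # HTTP-date form is not handled in this sketch
```

A client would call `retry_delay` on every response and retry only when it returns a number, sleeping for that long first.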
With multiple layered services, this results in staggered timeouts, with the lowest-level service using the shortest timeout. Because each client's timeout is longer than its server's, the client normally receives the server's response before giving up. It can then check for a 503 status and, if no Retry-After header is present, skip the retry altogether.
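
As a rough illustration of this staggering, the sketch below derives one timeout per layer from the outermost client timeout. The function name `staggered_timeouts` and the fixed per-layer `margin` are assumptions made for this example; in practice the margin would need to cover network latency and response handling at each hop.

```python
from typing import List


def staggered_timeouts(outer_timeout: float, layers: int, margin: float) -> List[float]:
    """Return timeouts from the outermost caller (index 0) down to the
    lowest-level service (last index), each shorter than the one above it."""
    if layers < 1:
        raise ValueError("at least one layer is required")
    if margin <= 0:
        raise ValueError("margin must be positive so timeouts actually stagger")
    timeouts = [outer_timeout - i * margin for i in range(layers)]
    if timeouts[-1] <= 0:
        raise ValueError("outer timeout too short for this many layers")
    return timeouts
```

For example, `staggered_timeouts(10.0, 3, 0.5)` gives the client 10.0 seconds, the middle service 9.5 seconds, and the lowest-level service 9.0 seconds.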
We should also aim not to expose any API endpoints with timeouts longer than 60 seconds. This won't be possible immediately, but we should eliminate exceptions step by step. Most endpoints should have timeouts significantly below 60 seconds, with large tasks split up via paging or other client-side iteration.
With tight backend timeouts, clients can detect a hanging backend by the absence of a timely 503 response before the client timeout is reached. In most situations the percentage of unhealthy nodes is low, so a retried request has a high chance of being routed to a healthy backend and processed successfully. Retries after client timeouts should:
- be limited to 2 retries,
- use increasing timeouts with a fuzz factor.
Assuming a low percentage of hanging backends, two retries give a very high probability that at least one request is routed to a healthy backend. For example, if 5% of backends hang and requests are routed independently, the chance that the original request and both retries all hit a hanging backend is 0.05^3, or 0.0125%. Increasing timeouts allow for overload situations in the backend, in which server-side timeouts may not trigger in a timely manner. They also push later retries past the point where temporary backend issues may have been resolved. Adding some randomness to the timeouts avoids a 'thundering herd' scenario, in which many clients retry at the same moment. An additional delay before retrying can help further, although given the limited overall time available for processing the request, it may be preferable to work with timeouts only.
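
The retry rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed implementation: the names `attempt_timeouts` and `call_with_retries`, the growth factor of 2, and the fuzz of plus or minus 10% are all hypothetical choices, and `send` stands in for whatever function issues the request and raises `TimeoutError` when the client timeout is reached.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")
MAX_RETRIES = 2  # at most two retries after the initial attempt


def attempt_timeouts(
    base_timeout: float,
    growth: float = 2.0,   # assumed growth factor between attempts
    fuzz: float = 0.1,     # assumed fuzz: each timeout varies by +/- 10%
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one timeout per attempt, increasing and randomly fuzzed."""
    rng = rng or random.Random()
    timeouts = []
    for attempt in range(MAX_RETRIES + 1):
        nominal = base_timeout * growth ** attempt
        timeouts.append(nominal * (1 + rng.uniform(-fuzz, fuzz)))
    return timeouts


def call_with_retries(
    send: Callable[[float], T],
    base_timeout: float,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `send(timeout)`, retrying only on client timeouts."""
    timeouts = attempt_timeouts(base_timeout, rng=rng)
    for i, timeout in enumerate(timeouts):
        try:
            return send(timeout)
        except TimeoutError:
            if i == len(timeouts) - 1:
                raise  # retries exhausted: surface the timeout to the caller
    raise AssertionError("unreachable: the loop always returns or raises")
```

With these defaults the fuzzed timeouts still increase strictly, since the largest possible value for one attempt (1.1 times nominal) stays below the smallest possible value for the next (0.9 times double the nominal). No extra delay is inserted between attempts, in line with the preference for working with timeouts only.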