At high traffic rates, a long timeout on a commonly performed operation can grind a whole cluster to a halt if that operation becomes slow.
We have recently seen several production incidents where such timeouts were a problem, and we need a better way to manage both connection and overall request timeouts in high-traffic situations.
Connection timeout
When performing an HTTP request to a destination that is unreachable (that is, one that doesn't respond to TCP connection attempts), even a 1-second connection timeout can cause large-scale issues, in particular when we use RestBagOfStuff for very common operations: waiting for connections can leave all of the production workers busy waiting for the connection timeout to expire. The same applies to the request timeout itself.
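For clarity, the two timeouts bound different phases of the request. A minimal plain-cURL sketch (the URL and the values are made up for illustration, not what we run in production):

```
<?php
// Connection timeout bounds only the TCP/TLS connect phase; the request
// timeout bounds the whole request, body transfer included.
$ch = curl_init( 'https://unreachable.example.org/api' );
curl_setopt_array( $ch, [
	CURLOPT_RETURNTRANSFER    => true,
	CURLOPT_CONNECTTIMEOUT_MS => 1000, // give up connecting after 1 s
	CURLOPT_TIMEOUT_MS        => 3000, // give up on the whole request after 3 s
] );
$body = curl_exec( $ch );
if ( $body === false ) {
	// With an unreachable host it is the connect timeout that fires, and the
	// worker has been blocked for that whole second doing nothing useful.
	error_log( 'Request failed: ' . curl_error( $ch ) );
}
curl_close( $ch );
```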
Request timeout
Given how PHP works, every external request that has not finished keeps a PHP worker busy. External requests lasting longer than a relatively aggressive timeout can easily pile up and consume all the available PHP workers in a short span of time. Being able to put a global cap on request timeouts is a coarse-grained way to reduce the risk of such domino effects, where slowness in one service causes another to grind to a halt.
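To make the pile-up concrete (hypothetical numbers): by Little's law, if a backend that normally answers in 50 ms degrades to the point of hitting a 25 s timeout on every call, a cluster sending it 200 requests per second will have about 200 × 25 = 5000 workers tied up on that one backend, versus roughly 10 during normal operation.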
Ideally, we should also be able to set a total concurrency limit on how many requests we make to an external service, so that we avoid overwhelming it once it starts slowing down.
The status quo
There are several ways of making an HTTP request from MediaWiki:
- The Http class (deprecated). Supports setting a default timeout via wgHTTPConnectTimeout and wgHTTPRequestTimeout when used via Http::createMultiClient
- MultiHttpClient, which does not honour wgHTTPConnectTimeout and wgHTTPRequestTimeout
- Instantiating a subclass of MWHttpRequest via HttpRequestFactory. These should all honour the global variables. There are various ways of instantiating these classes, some of which are deprecated, but AIUI the mechanism is the same (see the sketch below)
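To make the differences between the three paths concrete, a rough sketch follows; the option names ('connTimeout', 'reqTimeout', 'timeout', 'connectTimeout') are quoted from memory and worth double-checking against the actual classes:

```
<?php
use MediaWiki\MediaWikiServices;

// 1) Deprecated static helper: picks up the global default timeouts.
$body = Http::get( 'https://example.org/api', [ 'timeout' => 'default' ] );

// 2) MultiHttpClient instantiated directly: timeouts come only from the
//    options passed here; the globals are not consulted.
$client = new MultiHttpClient( [
	'connTimeout' => 1, // seconds
	'reqTimeout'  => 5,
] );
$response = $client->run( [ 'method' => 'GET', 'url' => 'https://example.org/api' ] );

// 3) MWHttpRequest subclasses via HttpRequestFactory: default to the globals,
//    which callers can override per request.
$req = MediaWikiServices::getInstance()->getHttpRequestFactory()->create(
	'https://example.org/api',
	[ 'timeout' => 25, 'connectTimeout' => 1 ],
	__METHOD__
);
$status = $req->execute();
```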
What I'd like to see changed
First of all, I would like MultiHttpClient to honour the globals, and I would like to add two more globals controlling the *maximum* connect/request timeout we accept globally, so that if the code sets a timeout longer than this global value, the global maximum is used instead. In pseudocode:
timeout = min(user_timeout, max_global_timeout)
This would allow us to put a global cap on the timeouts we deem acceptable in production, and also to cut down all timeouts at once in an emergency. It would've been extremely handy during the past week's incidents.
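A minimal sketch of what that cap could look like at the point where a timeout is resolved; $wgHTTPMaxRequestTimeout and $wgHTTPMaxConnectTimeout are hypothetical placeholders for the two new globals, not settings that exist today:

```
<?php
// Hypothetical helper: clamp a caller-supplied timeout to a cluster-wide maximum.
// A maximum of 0 (or unset) is treated as "no cap".
function capTimeout( $requested, $globalMax ) {
	if ( !$globalMax ) {
		return $requested;
	}
	return min( $requested, $globalMax );
}

// Wherever a request is about to be issued (hypothetical globals):
$reqTimeout = capTimeout(
	$options['reqTimeout'] ?? $wgHTTPRequestTimeout,
	$wgHTTPMaxRequestTimeout
);
$connTimeout = capTimeout(
	$options['connTimeout'] ?? $wgHTTPConnectTimeout,
	$wgHTTPMaxConnectTimeout
);
```

Lowering the two max globals in an emergency would then immediately shorten every outbound timeout, regardless of what individual callers asked for.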
It would also be great if we were able to set concurrency limits on the number of HTTP requests we can perform at the same time, but I suspect that's much harder to implement, and we might decide to use an external proxy system to achieve that instead.
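For completeness, a naive per-server sketch of such a limit using APCu counters; a real solution would need cluster-wide coordination (PoolCounter, or the external proxy mentioned above), so this is only to illustrate the idea:

```
<?php
// Per-server concurrency cap: refuse to start a request to $service if more
// than $limit requests to it are already in flight on this server.
function acquireSlot( string $service, int $limit ): bool {
	$key = "http-concurrency:$service";
	apcu_add( $key, 0 ); // initialise the counter if it does not exist yet
	if ( apcu_inc( $key ) > $limit ) {
		apcu_dec( $key ); // over the limit: release the slot and fail fast
		return false;
	}
	return true;
}

function releaseSlot( string $service ): void {
	apcu_dec( "http-concurrency:$service" );
}

// Usage around an outbound call, with a hypothetical limit of 50:
if ( acquireSlot( 'restbase', 50 ) ) {
	try {
		// ... perform the HTTP request ...
	} finally {
		releaseSlot( 'restbase' );
	}
} else {
	// Fail fast instead of queueing more workers behind a slow service.
}
```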