
Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki
Closed, ResolvedPublic

Description

At high traffic rates, any long timeout on a common operation can grind a whole cluster to a halt if that operation becomes too slow.

We've recently seen several production incidents where such timeouts were problematic, and we need a better way to manage both connection and overall request timeouts in high-traffic situations.

Connection timeout

When performing an HTTP request to a destination that is unreachable (that is, one that does not respond to TCP connection requests), even a 1-second connection timeout can cause large-scale issues, in particular when we use RESTBagOStuff for very common operations: all of the production workers can end up busy waiting for the connection timeout to expire. The same applies to the request timeout.

Request timeout

Given how PHP works, every external request that has not finished keeps a PHP worker busy. External requests lasting longer than a relatively aggressive timeout can easily pile up and consume all the available PHP workers in a short span of time. Being able to put a global cap on the request timeout is a coarse-grained way to reduce the risk of such domino effects, where slowness in one service causes another to grind to a halt.

Ideally, we should also be able to set a total concurrency limit on how many requests we make to an external service, so that we avoid overwhelming it once it is already slowing down.

The status quo

There are several ways of making an HTTP request from MediaWiki:

  • The Http class (deprecated). It supports setting default timeouts via $wgHTTPConnectTimeout and $wgHTTPTimeout when used via Http::createMultiClient().
  • MultiHttpClient, which does not honour $wgHTTPConnectTimeout and $wgHTTPTimeout.
  • Instantiating a subclass of MWHttpRequest via HttpRequestFactory. These should all honour the global variables. There are various ways of instantiating these classes, some of them deprecated, but as I understand it the mechanism is the same.

What I'd like to see changed
First of all, I would like MultiHttpClient to honour these globals, and I would like to add two more globals controlling the *maximum* connect/request timeouts we accept globally, so that if code sets a longer timeout than the global maximum, the maximum is used instead. In pseudocode:

timeout = min(user_timeout, max_global_timeout)

This would allow us to put a global cap on the timeouts that we deem acceptable in production, and also allow us to cut down all the timeouts in an emergency. This would've been extremely handy during the past week's incidents.
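
A minimal sketch of that clamping in PHP (names are illustrative, not the eventual implementation):

```php
// Illustrative only: clamp a caller-supplied timeout to a configured
// global maximum; a falsy maximum means "no cap".
function clampTimeout( float $userTimeout, float $maxGlobalTimeout ): float {
	if ( !$maxGlobalTimeout ) {
		return $userTimeout;
	}
	return min( $userTimeout, $maxGlobalTimeout );
}
```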

It would also be great if we were able to set concurrency limits on the number of HTTP requests we can perform at the same time, but I suspect that is much harder to implement, and we might decide to use an external proxy system to achieve it.

Details

Project | Branch | Lines +/-
operations/mediawiki-config | master | +15 -9
mediawiki/core | master | +3 -6
mediawiki/extensions/TEI | master | +5 -2
mediawiki/core | master | +8 -14
operations/mediawiki-config | master | +12 -4
mediawiki/extensions/Translate | master | +3 -1
mediawiki/extensions/CirrusSearch | master | +1 -1
mediawiki/extensions/Flow | master | +3 -2
mediawiki/extensions/Collection | master | +6 -4
mediawiki/extensions/VEForAll | master | +2 -2
mediawiki/extensions/EventBus | master | +1 -1
mediawiki/extensions/MathSearch | master | +5 -1
mediawiki/extensions/Math | master | +3 -1
mediawiki/extensions/VisualEditor | master | +2 -1
mediawiki/extensions/Echo | master | +2 -1
mediawiki/extensions/ContentTranslation | master | +5 -1
mediawiki/extensions/FileAnnotations | master | +9 -5
mediawiki/core | master | +436 -39

Event Timeline

WDoranWMF moved this task from Inbox to Triage Meeting Inbox on the Platform Engineering board.
WDoranWMF added a subscriber: WDoranWMF.

Moving this to feature requests for PMs to review; we'll need to investigate what appropriate limits would be and how they should be tailored to requesters' needs in order to move forward.

> Moving this to feature requests for PMs to review; we'll need to investigate what appropriate limits would be and how they should be tailored to requesters' needs in order to move forward.

Hi @WDoranWMF, sorry I didn't see your message.

I don't think the limits should be decided in this task; those will be determined separately, probably based on operational parameters. This task is about implementing two global variables that cap the connect timeout and the request timeout for all HTTP calls from MediaWiki.

AMooney raised the priority of this task from High to Unbreak Now!.May 4 2020, 5:03 PM
AMooney lowered the priority of this task from Unbreak Now! to High.May 6 2020, 5:06 PM

There is Http::createMultiClient(), which respects $wgHTTPConnectTimeout and $wgHTTPTimeout, but it was deprecated in 1.34 in favour of direct construction. MultiHttpClient is a pseudo-library (in includes/libs) and so if it is created by direct construction, it should not be accessing global configuration. I see in codesearch that a lot of extensions are creating a MultiHttpClient with empty options, which means it will use the catastrophically bad defaults of 10s connect timeout and 900s (15 min) request timeout.

To implement this task, construction of MultiHttpClient will be moved back to a factory, and all extensions will be migrated to use the factory. The factory will derive the correct timeout from configuration and call parameters as described in the task description.

For consistency and for conformance to modern code standards, timeouts should also be injected from the HttpRequestFactory service to MWHttpRequest. Currently MWHttpRequest accesses the configuration globals directly.

Given that the MultiHttpClient factory will have the same timeout configuration logic as HttpRequestFactory, it would make sense to me to have HttpRequestFactory be the factory for MultiHttpClient. So we would have, say, HttpRequestFactory::getMultiClient().
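
To illustrate, a rough sketch of the shape such a method could take (the ServiceOptions keys, option names and body here are assumptions, not a finished implementation):

```php
// Hypothetical sketch: HttpRequestFactory acting as the factory for
// MultiHttpClient, deriving default timeouts and caps from site config.
public function getMultiClient( array $options = [] ): MultiHttpClient {
	$options += [
		'connTimeout' => $this->options->get( 'HTTPConnectTimeout' ),
		'reqTimeout' => $this->options->get( 'HTTPTimeout' ),
	];
	// Hand the caps to the client itself, so per-request overrides
	// in run()/runMulti() can also be clamped (see below).
	$options['maxConnTimeout'] = $this->options->get( 'HTTPMaxConnectTimeout' );
	$options['maxReqTimeout'] = $this->options->get( 'HTTPMaxRequestTimeout' );
	return new MultiHttpClient( $options );
}
```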

I'd like to know if @BPirkle (the assigned reviewer for this work) approves of this design.

@tstarling, that sounds good to me.

There will be no technical barrier to directly constructing MultiHttpClient, right? We'll need to be cognizant of this in future code reviews, and redirect anyone who does direct construction without a compelling reason to the factory.

Should we also change the 10-second connection timeout and 900-second request timeout to something less impactful? If any particular piece of code has a legitimate reason for longer timeouts, that can be handled on a case-by-case basis. Maybe that gets done under a separate task (per @Joe's comment, it may be better to discuss actual timeout values elsewhere), but I mention the possibility for completeness.

I notice that the 900 seconds used to be 300 seconds (still very long for most purposes), but was increased in https://phabricator.wikimedia.org/T226979. For full disclosure, I +2'd that change: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/520135/

> @tstarling, that sounds good to me.

> There will be no technical barrier to directly constructing MultiHttpClient, right? We'll need to be cognizant of this in future code reviews, and redirect anyone who does direct construction without a compelling reason to the factory.

Maybe we can disallow it in phpcs.

> Should we also change the 10-second connection timeout and 900-second request timeout to something less impactful? If any particular piece of code has a legitimate reason for longer timeouts, that can be handled on a case-by-case basis. Maybe that gets done under a separate task (per @Joe's comment, it may be better to discuss actual timeout values elsewhere), but I mention the possibility for completeness.

> I notice that the 900 seconds used to be 300 seconds (still very long for most purposes), but was increased in https://phabricator.wikimedia.org/T226979. For full disclosure, I +2'd that change: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/520135/

I imagine that the new global ($wgHTTPMaxRequestTimeout) will only be set for web requests. For CLI scripts and jobs it makes less sense.

That task reminds me that I haven't considered per-request options. MultiHttpClient::run() and runMulti() take their own connTimeout and reqTimeout options which override the constructor defaults. Respecting $wgHTTPMaxRequestTimeout when those parameters are passed will require passing the max timeout to the MultiHttpClient constructor for enforcement in runMulti(). It's not enough to implement the logic in HttpRequestFactory.
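
A sketch of that enforcement inside MultiHttpClient, assuming the maximums are stored from constructor options (names are illustrative):

```php
// Hypothetical sketch: normalize per-request options inside
// run()/runMulti(), clamping caller overrides to the configured caps.
private function normalizeTimeouts( array $opts ): array {
	$opts['connTimeout'] = min(
		$opts['connTimeout'] ?? $this->connTimeout,
		$this->maxConnTimeout
	);
	$opts['reqTimeout'] = min(
		$opts['reqTimeout'] ?? $this->reqTimeout,
		$this->maxReqTimeout
	);
	return $opts;
}
```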

I'll write up an alternative way to fix T226979 on the task.

That all sounds like a win to me.

Change 596538 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Explicitly set SwiftFileBackend timeouts

https://gerrit.wikimedia.org/r/596538

Change 596539 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] [WIP] Respect configured default HTTP timeouts, and introduce max timeouts

https://gerrit.wikimedia.org/r/596539

Caller survey:

  • ServiceWiring.php VirtualRESTServiceClient
    • Although it makes a MultiHttpClient with empty parameters, it has per-path configuration, including "global" defaults. The "global" default is a 360s request timeout, which is very generous.
    • PageHTMLHandler: Hopefully not intended for production? Probably needs ~30s request timeout due to slow Parsoid backend. Can have short connect timeout.
  • EtcdConfig: has its own timeout config
  • CdnCacheUpdate: Not used in production. Should have a very short timeout, currently unconfigured (900s).
  • SwiftFileBackend: I submitted 596535 as an intermediate step. The problem with using HttpRequestFactory here is that SwiftFileBackend is a pseudo-library. FileBackendGroup is on the MW side of the boundary, but is generic and so has no concept of HTTP options. How is MW meant to enforce a timeout policy when the admin is directly configuring a pseudo-library, without involvement from MW? I think the architecture of FileBackend in general needs to be reconsidered. In the meantime, we could add maxReqTimeout and maxConnTimeout options to SwiftFileBackend and have WMF config copy the globals into $wgFileBackends (a config sketch follows this list).
  • RESTBagOStuff: has its own config, and its own defaults of 1.2/3.0. A pseudo-library which thus suffers from the same problems as SwiftFileBackend.
  • CirrusSearch: has its own timeout config, defaulting to 5/10.
  • EventBus: Although MultiHttpClient is constructed with empty parameters, it sets per-request timeouts based on route config. Production routes have 5s or 10s timeout.
  • Collection: Applies its own interpretation of $wgVirtualRestConfig as documented at T175224. The "global" default of 360s request timeout is apparently applied.
  • ContentTranslation: Also applies its own interpretation of $wgVirtualRestConfig in order to construct a VirtualRESTService. Also probably uses the "global" default of 360s.
  • Echo\ForeignWikiRequest: Apparently has no timeout configuration and will use MultiHttpClient's defaults of 10/900.
  • FileAnnotations: A simple case which would benefit from this task. Currently using 10/900.
  • Flow: Another extension struggling to access RESTBase/Parsoid with VirtualRESTService. It has $wgFlowParsoidTimeout which is 50s in production.
  • Math: Directly accesses $wgVirtualRestConfig, apparently fails to apply the "global" key, and does not otherwise set timeouts, so it's probably 10/900.
  • MathSearch: Uses VirtualRESTServiceClient as a thin wrapper around a default MultiHttpClient, does not use $wgVirtualRestConfig. Uses default of 10/900.
  • TEI: MultiHttpClient( [] ) 10/900
  • Translate webservices/QueryAggregator: Has per-request timeout conf. Connect timeout is always 3.
  • VEForAll: Probably 360 per $wgVirtualRestConfig['global']
  • VisualEditor: Probably 360 per $wgVirtualRestConfig['global']
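
Returning to the SwiftFileBackend item above, a sketch of how WMF config could copy the caps into $wgFileBackends (the maxConnTimeout/maxReqTimeout options and the global names are the proposal above, not merged code):

```php
// Hypothetical wmf-config sketch: SwiftFileBackend is a pseudo-library
// and cannot read global configuration itself, so copy the caps into
// each Swift backend's $wgFileBackends entry.
foreach ( $wgFileBackends as &$backend ) {
	if ( ( $backend['class'] ?? '' ) === 'SwiftFileBackend' ) {
		$backend['maxConnTimeout'] = $wgHTTPMaxConnectTimeout;
		$backend['maxReqTimeout'] = $wgHTTPMaxRequestTimeout;
	}
}
unset( $backend );
```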

I didn't know about the situation with VirtualRESTService and $wgVirtualRestConfig. It's ugly and it's everywhere. It's a big can of worms.

Plan:

  • Of the pseudo-libraries, EtcdConfig and RESTBagOStuff have short default timeouts and so are of low concern for production. They can stay as they are. EtcdConfig is loaded too early to use global config anyway.
  • SwiftFileBackend has a concerning mix of high timeouts, high traffic and an inability to use ServiceWiring. My config change, which @aaron merged, allows it to be configured separately instead of using 10/900 unconditionally. So the way it gets timeout configuration is now similar to EtcdConfig and RESTBagOStuff. Its remaining problems can be split out to a separate task.
  • Everything else can use MediaWikiServices::getHttpRequestFactory()->createMultiClient(). That includes things that currently use VirtualRESTServiceClient. Most things that use VirtualRESTServiceClient will override $wgHTTPTimeout, using 360 instead, but they will respect the maximum timeouts. (The migration pattern is sketched after this list.)
  • This change will result in the following things having their request timeout reduced from 900s to 25s. In all cases this appears to be beneficial:
    • CdnCacheUpdate
    • Echo\ForeignWikiRequest
    • FileAnnotations
    • Math
    • MathSearch
    • TEI
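
For reference, the migration applied in the patches below boils down to this pattern (the 30-second value is illustrative):

```php
use MediaWiki\MediaWikiServices;

// Before: direct construction, silently inheriting the defaults of
// 10s connect / 900s request timeout.
// $client = new MultiHttpClient( [] );

// After: centrally configured defaults, with the global maximums
// enforced even if a caller asks for more.
$client = MediaWikiServices::getInstance()
	->getHttpRequestFactory()
	->createMultiClient( [ 'reqTimeout' => 30 ] );
```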

There is the question of whether to respect $wgHTTPProxy, as Http::createMultiClient() did. It doesn't matter for production since it is empty there. In general you would expect a proxy to be useful when accessing the external web, but not so much when accessing internal services. Most callers are accessing internal services.

The "global" VirtualRESTService timeout of 360 (6 minutes) is an extremely generous timeout which only really makes sense for write requests in jobs and CLI scripts. I think it would make more sense to reduce the global default and to figure out which, if any, paths or call sites really need such a long timeout.

> SwiftFileBackend has a concerning mix of high timeouts, high traffic and an inability to use ServiceWiring.

Actually not so high traffic -- on the order of 14 req/s, based on 1 hour of ms-fe1006's access log.

Change 597671 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Use HttpRequestFactory::createMultiClient() in core where possible

https://gerrit.wikimedia.org/r/597671

Change 597672 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/CirrusSearch@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597672

Change 597673 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Collection@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597673

Change 597674 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/ContentTranslation@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597674

Change 597675 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Echo@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597675

Change 597676 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/EventBus@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597676

Change 597677 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/FileAnnotations@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597677

Change 597678 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Flow@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597678

Change 597679 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Math@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597679

Change 597680 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/MathSearch@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597680

Change 597681 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/TEI@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597681

Change 597682 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Translate@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597682

Change 597683 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/VEForAll@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597683

Change 597684 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/VisualEditor@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597684

Change 596539 merged by jenkins-bot:
[mediawiki/core@master] Respect configured default HTTP timeouts, and introduce max timeouts

https://gerrit.wikimedia.org/r/596539

Change 597677 merged by jenkins-bot:
[mediawiki/extensions/FileAnnotations@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597677

Change 597674 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597674

Change 597675 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597675

Change 597679 merged by jenkins-bot:
[mediawiki/extensions/Math@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597679

Change 597680 merged by jenkins-bot:
[mediawiki/extensions/MathSearch@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597680

Change 597682 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597682

Change 597684 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597684

Change 597676 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597676

Change 597673 merged by jenkins-bot:
[mediawiki/extensions/Collection@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597673

Change 597683 merged by jenkins-bot:
[mediawiki/extensions/VEForAll@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597683

Change 597672 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597672

Change 597678 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597678

To do:

  • Deploy the SwiftFileBackend config change, maintaining it at 900
  • Reduce MultiHttpClient default. Maybe to 30?
  • Write and deploy a config patch for max timeout. Should it match the PHP execution time limits in set-time-limit.php? (A hypothetical shape is sketched after this list.)
  • Convince someone involved to comment on the 360s "global" timeout for $wgVirtualRestConfig
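
On the config-patch item above, one hypothetical shape for it, tying the cap to the PHP request time limit (all names and values here are illustrative; $wmgExcimerTimeout is a stand-in for the excimer limit exposed by set-time-limit.php):

```php
// Hypothetical wmf-config sketch: cap HTTP client timeouts so that a
// slow backend cannot outlive the web request waiting on it.
$wgHTTPMaxConnectTimeout = 5;
$wgHTTPMaxRequestTimeout = min( 60, $wmgExcimerTimeout );
```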

Change 596538 merged by jenkins-bot:
[operations/mediawiki-config@master] Explicitly set SwiftFileBackend timeouts

https://gerrit.wikimedia.org/r/596538

Change 597671 merged by jenkins-bot:
[mediawiki/core@master] Use HttpRequestFactory::createMultiClient() in core where possible

https://gerrit.wikimedia.org/r/597671

Change 597681 merged by jenkins-bot:
[mediawiki/extensions/TEI@master] Use HttpRequestFactory::createMultiClient()

https://gerrit.wikimedia.org/r/597681

What do you think about implementing a bot that checks that Depends-On changes also increase the respective version number in extension.json?
For instance, the Math extension currently requires "MediaWiki": ">= 1.32.0".
In https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Math/+/597679/, which was merged 3 hours after it was published (which gives extension maintainers little time to look at the change), the effective dependency was changed to 1.35...
I discovered that by chance, since it caused problems with my server: all non-cached pages displayed an internal error.
While it is not a big issue for me, it might be for people who do not know what to do when they see internal errors.
@Mglaser is this a concern for MediaWiki Stakeholders, or am I seeing a problem here that does not exist?

PS: Updating all dependencies to the latest version solved the problem on my server https://github.com/physikerwelt/mediawiki-docker/commit/42ee7bb8bcd35ae962dc9d9a8a2f4db0d6728c1f

I'll take that as a reminder to check the MediaWiki version in extension.json when I update extensions; that is a fair request.

As for the error: in general, the release branch of an extension should be used. For example, if you have MediaWiki 1.32, the REL1_32 branch of extensions/Math should be used. That way you will only receive backported updates. For extensions deployed to WMF production, there are weekly release branches, so a bleeding-edge production deployment can theoretically be supported by following WMF. Or you can make your own release branches.

Extensions which wish to support multiple versions of core in their master branch should have compatibility=master in their {{extension}} invocation on their mediawiki.org extension page, although I would prefer the number of such extensions be kept to an absolute minimum, since they make certain kinds of core changes (like this one) impossible.

Change 605439 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] MultiHttpClient: Reduce the default timeout from 900 to 30

https://gerrit.wikimedia.org/r/605439

Change 605440 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Set a maximum HTTP client timeout

https://gerrit.wikimedia.org/r/605440

Change 605439 merged by jenkins-bot:
[mediawiki/core@master] MultiHttpClient: Reduce the default timeout from 900 to 30

https://gerrit.wikimedia.org/r/605439

Change 605440 merged by jenkins-bot:
[operations/mediawiki-config@master] Set a maximum HTTP client timeout

https://gerrit.wikimedia.org/r/605440

Mentioned in SAL (#wikimedia-operations) [2020-06-16T00:25:37Z] <tstarling@deploy1001> Synchronized wmf-config/set-time-limit.php: expose excimer timeout as a global variable T245170 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2020-06-16T00:28:35Z] <tstarling@deploy1001> Synchronized wmf-config/CommonSettings.php: limit HTTP client timeout T245170 (duration: 00m 56s)

All done, I think.