At the moment, the WMF cluster has several call chains between services, and these have caused a number of issues and overloads, such as the one described in this task's parent task.
Specifically, one of the longest "chain reactions" starts with an event in MediaWiki, which causes ChangePropagation to call RESTBase, RESTBase to call Parsoid, and Parsoid to call the MediaWiki API.
If any of these components fails or takes a long time to respond, its caller might abort the request, but that doesn't necessarily mean the request is aborted server-side and processing stops, as timeouts can differ widely between the caller and the callee.
Also, while ChangePropagation has its own concurrency limiting system (which we should review anyway), the retry-on-failure policies in both RESTBase and Parsoid should be checked and tuned.
A similar audit will be needed for all other services that call other services in a cascade, including MediaWiki.
This is tangentially related to T97192.
## Audit data
ChangeProp
- Concurrency limits:
- Request timeout: 7 minutes by default for requests to RESTBase, 2 minutes for direct requests to MW API, configurable per rule
- Request retry policy: No immediate retry; up to two delayed retries by default, for statuses that depend on the rule. Normally only 5xx responses are retried, but some rules differ depending on context. For example, for revision creation a 404 is also retried to account for DB replication lag. Default delays: first retry after 1 minute, the next after 7 minutes.
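
To make the ChangeProp retry policy above concrete, here is a minimal Python sketch of the decision it describes. It is not ChangeProp's actual code: the function name `next_retry_delay`, the `extra_retry_statuses` parameter, and the reading of "1 minute" and "7 minutes" as the wait before each retry are all assumptions for illustration.

```python
from typing import FrozenSet, Optional

# Waits before the first and second delayed retry, per the defaults listed above.
# Assumption: each value is the wait before that retry is attempted.
DEFAULT_RETRY_DELAYS_S = (60, 7 * 60)


def next_retry_delay(
    status: int,
    retries_done: int,
    extra_retry_statuses: FrozenSet[int] = frozenset(),
) -> Optional[int]:
    """Return the seconds to wait before the next retry, or None to give up.

    `extra_retry_statuses` is a hypothetical per-rule override, e.g. {404}
    for revision-creation rules that must tolerate DB replication lag.
    """
    if retries_done >= len(DEFAULT_RETRY_DELAYS_S):
        return None  # default budget of two delayed retries is exhausted
    retryable = 500 <= status <= 599 or status in extra_retry_statuses
    if not retryable:
        return None
    return DEFAULT_RETRY_DELAYS_S[retries_done]
```

With these defaults, a 503 is retried after 60 seconds and then after 420 seconds, while a 404 is retried only when the rule opts in.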
RESTBase
- Response timeout: 6 minutes
- Subrequest timeout: 2 minutes; timeout doubled + fuzzed on retry.
- Subrequest retry policy: Only retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.
Parsoid
- Response timeout:
- Subrequest timeout (PHP API): 60 seconds
- Subrequest retry policy: Only retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.
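
The subrequest retry policy listed for RESTBase and Parsoid can be sketched as follows. This is an illustrative Python sketch, not the services' real implementation: the function names, the exact back-off formula, and the 10% fuzz factor are assumptions, since the audit only says "~500ms plus exponential back-off" and "doubled + fuzzed".

```python
import random
from typing import Optional

MAX_RETRIES = 1      # "only retry once"
BASE_DELAY_S = 0.5   # "~500ms"


def should_retry(timed_out: bool, status: Optional[int],
                 retry_after_s: Optional[float], retries_done: int) -> bool:
    """Retry only on a timeout, or on a 503 that carries a retry-after value."""
    if retries_done >= MAX_RETRIES:
        return False
    return timed_out or (status == 503 and retry_after_s is not None)


def retry_delay_s(retry_after_s: Optional[float], retries_done: int) -> float:
    """Honor retry-after when present; otherwise back off exponentially."""
    if retry_after_s is not None:
        return retry_after_s
    # Assumed back-off shape: base delay scaled by 2 ** retries_done.
    return BASE_DELAY_S * (2 ** retries_done)


def retry_timeout_s(timeout_s: float, rng: random.Random,
                    fuzz: float = 0.1) -> float:
    """Double the timeout for the retry, then fuzz it by +/- `fuzz` (assumed)."""
    return 2 * timeout_s * (1 + rng.uniform(-fuzz, fuzz))
```

Under these assumptions, a RESTBase subrequest that times out after 2 minutes would be retried once with a timeout of roughly 4 minutes, so a single slow request can occupy the chain for around 6 minutes in total.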
MediaWiki API
- Response timeout: Nominally 60s, but timeout does not seem to be working (see T97192).
## Issues found
- ChangeProp request timeout lower than RESTBase's response timeout. Fix in https://github.com/wikimedia/change-propagation/pull/141.
- Parsoid's MW API request timeout is equal to MediaWiki's nominal response timeout (both 60 seconds).
- Unclear if MediaWiki's 60s response timeout actually works (T97192).
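
The timeout issues above follow from one rule of thumb: on every hop, the caller should wait longer than the callee's response timeout, so that the callee gives up first and no work keeps running after the caller has gone. The Python sketch below checks the audited values against that rule. The `Hop` type and `find_issues` function are hypothetical helpers, and the values are the ones currently listed under "Audit data"; an unknown callee timeout is reported rather than guessed.

```python
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    caller: str
    callee: str
    caller_timeout_s: int            # how long the caller waits for a response
    callee_timeout_s: Optional[int]  # callee's response timeout; None if unknown


# Values as listed in the audit data.
HOPS = [
    Hop("ChangeProp", "RESTBase", 7 * 60, 6 * 60),
    Hop("ChangeProp", "MediaWiki API", 2 * 60, 60),
    Hop("RESTBase", "Parsoid", 2 * 60, None),
    Hop("Parsoid", "MediaWiki API", 60, 60),
]


def find_issues(hops: List[Hop]) -> List[str]:
    """Flag hops where the caller does not outlast the callee's timeout."""
    issues = []
    for hop in hops:
        edge = f"{hop.caller} -> {hop.callee}"
        if hop.callee_timeout_s is None:
            issues.append(f"{edge}: callee response timeout unknown")
        elif hop.caller_timeout_s <= hop.callee_timeout_s:
            issues.append(
                f"{edge}: caller timeout ({hop.caller_timeout_s}s) is not "
                f"above callee timeout ({hop.callee_timeout_s}s)"
            )
    return issues


if __name__ == "__main__":
    for issue in find_issues(HOPS):
        print(issue)
```

Run against the listed values, the check flags the RESTBase to Parsoid hop (Parsoid's response timeout is not recorded) and the Parsoid to MediaWiki API hop (both 60 seconds). It cannot tell whether MediaWiki's nominal 60s timeout is actually enforced, which is the open question in T97192.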