At the moment, the WMF cluster has several call chains between services, and these have caused various issues and overloads, such as the one described in this task's parent task.
Specifically, one of the longest "chain reactions" starts with an event in MediaWiki, which causes ChangePropagation to call RESTBase, RESTBase to call Parsoid, and Parsoid to call the MediaWiki API.
If any of these components fails or takes a long time to respond, its caller might abort the request. That does not necessarily mean the request is also aborted server-side and processing stops, because timeouts can differ widely between the caller and the callee.
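To make the caller/callee mismatch concrete, here is a minimal sketch in Python. It is not code from any of the services involved: `callee_work` and `call_with_timeout` are hypothetical names, and a thread stands in for the remote service. It only shows that a caller-side timeout does not stop the work on the callee side.

```python
import concurrent.futures
import time


def callee_work(duration_s, log):
    # Stand-in for the remote service: it keeps working for duration_s
    # seconds, whether or not anyone is still waiting for the answer.
    time.sleep(duration_s)
    log.append("callee finished")
    return "response"


def call_with_timeout(caller_timeout_s, callee_duration_s):
    # Hypothetical caller that gives up after caller_timeout_s seconds.
    log = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(callee_work, callee_duration_s, log)
        try:
            future.result(timeout=caller_timeout_s)
            log.append("caller got response")
        except concurrent.futures.TimeoutError:
            # The caller stops waiting, but the callee is not interrupted.
            log.append("caller gave up")
    # Leaving the "with" block waits for the worker thread to finish.
    return log
```

With a caller timeout shorter than the callee's processing time, the log records both the caller giving up and the callee finishing anyway, which is the wasted work this task is concerned with.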
Also, while ChangePropagation has its own concurrency limiting system (which we should review anyway), the policies for retrying upon failure in both RESTBase and Parsoid should be checked and tuned.
A similar audit will be needed for all other services that call other services in a cascade, including MediaWiki.
This is tangentially related to T97192.
## Audit data
ChangeProp
- Request timeout: _ minutes. Timeout doubled + fuzzed on retry.
- Request retry policy: No immediate retry; up to two delayed retries for status [fill me in]. Default retry 6 minutes + exponential back-off.
RESTBase
- Response timeout: 6 minutes
- Subrequest timeout: 2 minutes; timeout doubled + fuzzed on retry.
- Subrequest retry policy: Retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.
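The RESTBase subrequest policy above can be read roughly as follows. This is a sketch in Python under stated assumptions, not RESTBase's actual code: the function names are hypothetical, "with retry-after" is assumed to apply only to the 503 case, the back-off is assumed to double per attempt starting from 500ms, and the 10% fuzz range is a placeholder because the audit data does not give one.

```python
import random


def should_retry(attempt, status, retry_after=None):
    # "Retry once": only the first attempt (attempt == 0) may be retried.
    if attempt >= 1:
        return False
    if status == "timeout":
        return True
    # Assumption: a 503 is retried only when it carries Retry-After.
    return status == 503 and retry_after is not None


def retry_delay_s(attempt, retry_after=None, base_s=0.5):
    # Prefer the server-provided Retry-After value (in seconds).
    if retry_after is not None:
        return float(retry_after)
    # Otherwise ~500ms, doubled for each attempt already made (assumed).
    return base_s * (2 ** attempt)


def next_timeout_s(timeout_s, fuzz=0.1, rng=random):
    # "Timeout doubled + fuzzed on retry"; the fuzz range is an assumption.
    return timeout_s * 2 * (1 + rng.uniform(0, fuzz))
```

For a first attempt with the 2 minute subrequest timeout, `next_timeout_s(120)` yields a retry timeout of at least 240 seconds under these assumptions.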
Parsoid
- Response timeout:
- Subrequest timeout (PHP API): 60 seconds
- Subrequest retry policy: Retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.
MediaWiki API
- Response timeout: Nominally 60s, but timeout does not seem to be working (see T97192).
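One way to use the audit data is to compare how long a caller may spend on a subrequest, including its retry, with the timeout of the layer above it. The sketch below is plain arithmetic in Python with a hypothetical helper, `worst_case_s`. It ignores fuzzing, uses the ~500ms default retry delay, and assumes that Parsoid does not double its timeout on retry, since the audit data does not say that it does.

```python
def worst_case_s(first_timeout_s, retries, delay_s, double_on_retry):
    # Upper bound on time spent if every attempt runs into its timeout.
    total = first_timeout_s
    timeout = first_timeout_s
    for _ in range(retries):
        if double_on_retry:
            timeout *= 2
        total += delay_s + timeout
    return total


# RESTBase subrequest: 2 minutes, one retry, timeout doubled on retry.
restbase_sub = worst_case_s(120, retries=1, delay_s=0.5, double_on_retry=True)

# Parsoid subrequest to the PHP API: 60 seconds, one retry (no doubling assumed).
parsoid_sub = worst_case_s(60, retries=1, delay_s=0.5, double_on_retry=False)

print(restbase_sub, parsoid_sub)
```

Under these assumptions the RESTBase subrequest budget comes to 360.5 seconds, slightly above RESTBase's own 6 minute response timeout, and the Parsoid budget comes to 120.5 seconds, slightly above the 2 minute timeout RESTBase applies to its first subrequest attempt. Whether this happens in practice depends on the missing values and the real fuzz range.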