
Check concurrency/retry/timeout limits and synchronize those between services
Closed, Resolved · Public

Description

At the moment, the WMF cluster has various call chains between different services, which have caused various issues and overloads like the parent task of this one.

Specifically, one of the longest "chain reactions" is the one where an event in MediaWiki causes ChangePropagation to call RESTBase, RESTBase to call Parsoid, and Parsoid to call the MediaWiki API.

If any of these components fails or takes a long time to respond, the other ones might abort the request, but that doesn't necessarily mean the request will be aborted server-side as well and processing will stop, as timeouts might be very different between the caller and the callee.
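As an illustration only (not how these services are actually wired together), a caller can both abort on its own timeout and pass its remaining budget down to the callee, so that server-side work stops when the client has already given up. The URL and the timeout header below are hypothetical.

```
// Minimal sketch: the caller enforces its own timeout with an AbortController
// and also tells the callee how much time it has, so the callee can stop
// server-side work instead of finishing a result nobody will read.
// The URL and the "x-request-timeout-ms" header are hypothetical.
async function callWithDeadline(url: string, timeoutMs: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, {
      signal: controller.signal,                              // client-side abort
      headers: { 'x-request-timeout-ms': String(timeoutMs) }, // hint for the callee
    });
  } finally {
    clearTimeout(timer);
  }
}

// Example: a 2-minute budget for a RESTBase subrequest.
// callWithDeadline('https://restbase.example/page/html/Foo', 120_000);
```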

Also, while ChangePropagation has its own concurrency limiting system (which we should review anyway), the policies for retrying upon failure in both RESTBase and Parsoid should be checked and tuned.

A similar audit will be needed for all other services that call other services in a cascade, including MediaWiki.

This is tangentially related to T97192.

Audit data

ChangeProp

  • Concurrency limits: Dependent on the rule, 50 by default, 400 for transclusion updates, 15 for ORES
  • Request timeout: 7 minutes by default for requests to RESTBase, 2 minutes for direct requests to MW API, configurable per rule
  • Request retry policy: No immediate retry; up to two delayed retries by default, with the set of retried status codes dependent on the rule. Normally only 5xx responses are retried, but some rules differ depending on the context; for example, on revision creation a 404 is retried to account for DB replication lag. The first retry happens after 1 minute, the second after 7 minutes.
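For orientation, the audited ChangeProp defaults can be summarised in a single per-rule shape. The field names below are hypothetical (they do not mirror ChangeProp's actual configuration schema); only the numbers come from the audit above.

```
// Hypothetical per-rule shape; only the numeric values come from the audit.
interface RuleLimits {
  concurrency: number;       // parallel requests allowed for the rule
  requestTimeoutMs: number;  // per-request timeout
  retryDelaysMs: number[];   // delayed retries only, no immediate retry
  retriedStatuses: string;   // normally 5xx; some rules also retry 404
}

const changePropDefaults: RuleLimits = {
  concurrency: 50,                     // 400 for transclusion updates, 15 for ORES
  requestTimeoutMs: 7 * 60 * 1000,     // 7 min to RESTBase; 2 min for direct MW API calls
  retryDelaysMs: [60_000, 7 * 60_000], // first retry after 1 min, second after 7 min
  retriedStatuses: '5xx',              // e.g. 404 also retried on revision creation
};
```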

RESTBase

  • Response timeout: 6 minutes
  • Subrequest timeout: 2 minutes; timeout doubled + fuzzed on retry.
  • Subrequest retry policy: Only retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.
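A rough sketch of that subrequest retry behaviour under the assumptions stated above (this is not RESTBase's actual code): at most one retry, triggered only by a timeout or a 503 carrying retry-after; the delay comes from retry-after when present, otherwise roughly 500ms with exponential back-off; and the timeout is doubled and fuzzed for the retry.

```
// Illustrative only; not RESTBase's real implementation.
function shouldRetry(attempt: number, timedOut: boolean,
                     status?: number, retryAfterSec?: number): boolean {
  if (attempt >= 1) return false;                               // a single retry at most
  return timedOut || (status === 503 && retryAfterSec !== undefined);
}

function retryDelayMs(attempt: number, retryAfterSec?: number): number {
  if (retryAfterSec !== undefined) return retryAfterSec * 1000; // honour retry-after
  return 500 * 2 ** attempt;                                    // ~500ms plus exponential back-off
}

function retryTimeoutMs(baseTimeoutMs: number): number {
  const fuzz = 1 + Math.random() * 0.1;                         // small random fuzz
  return baseTimeoutMs * 2 * fuzz;                              // timeout doubled on retry
}
```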

Parsoid

  • Response timeout: 3 minutes
  • Subrequest timeout (PHP API): 60 seconds
  • Subrequest retry policy: Only retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.

MediaWiki API

  • Response timeout: Nominally 60s, but timeout does not seem to be working (see T97192).

Issues found

Event Timeline

See also:

The latter document is intended to be a living document, to be updated to reflect our learnings and thinking.

@ssastry @Arlolra: We need to determine the actual upper bound on render times that we want to support, and then set Parsoid's response time limit to that value. The RB request limit then needs to be higher than Parsoid's response timeout, so that we don't needlessly retry.

I know we discussed this issue before, but somehow it seems that the adjustment has never happened.

Edit: Previous related change was https://phabricator.wikimedia.org/rGPAD5300ede8302bb9db84df191ff17182429f7a0667.

@GWicke, we set a 3-min render timeout in Parsoid so that RESTBase retries have a chance of succeeding in case the page genuinely takes a bit longer than 2 mins; see https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/templates/config.yaml.j2#L47-L53. Otherwise, RB's higher timeout value (on a retry) is of no use. But we can limit it to 2 mins. In that case, RB need not bump its timeout value on retry.


110 seconds would be better, as this would be less than the RESTBase client timeout, and would thus avoid retries for slow requests altogether. See https://www.mediawiki.org/wiki/Rules_of_thumb_for_robust_service_infrastructure#Retries for the reasoning behind coordinating timeouts to avoid retries.
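To make the ordering concrete, here is an illustrative check of that rule applied to this chain. The check itself is a sketch; the values are the ones from the audit above plus the proposed 110s Parsoid timeout.

```
// Each caller's timeout should exceed its callee's response timeout, so a slow
// request fails at the callee first and the caller never retries work that is
// still running. Values (seconds) are from the audit and the discussion above.
const chain: Array<{ name: string; timeoutS: number }> = [
  { name: 'ChangeProp -> RESTBase request', timeoutS: 7 * 60 },
  { name: 'RESTBase response',              timeoutS: 6 * 60 },
  { name: 'RESTBase -> Parsoid subrequest', timeoutS: 2 * 60 },
  { name: 'Parsoid response (proposed)',    timeoutS: 110 },
  { name: 'MediaWiki API response',         timeoutS: 60 },
];

for (let i = 1; i < chain.length; i++) {
  if (chain[i].timeoutS >= chain[i - 1].timeoutS) {
    console.warn(`${chain[i].name} (${chain[i].timeoutS}s) is not below ` +
                 `${chain[i - 1].name} (${chain[i - 1].timeoutS}s): needless retries possible`);
  }
}
```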

Change 325708 had a related patch set uploaded (by Subramanya Sastry):
Reduce Parsoid request timeout to 110s

https://gerrit.wikimedia.org/r/325708

Change 325708 merged by jenkins-bot:
Update timeout values

https://gerrit.wikimedia.org/r/325708

Is there anything left to be done here?


I had the same question here. Can this be closed and new tasks created for anything that still needs doing?

Pchelolo claimed this task.
Pchelolo edited projects, added Services (done); removed Services (watching).
Pchelolo subscribed.

I guess it can be closed now; there's been no activity here lately.