
Check concurrency/retry/timeout limits and synchronize those between services
Closed, Resolved · Public

Description

At the moment, the WMF cluster has various call chains between different services, which have caused various issues and overloads like the parent task of this one.

Specifically, one of the longest "chain reactions" is the one where an event in MediaWiki causes ChangePropagation to call RESTBase, RESTBase to call Parsoid, and Parsoid to call the MediaWiki API.

If any of these components fails or takes a long time to respond, the other ones might abort the request, but that doesn't necessarily mean the request will be aborted server-side as well and processing will stop, as timeouts might be very different between the caller and the callee.
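As an illustration only (not how these services are actually wired together), a caller can both abort on its own timeout and pass its remaining budget down to the callee, so that server-side work stops when the client has already given up. The URL and the timeout header below are hypothetical.

```
// Minimal sketch: the caller enforces its own timeout with an AbortController
// and also tells the callee how much time it has, so the callee can stop
// server-side work instead of finishing a result nobody will read.
// The URL and the "x-request-timeout-ms" header are hypothetical.
async function callWithDeadline(url: string, timeoutMs: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, {
      signal: controller.signal,                              // client-side abort
      headers: { 'x-request-timeout-ms': String(timeoutMs) }, // hint for the callee
    });
  } finally {
    clearTimeout(timer);
  }
}

// Example: a 2-minute budget for a RESTBase subrequest.
// callWithDeadline('https://restbase.example/page/html/Foo', 120_000);
```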

Also, while ChangePropagation has its own concurrency limiting system (which we should review anyway), the policies for retrying upon failure in both RESTBase and Parsoid should be checked and tuned.

A similar audit will be needed for all other services that call other services in a cascade, including MediaWiki.

This is tangentially related to T97192.

Audit data

ChangeProp

  • Concurrency limits: Dependent on the rule, 50 by default, 400 for transclusion updates, 15 for ORES
  • Request timeout: 7 minutes by default for requests to RESTBase, 2 minutes for direct requests to MW API, configurable per rule
  • Request retry policy: No immediate retry; up to two delayed retries by default, with the set of retried status codes dependent on the rule. Normally only 5xx responses are retried, but some rules differ depending on the context; for example, on revision creation a 404 is retried to account for DB replication lag. The first retry happens after 1 minute, the second after 7 minutes.
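For orientation, the audited ChangeProp defaults can be summarised in a single per-rule shape. The field names below are hypothetical (they do not mirror ChangeProp's actual configuration schema); only the numbers come from the audit above.

```
// Hypothetical per-rule shape; only the numeric values come from the audit.
interface RuleLimits {
  concurrency: number;       // parallel requests allowed for the rule
  requestTimeoutMs: number;  // per-request timeout
  retryDelaysMs: number[];   // delayed retries only, no immediate retry
  retriedStatuses: string;   // normally 5xx; some rules also retry 404
}

const changePropDefaults: RuleLimits = {
  concurrency: 50,                     // 400 for transclusion updates, 15 for ORES
  requestTimeoutMs: 7 * 60 * 1000,     // 7 min to RESTBase; 2 min for direct MW API calls
  retryDelaysMs: [60_000, 7 * 60_000], // first retry after 1 min, second after 7 min
  retriedStatuses: '5xx',              // e.g. 404 also retried on revision creation
};
```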

RESTBase

  • Response timeout: 6 minutes
  • Subrequest timeout: 2 minutes; timeout doubled + fuzzed on retry.
  • Subrequest retry policy: Only retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.
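A rough sketch of that subrequest retry behaviour under the assumptions stated above (this is not RESTBase's actual code): at most one retry, triggered only by a timeout or a 503 carrying retry-after; the delay comes from retry-after when present, otherwise roughly 500ms with exponential back-off; and the timeout is doubled and fuzzed for the retry.

```
// Illustrative only; not RESTBase's real implementation.
function shouldRetry(attempt: number, timedOut: boolean,
                     status?: number, retryAfterSec?: number): boolean {
  if (attempt >= 1) return false;                               // a single retry at most
  return timedOut || (status === 503 && retryAfterSec !== undefined);
}

function retryDelayMs(attempt: number, retryAfterSec?: number): number {
  if (retryAfterSec !== undefined) return retryAfterSec * 1000; // honour retry-after
  return 500 * 2 ** attempt;                                    // ~500ms plus exponential back-off
}

function retryTimeoutMs(baseTimeoutMs: number): number {
  const fuzz = 1 + Math.random() * 0.1;                         // small random fuzz
  return baseTimeoutMs * 2 * fuzz;                              // timeout doubled on retry
}
```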

Parsoid

  • Response timeout: 3 minutes
  • Subrequest timeout (PHP API): 60 seconds
  • Subrequest retry policy: Only retry once on timeout or 503 with retry-after. Retry delay is retry-after value, or ~500ms plus exponential back-off.

MediaWiki API

  • Response timeout: Nominally 60s, but timeout does not seem to be working (see T97192).

Issues found

Event Timeline

See also:

The latter document is intended to be a living document, to be updated to reflect our learnings and thinking.

@ssastry @Arlolra: We need to determine the actual upper bound on render times that we want to support, and then set Parsoid's response time limit to that value. The RB request limit then needs to be higher than Parsoid's response timeout, so that we don't needlessly retry.

I know we discussed this issue before, but somehow it seems that the adjustment has never happened.

Edit: Previous related change was https://phabricator.wikimedia.org/rGPAD5300ede8302bb9db84df191ff17182429f7a0667.

@GWicke, we set a 3-min render timeout in Parsoid so that RESTBase retries have a chance of succeeding in case the page genuinely takes a bit longer than 2 mins; see https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/templates/config.yaml.j2#L47-L53. Otherwise, RB's higher timeout value (on a retry) is of no use. But we can limit it to 2 mins. In that case, RB need not bump its timeout value on retry.


110 seconds would be better, as this would be less than the RESTBase client timeout, and would thus avoid retries for slow requests altogether. See https://www.mediawiki.org/wiki/Rules_of_thumb_for_robust_service_infrastructure#Retries for the reasoning behind coordinating timeouts to avoid retries.
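To make the ordering concrete, here is an illustrative check of that rule applied to this chain. The check itself is a sketch; the values are the ones from the audit above plus the proposed 110s Parsoid timeout.

```
// Each caller's timeout should exceed its callee's response timeout, so a slow
// request fails at the callee first and the caller never retries work that is
// still running. Values (seconds) are from the audit and the discussion above.
const chain: Array<{ name: string; timeoutS: number }> = [
  { name: 'ChangeProp -> RESTBase request', timeoutS: 7 * 60 },
  { name: 'RESTBase response',              timeoutS: 6 * 60 },
  { name: 'RESTBase -> Parsoid subrequest', timeoutS: 2 * 60 },
  { name: 'Parsoid response (proposed)',    timeoutS: 110 },
  { name: 'MediaWiki API response',         timeoutS: 60 },
];

for (let i = 1; i < chain.length; i++) {
  if (chain[i].timeoutS >= chain[i - 1].timeoutS) {
    console.warn(`${chain[i].name} (${chain[i].timeoutS}s) is not below ` +
                 `${chain[i - 1].name} (${chain[i - 1].timeoutS}s): needless retries possible`);
  }
}
```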

Change 325708 had a related patch set uploaded (by Subramanya Sastry):
Reduce Parsoid request timeout to 110s

https://gerrit.wikimedia.org/r/325708

Change 325708 merged by jenkins-bot:
Update timeout values

https://gerrit.wikimedia.org/r/325708

Is there anything left to be done here?


I had the same question here. Can this be closed and new tasks created for anything that still needs doing?

Pchelolo claimed this task.
Pchelolo edited projects, added Services (done); removed Services (watching).
Pchelolo subscribed.

I guess it can be closed now; there's been no activity here lately.