Diffs sometimes fatal due to request timeout via ApiComparePages (revision comparisons)
Closed, Duplicate · Public · PRODUCTION ERROR
Assigned To

None

Authored By

	Krinkle
	Jan 21 2020, 6:40 PM

Description

Error message

WMFTimeoutException: the execution time limit of 60 seconds was exceeded

from /srv/mediawiki/wmf-config/set-time-limit.php:39

trace

#0 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(214): {closure}(integer)
#1 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(140): TextSlotDiffRenderer->getTextDiffInternal(string, string)
#2 /srv/mediawiki/php-1.35.0-wmf.15/includes/poolcounter/PoolCounterWorkViaCallback.php(69): TextSlotDiffRenderer->{closure}()
#3 /srv/mediawiki/php-1.35.0-wmf.15/includes/poolcounter/PoolCounterWork.php(125): PoolCounterWorkViaCallback->doWork()
#4 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(173): PoolCounterWork->execute()
#5 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(124): TextSlotDiffRenderer->getTextDiff(string, string)
#6 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/DifferenceEngine.php(1137): TextSlotDiffRenderer->getDiff(WikitextContent, WikitextContent)
#7 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiComparePages.php(175): DifferenceEngine->getDiffBody()
#8 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiMain.php(1603): ApiComparePages->execute()
#9 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiMain.php(539): ApiMain->executeAction()
#10 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiMain.php(510): ApiMain->executeActionWithErrorHandling()
#11 /srv/mediawiki/php-1.35.0-wmf.15/api.php(78): ApiMain->execute()
#12 /srv/mediawiki/w/api.php(3): require(string)
#13 {main}

Impact

Raised error levels for MediaWiki in production as a whole.

User gets fatal error without a way forward.

Notes

This started 15 Jan 2020 and has a frequency of 80,000 to 100,000 crashes per day, which makes it the most frequent error in production right now by several orders of magnitude.

From https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors.

Details

Request ID: XidCTQpAAEwAAHEeKIsAAACU
Request URL: /w/api.php?action=compare&fromtitle=…&format=json&fromrev=…&torev=…

Related Objects

Mentioned In: T204010: Comparing revisions can fatal (from wikidiff2 via TextSlotDiffRenderer or ApiComparePages)
Mentioned Here: T204010: Comparing revisions can fatal (from wikidiff2 via TextSlotDiffRenderer or ApiComparePages)

Event Timeline

Krinkle created this task. Jan 21 2020, 6:40 PM

Restricted Application added a subscriber: Aklapper. Jan 21 2020, 6:40 PM

Krinkle moved this task from Untriaged to Jan2020/1.35-wmf.14 on the Wikimedia-production-error board. Jan 21 2020, 6:40 PM

Looking at TextSlotDiffRenderer.php(214)[1] these timeouts seem to come from wikidiff2. I just wonder why this is happening so often since 15th January... Some diff cache invalidation?

[1] https://gerrit.wikimedia.org/g/mediawiki/core/+/d1a7e4043f47d291eff516081916d0431e8ecb02/includes/diff/TextSlotDiffRenderer.php#211

There doesn't seem to be anything relevant to MediaWiki-Action-API here.

It looks like there was a sudden increase in requests from one particular IP corresponding with the increase in errors. At a cursory look at the requests from that IP, it seems like someone may have decided to load the diffs of basically every new edit (probably getting the events from EventStream or the like, since the only requests I see are action=compare).

WDoranWMF triaged this task as High priority. Jan 22 2020, 6:26 PM

WDoranWMF removed a project: Platform Engineering.

WDoranWMF assigned this task to Clarakosi. Jan 22 2020, 6:28 PM

Ideally we'd acknowledge/plug the known timeout behaviour for this API end-point in a way that doesn't spike fatal counts for app servers (or HTTP 5xx levels for edge datacentres). By catching this at a lower level within MediaWiki, we can also make sure the user gets an experience that remains in-context and on-brand for the current wiki, e.g. with a localised message that communicates in a few words that there's likely nothing we or they can do about this, as the diff might be too complex to render efficiently. Perhaps (for the API use case, rather than the GUI) with a new error code specific to compare that acknowledges this is "normal" and expected behaviour for some revisions.

Even better, though I don't know how easy or hard that is: perhaps we can give our diff logic only a limited amount of time or complexity to work with, and fall back to a very basic rendering that is still useful to the end-user (e.g. something naïve like showing the two revisions side-by-side in their entirety in plain text, like we do for the first revision of a page).
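To make the "limited complexity, then fall back" idea concrete, here is a minimal sketch in Python rather than MediaWiki's PHP. Everything in it is an assumption for illustration: the function name `diff_or_fallback`, the `max_cells` threshold, and the use of "old lines × new lines" as a rough complexity estimate. It is not how wikidiff2 or TextSlotDiffRenderer work.

```python
import difflib
from itertools import zip_longest


def diff_or_fallback(old: str, new: str, max_cells: int = 1_000_000) -> dict:
    """Return a real diff when the input looks cheap enough, else a plain fallback.

    `max_cells` is a hypothetical complexity budget: the size of the
    old-lines x new-lines comparison grid we are willing to consider.
    """
    old_lines = old.splitlines()
    new_lines = new.splitlines()

    if len(old_lines) * len(new_lines) > max_cells:
        # Over budget: skip diffing and show both revisions side by side.
        rows = list(zip_longest(old_lines, new_lines, fillvalue=""))
        return {"mode": "side-by-side", "rows": rows}

    # Within budget: compute an ordinary line-based diff.
    body = list(difflib.unified_diff(old_lines, new_lines, lineterm=""))
    return {"mode": "diff", "body": body}
```

A caller would branch on `mode` to decide whether to render a diff table or the plain two-column view. A size check like this only bounds the input up front; it does not interrupt a diff that is already running.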

In T243313#5826351, @Krinkle wrote:

Ideally we'd acknowledge/plug the known timeout behaviour for this API end-point in a way that doesn't spike fatal counts for app servers (or HTTP 5xx levels for edge datacentres).

The Action API already catches all Throwables to produce an API error response (which will be served with an HTTP 200). If you're seeing 500s, then whatever error is being raised is apparently not catchable.

(e.g. something naïve like showing the two revisions side-by-side in their entirety in plain text, like we do for the first revision of a page).

That doesn't sound like a particularly useful fallback to me.

Mentioned in SAL (#wikimedia-operations) [2020-01-24T07:19:35Z] <_joe_> force run puppet on all esams cache nodes, for mitigation of T243313

Krenair subscribed. Jan 24 2020, 7:42 PM

In T243313#5826825, @Anomie wrote:

In T243313#5826351, @Krinkle wrote:

Ideally we'd acknowledge/plug the known timeout behaviour for this API end-point in a way that doesn't spike fatal counts for app servers (or HTTP 5xx levels for edge datacentres).

The Action API already catches all Throwables to produce an API error response (which will be served with an HTTP 200). […]

Indeed, the API endpoint handles that differently. However, that doesn't change the severity of the issue, as the change in status code is merely a detail of api.php. It doesn't mean the response isn't fatal. As such, our monitoring treats these fatals the same as it would if they occurred on index.php, and raises an alert if their frequency is significantly raised.

In T243313#5826825, @Anomie wrote:

In T243313#5826351, @Krinkle wrote:

(e.g. something naïve like showing the two revisions side-by-side in their entirety in plain text, like we do for the first revision of a page).

That doesn't sound like a particularly useful fallback to me.

Perhaps the empty string would suffice then, or the timeout could be formalised as a known failure scenario with a dedicated error code. The take-away is that unless the fatal is due to a simple or obvious programmer error that we can fix, then the error needs to be acknowledged in some way (possibly temporarily) so that we can produce a response that doesn't raise the fatal/exception levels in Logstash, and is (slightly) more useful to users than the current error.
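As a rough illustration of "formalising the timeout as a known failure scenario", the sketch below (Python, not MediaWiki code) turns one specific, expected exception into a structured error response logged at warning level, while letting every other exception propagate as a genuine fatal. The names `DiffTimeout`, `compare_response` and the error code `compare-timeout` are hypothetical. It also assumes the timeout surfaces as a catchable, diff-specific exception, which the thread notes is not the case for WMFTimeoutException.

```python
import logging

logger = logging.getLogger("compare")


class DiffTimeout(Exception):
    """Hypothetical exception for a diff that exceeded its time allowance."""


def compare_response(render_diff, from_rev, to_rev) -> dict:
    """Build an API-style response, treating DiffTimeout as an expected outcome."""
    try:
        return {"compare": {"body": render_diff(from_rev, to_rev)}}
    except DiffTimeout:
        # Known failure: log below error level so it does not count as a fatal.
        logger.warning("diff timed out for %s -> %s", from_rev, to_rev)
        return {
            "error": {
                "code": "compare-timeout",  # hypothetical error code
                "info": "The diff is too complex to render in the time allowed.",
            }
        }
    # Any other exception is left uncaught on purpose: programmer errors
    # should still be reported at full severity.
```

The design choice here is the narrow `except`: only the acknowledged failure is downgraded, so monitoring still sees unexpected breakage.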

In T243313#5830921, @Krinkle wrote:

or the timeout could be formalised as a known failure scenario with a dedicated error code.

Unfortunately the WMFTimeoutException, as its name implies, is a WMF hack rather than something in MediaWiki itself that the API or the diff code could legitimately handle.

Krinkle renamed this task from Fatal WMFTimeoutException for ApiComparePages requests to Diffs somtimes fatal due to request timeout via ApiComparePages (revision comparisons). Feb 4 2020, 6:19 PM

Krinkle renamed this task from Diffs somtimes fatal due to request timeout via ApiComparePages (revision comparisons) to Diffs sometimes fatal due to request timeout via ApiComparePages (revision comparisons).

I see. Yeah, I suppose that's hard to handle indeed. If the criterion we use to decide whether to "support" a diff is time spent, then maybe this could be incorporated into the diffing code itself. E.g. between its smaller internal steps, it could throw a diff-specific (not WMF-specific) error that the budget was exceeded.
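A minimal sketch of what "checking the budget between internal steps" could look like, in Python and using a textbook longest-common-subsequence table as a stand-in for the real diff core (wikidiff2's algorithm is different, and whether it has comparable steps is an open question here). `DiffBudgetExceeded`, `lcs_length` and the injectable `clock` parameter are all hypothetical names.

```python
import time


class DiffBudgetExceeded(Exception):
    """Hypothetical diff-specific error: the time budget ran out mid-computation."""


def lcs_length(old_lines, new_lines, budget_seconds, clock=time.monotonic):
    """Length of the longest common subsequence of two line lists.

    One row of the dynamic-programming table counts as one "internal step";
    the deadline is checked before each row.
    """
    deadline = clock() + budget_seconds
    prev = [0] * (len(new_lines) + 1)

    for a in old_lines:
        if clock() > deadline:
            raise DiffBudgetExceeded("diff time budget exceeded")
        cur = [0]
        for j, b in enumerate(new_lines, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur

    return prev[-1]
```

Because the error is raised by the diff code itself, a caller can catch it as an ordinary exception, unlike a request-level time limit. The check is only as fine-grained as one row, so a single very long row can still overrun the budget.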

Alternatively, if we don't think there are other ways diffing can fail and/or if those ways do not need to raise fatal report levels (e.g. if we're certain they won't be caused by MW deployments), then we could put a catch-all around them in which the user is informed about the diff being unavailable.

@Anomie What direction would you prefer to take this in? Is it plausible for wikidiff2 to always produce a diff within a certain timeframe (within the limits of the text we allow to be fed to the diff engine), and/or are there other ways of falling back that you think would be useful?

I don't know much about wikidiff2, but I suspect it's similar to a parse in that there are probably complex cases that will always take a long time. I also don't know whether or not it has sufficient internal "steps" where a time check could be inserted.

WDoranWMF moved this task from Inbox to Backlog on the Platform Team Workboards (Clinic Duty Team) board. Mar 24 2020, 3:49 PM

daniel removed Clarakosi as the assignee of this task. Apr 14 2020, 8:48 PM

daniel moved this task from Backlog to Later on the Platform Team Workboards (Clinic Duty Team) board.

daniel added a subscriber: Clarakosi.

Pchelolo edited projects, added Platform Engineering; removed Platform Team Workboards (Clinic Duty Team). Apr 15 2020, 4:35 PM

Pchelolo moved this task from Inbox to Future Initiatives/Small Projects on the Platform Engineering board.

Naike added a project: Platform Engineering Roadmap Decision Making. Oct 8 2020, 9:56 PM

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:01 PM

Naike moved this task from Untriaged to Icebox on the Platform Engineering Roadmap Decision Making board. Nov 16 2020, 3:25 PM

CCicalese_WMF removed a project: Platform Engineering. Feb 24 2021, 11:37 PM