Page MenuHomePhabricator

Diffs sometimes fatal due to request timeout via ApiComparePages (revision comparisons)
Closed, DuplicatePublicPRODUCTION ERROR

Description

Error message
WMFTimeoutException: the execution time limit of 60 seconds was exceeded

from /srv/mediawiki/wmf-config/set-time-limit.php:39
trace
#0 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(214): {closure}(integer)
#1 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(140): TextSlotDiffRenderer->getTextDiffInternal(string, string)
#2 /srv/mediawiki/php-1.35.0-wmf.15/includes/poolcounter/PoolCounterWorkViaCallback.php(69): TextSlotDiffRenderer->{closure}()
#3 /srv/mediawiki/php-1.35.0-wmf.15/includes/poolcounter/PoolCounterWork.php(125): PoolCounterWorkViaCallback->doWork()
#4 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(173): PoolCounterWork->execute()
#5 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/TextSlotDiffRenderer.php(124): TextSlotDiffRenderer->getTextDiff(string, string)
#6 /srv/mediawiki/php-1.35.0-wmf.15/includes/diff/DifferenceEngine.php(1137): TextSlotDiffRenderer->getDiff(WikitextContent, WikitextContent)
#7 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiComparePages.php(175): DifferenceEngine->getDiffBody()
#8 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiMain.php(1603): ApiComparePages->execute()
#9 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiMain.php(539): ApiMain->executeAction()
#10 /srv/mediawiki/php-1.35.0-wmf.15/includes/api/ApiMain.php(510): ApiMain->executeActionWithErrorHandling()
#11 /srv/mediawiki/php-1.35.0-wmf.15/api.php(78): ApiMain->execute()
#12 /srv/mediawiki/w/api.php(3): require(string)
#13 {main}
Impact

Raised error levels for MediaWiki in production as a whole.

User gets fatal error without a way forward.

Notes

This started 15 Jan 2020 and has a frequently of 80,000 to 100,000 crashes per day, which makes it the most frequent error in production right now by several order of magnitudes.

From https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors.

cap.png (508×2 px, 52 KB)

Details

Request ID
XidCTQpAAEwAAHEeKIsAAACU
Request URL
/w/api.php?action=compare&fromtitle=…&format=json&fromrev=…&torev=…

Event Timeline

Looking at TextSlotDiffRenderer.php(214)[1] these timeouts seem to come from wikidiff2. I just wonder why this is happening so often since 15th January... Some diff cache invalidation?

[1] https://gerrit.wikimedia.org/g/mediawiki/core/+/d1a7e4043f47d291eff516081916d0431e8ecb02/includes/diff/TextSlotDiffRenderer.php#211

Anomie subscribed.

There doesn't seem to be anything relevant to MediaWiki-Action-API here.

It looks like there was a sudden increase in requests from one particular IP corresponding with the increase in errors. At a cursory look at the requests from that IP, it seems like someone may have decided to load the diffs of basically every new edit (probably getting the events from EventStream or the like, since the only requests I see are action=compare).

WDoranWMF removed a project: Platform Engineering.

Ideally we'd acknowledge/plug the known timeout behaviour for this API end-point in a way that doesn't spike fatal counts for app servers (or HTTP 5xx levels for edge datacentres). By catching this at a lower level within MediaWiki can also make sure the user gets an experiene that remains in-context and on-brand for the current wiki with e.g. with a localised message that communicates in a few words that there's likely nothing we or they can do about this as the diff might be too complex to efficiently render. Perhaps (for the API use case, rather than the GUI) with a new error code specific to compare that acknowledges this is "normal" and expected behaviour for some revisions.

Even better, but I don't know how easy or hard that is, perhaps we can actually give our diff logic only a limited amont of time or complexity to consider and fallback to a very basic fallback that is still useful to the end-user (e.g. something naîve lke showing the the two revisions side-by-side in their entirety in plain text (like we do for the first revision of a page).

Ideally we'd acknowledge/plug the known timeout behaviour for this API end-point in a way that doesn't spike fatal counts for app servers (or HTTP 5xx levels for edge datacentres).

The Action API already catches all Throwables to produce an API error response (which will be served with an HTTP 200). If you're seeing 500s, then whatever error is being raised is apparently not catchable.

(e.g. something naîve lke showing the the two revisions side-by-side in their entirety in plain text (like we do for the first revision of a page).

That doesn't sound like a particularly useful fallback to me.

Mentioned in SAL (#wikimedia-operations) [2020-01-24T07:19:35Z] <_joe_> force run puppet on all esams cache nodes, for mitigation of T243313

Ideally we'd acknowledge/plug the known timeout behaviour for this API end-point in a way that doesn't spike fatal counts for app servers (or HTTP 5xx levels for edge datacentres).

The Action API already catches all Throwables to produce an API error response (which will be served with an HTTP 200). […]

Indeed. For the API endpoint that is indeed handled differently. However that doesn't change the severity of the issue as the change in status code is merely a detail of api.php. It doesn't mean the response isn't fatal. As such, our monitoring treats these fatals the same as they would if they ocurred on index.php and raises an alert if their frequency is significantly raised.

(e.g. something naîve lke showing the the two revisions side-by-side in their entirety in plain text (like we do for the first revision of a page).

That doesn't sound like a particularly useful fallback to me.

Perhaps the empty string would suffice then, or the timeout could be formalised as a known failure scenario with a dedicated error code. The take-away is that unless the fatal is due to a simple or obvious programmer error that we can fix, then the error needs to be acknowledged in some way (possibly temporarily) so that we can produce a response that doesn't raise the fatal/exception levels in Logstash, and is (slightly) more useful to users than the current error.

or the timeout could be formalised as a known failure scenario with a dedicated error code.

Unfortunately the WMFTimeoutException, as its name implies, is a WMF hack rather than something in MediaWiki itself that the API or the diff code could legitimately handle.

Krinkle renamed this task from Fatal WMFTimeoutException for ApiComparePages requests to Diffs somtimes fatal due to request timeout via ApiComparePages (revision comparisons).Feb 4 2020, 6:19 PM
Krinkle renamed this task from Diffs somtimes fatal due to request timeout via ApiComparePages (revision comparisons) to Diffs sometimes fatal due to request timeout via ApiComparePages (revision comparisons).

I see. Yeah, I suppose that's hard to handle indeed. If the criteria we have to decide whether to "support" a diff is time-spent, then maybe this could be incorporated into the diffing code itself. Eg. between its smaller internal steps, it could throw a diff-specific (not WMF-speific) error that the budget was exceeded.

Alternatively, if we dont think there are other ways diffing can fail and/or if those ways do not need to raise fatal report levels (e.g. if we're certain they won't be caused by MW deployments), then they could be having a catch-all around them in which the user is informed about the diff being unavailable.

@Anomie What direction would you prefer to take this in? Is it plausible for wikidiff2 to always produce a diff within a certain timeframe (within the limits of the text we allow to be fed to the diff engine), and/or are there other ways of falling back that you think would be useful?

I don't know much about wikidiff2, but I suspect it's similar to a parse in that there are probably complex cases that will always take a long time. I also don't know whether or not it has sufficient internal "steps" where a time check could be inserted.

daniel moved this task from Backlog to Later on the Platform Team Workboards (Clinic Duty Team) board.
daniel added a subscriber: Clarakosi.

Could be treated as duplicate of T204010, but not sure