
action=history with a high limit like >= 2000, can be slow and might timeout
Open, Needs Triage, Public, Security

Description

While investigating a page, I noticed requests being made for various pages with action=history&limit=2000. MediaWiki offers users the choices [ 20, 50, 100, 250, 500 ] but caps the maximum at 5000.

This matches up with https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser showing the slowest query as /* IndexPager::buildQueryInfo (history page unfiltered) */ with limit 2001.

Some/many (I didn't count) of the limit=2000 requests just timed out because of the 60s wall-clock limit. It would probably be better to cap the limit at 500 or 1000 rather than trigger long-running queries that don't finish in time anyway.

Web request logs via @colewhite: https://logstash.wikimedia.org/goto/2c5ae987088eac876305e7ad384920c1 - the requests are coming from multiple IPs; maybe some script is making them? Some UAs seem faked.

Event Timeline

Matching the actual cap with the largest number we display makes sense to me. So 500? Or 1000, but then also show a link for 1000?

Also, we couldn't find an option to configure this limit in mediawiki-config. It would be desirable to be able to change it without having to make a change in core. (cc: @RLazarus)

As a temporary fix @BBlack deployed a VCL change that blocks '&limit=2000&action=history'.

As Legoktm pointed out on IRC, the 5000 limit is hard-coded at includes/pager/IndexPager.php#325.
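For illustration, a minimal sketch of what such a configuration knob could look like, assuming a hypothetical $wgHistoryPageMaxLimit setting (it does not exist today) consulted wherever the pager caps the requested limit:

```php
// Hypothetical setting, e.g. set from LocalSettings.php / mediawiki-config:
$wgHistoryPageMaxLimit = 1000;

// Sketch of a pager-side cap using it (illustrative only, not the actual
// IndexPager code):
function capHistoryLimit( int $requestedLimit, int $configuredMax ): int {
	// Clamp to at least 1 and at most the configured ceiling.
	return min( max( $requestedLimit, 1 ), $configuredMax );
}

// capHistoryLimit( 2000, 1000 ) === 1000
```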

I regularly use limits of 5000 in all the UIs where I can. I think one of the special pages was recently changed to a cap of 1000 or so, and it is annoying 😅. It would be preferable to leave it be, at least for users with the apihighlimits permission (even though this is the UI, not the API) or something.

I have also used limit=5000 many times (e.g. when searching for edits to a page that touched a specific section, based on the edit summary). The page would take a bit longer to display but never timed out. Maybe this is a recent performance regression?

It may or may not be a regression -- we were getting a fair amount of this traffic during the incident, not just isolated requests. One hypothesis is that we can handle a single heavy history query from the occasional power user here and there, but even a smallish distributed flood of them still constitutes a DOS vector.

In principle we could continue to safely offer the high cap, by adding a fairly sophisticated rate limiter that accounts for request cost, so that expensive requests like this could be charged against a much lower rate limit than ordinary traffic (but still nonzero). That could be based on anything from a rough classification into discrete categories of "small" and "large" requests, to a fully dynamic token-bucket system where tokens are withdrawn according to actual measured resource consumption.
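As a rough illustration of the cost-based token-bucket idea (a sketch only; the class and numbers below are invented, not anything that exists in MediaWiki), each client holds a bucket that refills at a fixed rate, and every request withdraws tokens in proportion to its measured cost:

```php
<?php
// Illustrative cost-based token bucket; all names and numbers are invented.
class CostTokenBucket {
	/** @var float Maximum tokens a client can accumulate */
	private $capacity;
	/** @var float Tokens added back per second */
	private $refillPerSec;
	/** @var float Tokens currently available */
	private $tokens;
	/** @var float Timestamp of the last refill */
	private $lastRefill;

	public function __construct( $capacity = 60.0, $refillPerSec = 1.0 ) {
		$this->capacity = $capacity;
		$this->refillPerSec = $refillPerSec;
		$this->tokens = $capacity;
		$this->lastRefill = microtime( true );
	}

	/**
	 * Charge $cost tokens (e.g. measured query time in seconds).
	 * Returns false when the bucket cannot cover the cost and the
	 * request should be rejected or queued.
	 */
	public function consume( $cost ) {
		$now = microtime( true );
		$this->tokens = min(
			$this->capacity,
			$this->tokens + ( $now - $this->lastRefill ) * $this->refillPerSec
		);
		$this->lastRefill = $now;
		if ( $this->tokens < $cost ) {
			return false;
		}
		$this->tokens -= $cost;
		return true;
	}
}

// A cheap request might be charged ~1 token, while a limit=2000 history view
// might be charged ~30, so a single client can only run a couple per minute.
$bucket = new CostTokenBucket();
var_dump( $bucket->consume( 1.0 ) );  // true
var_dump( $bucket->consume( 30.0 ) ); // true
var_dump( $bucket->consume( 30.0 ) ); // false until the bucket refills
```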

We've kicked those ideas around a few times before, but we don't have the apparatus to build them out properly right now. In the short-to-medium term, if we can make this query more performant, so much the better -- but failing that, I think we need to at least add the config knob @Dzahn mentions, so that we can respond when it's used as an attack vector.

We could also throttle it in MW with User::pingLimiter - it stores throttles in ObjectCache::getLocalClusterInstance.
The throttler is not called from the endpoint in question yet, so it needs a core change. After that the settings are configurable and pretty flexible https://www.mediawiki.org/wiki/Manual:$wgRateLimits
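For reference, $wgRateLimits entries follow a ( group => [ max hits, per seconds ] ) structure; the 'history-high-limit' key below is hypothetical and would only take effect once the endpoint actually calls the throttler with it:

```php
// Hypothetical throttle key for expensive history views. The array structure
// is the standard $wgRateLimits format; the key itself does not exist yet.
$wgRateLimits['history-high-limit'] = [
	'ip'     => [ 2, 60 ],   // 2 such requests per 60 seconds per IP
	'newbie' => [ 2, 60 ],   // newly registered accounts
	'user'   => [ 10, 60 ],  // established registered users
];
```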

> I have also used limit=5000 many times (e.g. when searching for edits to a page that touched a specific section, based on the edit summary). The page would take a bit longer to display but never timed out. Maybe this is a recent performance regression?

The excimer 60 second timeout is somewhat new. Looking through the timeout tracebacks in logstash there are two common ones:

#10 /srv/mediawiki/php-1.37.0-wmf.7/includes/cache/LinkCache.php(350): Wikimedia\Rdbms\DBConnRef->selectRow(string, array, array, string)
#11 /srv/mediawiki/php-1.37.0-wmf.7/includes/cache/LinkCache.php(299): LinkCache->fetchPageRow(Wikimedia\Rdbms\DBConnRef, Title)
#12 /srv/mediawiki/php-1.37.0-wmf.7/includes/linker/LinkRenderer.php(428): LinkCache->addLinkObj(Title)
#13 /srv/mediawiki/php-1.37.0-wmf.7/includes/linker/LinkRenderer.php(244): MediaWiki\Linker\LinkRenderer->getLinkClasses(Title)
#14 /srv/mediawiki/php-1.37.0-wmf.7/includes/actions/pagers/HistoryPager.php(567): MediaWiki\Linker\LinkRenderer->makeKnownLink(Title, HtmlArmor, array, array)
#15 /srv/mediawiki/php-1.37.0-wmf.7/includes/actions/pagers/HistoryPager.php(362): HistoryPager->curLink(MediaWiki\Revision\RevisionStoreRecord)
#16 /srv/mediawiki/php-1.37.0-wmf.7/includes/actions/pagers/HistoryPager.php(159): HistoryPager->historyLine(stdClass, stdClass, boolean, boolean, boolean)

Why isn't this title being batched and already in the LinkCache?

And then inserting the Thanks link:

#13 /srv/mediawiki/php-1.37.0-wmf.7/extensions/Thanks/includes/Hooks.php(64): MediaWiki\Revision\RevisionStore->getPreviousRevision(MediaWiki\Revision\RevisionStoreRecord)
#14 /srv/mediawiki/php-1.37.0-wmf.7/includes/HookContainer/HookContainer.php(330): MediaWiki\Extension\Thanks\Hooks::insertThankLink(MediaWiki\Revision\RevisionStoreRecord, array, MediaWiki\Revision\RevisionStoreRecord, User)
#15 /srv/mediawiki/php-1.37.0-wmf.7/includes/HookContainer/HookContainer.php(137): MediaWiki\HookContainer\HookContainer->callLegacyHook(string, array, array, array)
#16 /srv/mediawiki/php-1.37.0-wmf.7/includes/HookContainer/HookRunner.php(1998): MediaWiki\HookContainer\HookContainer->run(string, array)
#17 /srv/mediawiki/php-1.37.0-wmf.7/includes/actions/pagers/HistoryPager.php(494): MediaWiki\HookContainer\HookRunner->onHistoryTools(MediaWiki\Revision\RevisionStoreRecord, array, MediaWiki\Revision\RevisionStoreRecord, User)
#18 /srv/mediawiki/php-1.37.0-wmf.7/includes/actions/pagers/HistoryPager.php(159): HistoryPager->historyLine(stdClass, stdClass, boolean, boolean, boolean)

I'm not really opposed to adding a "uihighlimits" to match "apihighlimits" as @Base suggested, but it doesn't feel super great.

> We could also throttle it in MW with User::pingLimiter - it stores throttles in ObjectCache::getLocalClusterInstance.
> The throttler is not called from the endpoint in question yet, so it needs a core change. After that the settings are configurable and pretty flexible https://www.mediawiki.org/wiki/Manual:$wgRateLimits

I'm not sure this would work since the requests were coming from all over the place, not just one IP or a single range of IPs.

Does this task really need the view policy when apparently everyone knows about it already?

> Does this task really need the view policy when apparently everyone knows about it already?

I made it private because it triggered pages and we responded to it in an incident-like fashion. I'll make it public now though.

Legoktm changed the visibility from "Custom Policy" to "Public (No Login Required)". Jun 16 2021, 7:49 PM
Legoktm changed the edit policy from "Custom Policy" to "All Users".

> Why isn't this title being batched and already in the LinkCache?

The history page calls RevisionStore::getRelativeRevision, which calls Title::castFromPageReference, and that ends up in Title::resetArticleID, which removes the (preloaded) link via LinkCache::clearLink.

Looks like an issue from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/673140, where a newFromRow call was replaced by the cast; newFromRow does not reset the ID.
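To illustrate the preload-then-evict pattern described above (a simplified sketch, not the actual HistoryPager code, though the class names follow MediaWiki core):

```php
// 1. The pager batch-loads all result titles into LinkCache up front, so
//    LinkRenderer::makeKnownLink() should not need a per-row query.
$batch = new LinkBatch();
foreach ( $rows as $row ) {
	$batch->add( $row->page_namespace, $row->page_title );
}
$batch->execute(); // one query populates LinkCache for every title

// 2. While rendering a row, the call path described above is hit:
//    RevisionStore::getRelativeRevision()
//      -> Title::castFromPageReference()
//      -> Title::resetArticleID()
//      -> LinkCache::clearLink()   // evicts the entry that was just preloaded

// 3. The next makeKnownLink() for that title misses the cache, falls through
//    to LinkCache::fetchPageRow(), and issues the per-row selectRow() seen in
//    the timeout traces.
```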

I used https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/701995 to generate a log of queries. I logged about 350 queries for an English Wikipedia history view with limit=50. It's pretty bad. I'm treating this bug as "history is slow" not "5000 is too many", at least until we've addressed some of the low-hanging fruit.

Change 702000 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add statsd timing for actions

https://gerrit.wikimedia.org/r/702000

Change 702005 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/Thanks@master] Fix history tools performance

https://gerrit.wikimedia.org/r/702005

Change 702000 merged by jenkins-bot:

[mediawiki/core@master] Add statsd timing for actions

https://gerrit.wikimedia.org/r/702000

Change 702010 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.37.0-wmf.11] Add statsd timing for actions

https://gerrit.wikimedia.org/r/702010

Change 702010 merged by jenkins-bot:

[mediawiki/core@wmf/1.37.0-wmf.11] Add statsd timing for actions

https://gerrit.wikimedia.org/r/702010

Mentioned in SAL (#wikimedia-operations) [2021-06-29T06:46:18Z] <tstarling@deploy1002> Synchronized php-1.37.0-wmf.11/includes/MediaWiki.php: Add statsd action timing metric T284274 (duration: 00m 58s)

Change 702005 merged by jenkins-bot:

[mediawiki/extensions/Thanks@master] Fix history tools performance

https://gerrit.wikimedia.org/r/702005

Low-hanging fruit in ascending order by altitude:

  • Thanks (done above).
  • GlobalUserPage (T101348). I wrote up some ideas at T88644.
  • Linker::formatComment(). Maybe it just looks high up because nobody has touched it for so long. We need to be able to format a batch of comments, instead of just doing one at a time, so that we can do batch existence checks. The issue dates back to 2004, but I don't think there's a bug for it.

I created T285917 for this.

Legoktm renamed this task from "action=history allows for limits as high as 5000, which is probably too high" to "action=history with a high limit like >= 2000, can be slow and might timeout". Aug 4 2021, 8:17 PM