
Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException
Open, High, Public

Description

Error message
A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 

Function: IndexPager::buildQueryInfo (contributions page unfiltered)
Error: 0
Impact

~1400 errors an hour in logstash; this is by far the main production error right now.

Notes

A request with a limit of 50000(!) (which MW moderated down to 5000) and a time range of 2005–2017 suggests this is a query from a badly behaved bot or scraper. The UA used by this client was just an unchanged-from-defaults python-requests/x.y.z. In the future we should find time to heavily rate-limit or block default User-Agents, per our own (currently unenforced) policy.

Details

Request ID
AW2NGujwx3rdj6D82Ozi
Request URL
en.wikipedia.org/w/index.php?limit=50000&title=Special%3AContributions&contribs=user&target=AxG&namespace=&tagfilter=&start=2005-01-01&end=2017-06-30
Stack Trace
#0 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1573): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(NULL, integer, string, string)
#1 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1152): Wikimedia\Rdbms\Database->reportQueryError(NULL, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1806): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/DBConnRef.php(315): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#5 /srv/mediawiki/php-1.34.0-wmf.24/includes/specials/pagers/ContribsPager.php(205): Wikimedia\Rdbms\DBConnRef->select(array, array, array, string, array, array)
#6 /srv/mediawiki/php-1.34.0-wmf.24/includes/pager/IndexPager.php(263): ContribsPager->reallyDoQuery(string, integer, boolean)
#7 /srv/mediawiki/php-1.34.0-wmf.24/includes/pager/IndexPager.php(609): IndexPager->doQuery()
#8 /srv/mediawiki/php-1.34.0-wmf.24/includes/specials/SpecialContributions.php(226): IndexPager->getNumRows()
#9 /srv/mediawiki/php-1.34.0-wmf.24/includes/specialpage/SpecialPage.php(575): SpecialContributions->execute(NULL)
#10 /srv/mediawiki/php-1.34.0-wmf.24/includes/specialpage/SpecialPageFactory.php(611): SpecialPage->run(NULL)
#11 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(296): MediaWiki\Special\SpecialPageFactory->executePath(Title, RequestContext)
#12 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(896): MediaWiki->performRequest()
#13 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(527): MediaWiki->main()
#14 /srv/mediawiki/php-1.34.0-wmf.24/index.php(44): MediaWiki->run()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}

Event Timeline


How hard would it be to make the limit 500 instead of 5000?

Would we need to communicate such a change with the community?

mark added a subscriber: mark.

Could CPT take a look at this please? Thanks!

@mark yep, absolutely.

@Anomie Could you review this next? The limit change could go out in SWAT, right?

I agree; it looks like the query itself is behaving well, and the problem was most likely caused by someone hammering the special page with a limit of 5000. We may as well try reducing the limit to 500.
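For illustration only (not necessarily what the actual patch does), here is a minimal sketch of where such a ceiling could be enforced, assuming the existing 5000 cap in IndexPager::setLimit() as in the 1.34-era code:

```php
// Sketch only -- not necessarily the actual deployed patch. Assuming the
// existing 5000 ceiling in IndexPager::setLimit(), lowering it to 500
// bounds the cost of a single request.
public function setLimit( $limit ) {
	$limit = (int)$limit;
	// Previously capped at 5000 (matching WebRequest::getLimitOffset());
	// lower the ceiling so one request can't ask for thousands of rows.
	if ( $limit > 500 ) {
		$limit = 500;
	}
	if ( $limit > 0 ) {
		$this->mLimit = $limit;
	}
}
```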

Ugh, private security bug :(

User-notice: A notice something like the following would probably be appropriate. I'll leave it up to you to decide which week to send it out, depending on when the patch above gets merged.

The limit for Special:Contributions can be set as high as 5000 by editing the URL. This will be lowered to 500 to improve performance.

sbassett added a subscriber: Reedy. Tue, Oct 22, 2:44 PM

+1 to the patch above.

@Anomie @WDoranWMF - @Reedy or I could also sec-deploy this sooner, request a CVE and track it at T233495. Just let us know.

@sbassett if you could that would be great, thanks!

Reedy added a comment (edited). Tue, Oct 22, 5:20 PM

Would we need to communicate such a change with the community?

History says it won't take long before someone notices and reports a "limit=foo isn't returning foo results" bug.

Obviously any User-notice type announcement will make the mitigation public, meaning there isn't much point in keeping the security bug private. And if we take my comment above into account, someone will notice...

I do question whether that patch should be landing in master.

Were there many concurrent requests from the same person? Should we be looking at preventing so many concurrent requests?

The bug was turned into a private issue because of potential DoS concerns: a relatively low volume of these long-date-range limit=5000 Special:Contributions queries (a volume that I could trivially send from my laptop, for instance) creates sufficient load to cause a great deal of hurt for our database servers.

I think the underlying issue here is more an artifact of the size of enwiki's data set than an inherent security flaw in MediaWiki, but that's just an opinion.

My thoughts are that we should both land the patch in master and communicate via User-notice, as someone will notice. limit=5000 queries aren't too uncommon -- according to Turnilo we received over 10 million queries matching the regex Special.*limit=5000 in just the last week.

Were there many concurrent requests from the same person?

In this case, yes, the queries were all from one user running a scraper script (User-Agent: python-requests/x.y.z), first from an Indian university and then from Amazon EC2. To prevent an outage, we needed to block their IP in both cases (and now have a mechanism for quickly blocking just "bot-like" User-Agents like python-requests/foo).

We reached out to them in several ways -- contacting their university, contacting AWS's abuse department, and serving them an error message asking them to email noc@wikimedia.org so we could give advice on how to send less harmful traffic -- but we never heard anything back.

Should we be looking at preventing so many concurrent requests?

IMO very much yes, but it certainly is not trivial to do so in our infrastructure. What we would really want here is an understanding of which queries are cheap vs expensive, and a way to account for that for scraper traffic...

Update: @Anomie's patch from T234450#5595510 was deployed to wmf.2 (sal), committed in /srv/patches/1.35.0-wmf.3/core/, and applied to /srv/mediawiki-staging/php-1.35.0-wmf.3 (sal), per conversation with @LarsWirzenius, @brennen and @thcipriani. The test URL from the description now returns only 500 rows, no matter what limit is set to.

Per comments from @Reedy and @CDanis above, I'll hold off on requesting a CVE and tracking at T233495 if the patch won't be backported.

Reedy added a comment. Tue, Oct 22, 6:12 PM

Should we be looking at preventing so many concurrent requests?

IMO very much yes, but it certainly is not trivial to do so in our infrastructure. What we would really want here is an understanding of which queries are cheap vs expensive, and a way to account for that for scraper traffic...

MW has rate limiting...

@Reedy Can you elaborate? The only rate-limiting I'm aware of is the simultaneous edit limiting provided by PoolCounter, and the limited support for limiting simultaneous requests from the same IP address in Varnish.

@Reedy Can you elaborate? The only rate-limiting I'm aware of is the simultaneous edit limiting provided by PoolCounter, and the limited support for limiting simultaneous requests from the same IP address in Varnish.

https://www.mediawiki.org/wiki/Manual:$wgRateLimits

While I concur it's not exactly what we necessarily want (i.e. preventing excessive simultaneous requests), it wouldn't be unreasonable to add a new limit type there, plus the code to use it, to prevent X requests to a page in Y time.
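For reference, a hypothetical new bucket in $wgRateLimits could look like the following; the 'contributions-pager' key and the numbers are invented for illustration, not an existing limit type:

```php
// Hypothetical example only: the 'contributions-pager' action name and
// the thresholds are illustrative, not an existing MediaWiki limit bucket.
$wgRateLimits['contributions-pager'] = [
	// Anonymous users: at most 10 requests per 60 seconds per IP.
	'anon' => [ 10, 60 ],
	// Logged-in users: at most 30 requests per 60 seconds.
	'user' => [ 30, 60 ],
	// Users with the 'noratelimit' right are exempt automatically.
];
```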

Reedy updated the task description. Tue, Oct 22, 10:11 PM
CDanis updated the task description. Wed, Oct 23, 1:34 PM
Krinkle renamed this task from Some extreme requests for Special:Contributions are emitting database query errors or WMFTimeoutException to Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException. Wed, Oct 23, 7:44 PM
Krinkle updated the task description.

A community member noticed this today; seen in #wikimedia-tech on IRC:

12:24:33	<Nemo_bis>	Since when a limit > 500 no longer works in Special:Contributions? https://it.wikipedia.org/w/index.php?title=Speciale:Contributi&limit=550&target=Nemo+bis

It'd be great to at least un-protect this issue.

Reedy added a comment. Sun, Oct 27, 4:34 PM

A community member noticed this today; seen in #wikimedia-tech on IRC:

12:24:33	<Nemo_bis>	Since when a limit > 500 no longer works in Special:Contributions? https://it.wikipedia.org/w/index.php?title=Speciale:Contributi&limit=550&target=Nemo+bis

It'd be great to at least un-protect this issue.

Would we need to communicate such a change with the community?

History says it won't take long before someone notices and reports a "limit=foo isn't returning foo results" bug.

:)

Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)". Sun, Oct 27, 4:35 PM

It's back: {P9500}

https://logstash.wikimedia.org/goto/a1dee7e0b0ea9c2a190923608df25980
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572365673468&to=1572369099222&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All

I think it is the query, not the limit. It didn't help that we were at 50% resources due to maintenance on db1099, but for regular traffic that is more than enough.

JJMC89 added a subscriber: JJMC89. Wed, Oct 30, 12:20 AM

Fetching 5000 results via the web UI is very useful when researching things, but it certainly doesn't need to happen at any high rate of speed. Could access to this limit be restored, perhaps restricted to users with a certain permission such as (noratelimit)?

I often want to search through more than 500 items at a time. I just click on the "next 500" link that's available in the UI, then add an extra 0 to the URL. Not being able to do that is a problem.

There's a thread going on at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Changes_to_Special:Contributions where several people are expressing the same opinion.

If the issue is, "Occasional 5000 queries are OK, it's when we get hit with masses of them at once", then it seems like some sort of rate limit would be the answer. But, if I'm writing some automation that has to access the full history of something, is it really any more efficient to page through the history 500 items at a time, compared to 1/10th as many queries 5000 at a time?

If the issue is, "Occasional 5000 queries are OK, it's when we get hit with masses of them at once", then it seems like some sort of rate limit would be the answer. But, if I'm writing some automation that has to access the full history of something, is it really any more efficient to page through the history 500 items at a time, compared to 1/10th as many queries 5000 at a time?

If you are building automated tools, have you looked at the replica DBs on WMF Labs?

My comment about the automated tools was an afterthought. My main point was that as a human exploring user and article histories, being limited to 500 per page is a real problem.

TheDJ added a subscriber: TheDJ. Wed, Oct 30, 1:30 PM

I think that adding a 'pager query' category to $wgRateLimits, used for these types of requests, would be ideal.

That assumes that the rate is actually the problem, and not some deeper cause in the specific query itself, as @jcrespo seems to be pondering.
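For illustration, a minimal sketch of the enforcement side of that idea, assuming a hypothetical 'pagerquery' bucket in $wgRateLimits and the existing User::pingLimiter() check; the bucket name and placement are illustrative, not actual MediaWiki code:

```php
// Sketch: check a hypothetical 'pagerquery' rate-limit bucket near the top
// of SpecialContributions::execute(), before the expensive pager query runs.
$user = $this->getUser();
// pingLimiter() increments the counter and returns true once the configured
// $wgRateLimits['pagerquery'] threshold is exceeded.
if ( $user->pingLimiter( 'pagerquery' ) ) {
	// Renders the standard "actionthrottled" error page.
	throw new ThrottledError();
}
```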

This new limit is a significant pain in the ass. I'd be fine with throttling how often it can be used, but when browsing my own [over one quarter million] contributions, or any other highly active editor's contributions [like a bot's], the ability to see more than 500 contributions at a time is a great time saver. Sometimes all I know is that I/a bot edited something in early 2019, but if I interpret that to mean January–March 2019, that can span 20,000 edits easily. Browsing 4 pages of edits is way better than browsing 40 such pages.

MER-C added a subscriber: MER-C. Thu, Oct 31, 10:29 AM

Why not make the higher limits a privileged action, just like it is in the API?

SD0001 added a subscriber: SD0001. Fri, Nov 1, 7:41 AM

Why not make the higher limits a privileged action, just like it is in the API?

Why should that be? It's not just admins who would benefit from being able to see >500 revisions per page.

On very large wikis, users may still be able to craft queries that time out, such as P9520, even if the limit is 50 (or even 1).

Jc86035 added a subscriber: Jc86035. Mon, Nov 4, 3:20 PM

CPT: please take a new look, thanks :)

I think it would be interesting to try adding a per-user concurrency limit of say 2, enforced with PoolCounter. I don't think PoolCounter has been used in that way before, but I think it should work. I'm assuming the slow query times were caused by an overload, which was in turn caused by high concurrency, which seems to be the conclusion of the comments above, but I haven't verified that. I'm not sure if also adding rate limits would be useful -- a concurrency limit implicitly limits the rate. The only advantage of a rate limit would be to prevent large numbers of cheap queries, but I'm not sure if it's worth preventing that.

If the user is not logged in, we would use the IP address in the PoolCounter key instead of the username, which brings with it the risk of unintentionally limiting shared IPs. It's better than the alternatives; I just think it affects how we tune the "maxqueue" configuration variable and how we decide what error message to deliver when maxqueue is exceeded.

To get more specific about the implementation: PoolCounterWorkViaCallback can be used. ContribsPager would supply an error callback, which is called with a Status object. If there was an error, it should show an error page, wrapping the message from the Status in some message specific to ContribsPager. For page views, the wrapper message is "Sorry, the servers are overloaded at the moment. Too many users are trying to view this page. Please wait a while before you try to access this page again." For this per-user concurrency limit, the message could instead say that too many requests are being made with your username or from your IP address.
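A rough sketch of how that wiring could look inside the pager, assuming the existing PoolCounterWorkViaCallback helper; the pool name 'SpecialContributions', the key format, and the message key are all hypothetical:

```php
// Sketch only: per-user/IP concurrency limiting around the contributions
// query. The pool name, key format, and message key are hypothetical.
$user = $this->getUser();
$key = 'special-contributions:' . ( $user->isAnon()
	? 'ip:' . $this->getRequest()->getIP()
	: 'user:' . $user->getId() );

$work = new PoolCounterWorkViaCallback( 'SpecialContributions', $key, [
	'doWork' => function () {
		// Run the expensive query only while we hold a worker slot.
		return $this->reallyDoQuery( $this->mOffset, $this->mLimit, $this->mIsBackwards );
	},
	'error' => function ( Status $status ) {
		// Too many concurrent requests for this user/IP: show a wrapper
		// error page instead of hitting the database again.
		throw new ErrorPageError( 'contributions-limit-title', $status->getMessage() );
	},
] );
$result = $work->execute();
```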

Kaartic added a subscriber: Kaartic. Fri, Nov 8, 5:25 AM
jcrespo added a comment (edited). Fri, Nov 8, 8:43 AM

@tstarling: Do you know if PoolCounter could return 429 or another custom HTTP error? 429 is what is returned at the cache layer for misbehaving robots (at a much higher concurrency).

Thanks for your feedback.

Bawolff added a subscriber: Bawolff. Fri, Nov 8, 1:31 PM

I think it would be interesting to try adding a per-user concurrency limit of say 2, enforced with PoolCounter. I don't think PoolCounter has been used in that way before, but I think it should work.

It's sort of similar to how the slots parameter works with image rendering.

Reedy added a comment. Fri, Nov 8, 2:32 PM

@tstarling: Do you know if poolcounter could return 429 or other custom HTTP error? 429 is what is returned at cache layer for misbehaving robots (at a much larger concurrency).
Thanks for your feedback.

MW itself handles the error (using the Status object it gets back from PoolCounter, with the error message inside) and decides what to return to the client. In theory that's possible.

E.g. https://github.com/wikimedia/mediawiki/blob/8a4917968baece3fccf9428ea7fe7e71be5b3858/includes/diff/TextSlotDiffRenderer.php#L144-L159
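On the 429 question specifically, one hedged possibility: the same wiring as in the earlier PoolCounter sketch, but with the error callback mapping a "pool full" Status to MediaWiki's existing HttpError exception, which sets the HTTP status code directly. $key is the same hypothetical per-user/IP key as before:

```php
// Sketch: translate a PoolCounter "pool full" failure into an HTTP 429
// using the existing HttpError exception. $key is hypothetical (see above).
$work = new PoolCounterWorkViaCallback( 'SpecialContributions', $key, [
	'doWork' => function () {
		return $this->reallyDoQuery( $this->mOffset, $this->mLimit, $this->mIsBackwards );
	},
	'error' => function ( Status $status ) {
		// 429 Too Many Requests, matching what the cache layer already
		// returns for misbehaving clients.
		throw new HttpError( 429, $status->getMessage() );
	},
] );
$result = $work->execute();
```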

CDanis added a comment (edited). Fri, Nov 8, 3:47 PM

+1 to @tstarling's proposal.

My impression is also that concurrency limits should be sufficient in this case.

The only thing I have to add is that we should make sure we have good observability of when we're throttling, ideally in both logs and metrics.

Oh, and that the concurrency limit should be a configuration variable, rather than hardcoded :)

It's back: {P9500}
https://logstash.wikimedia.org/goto/a1dee7e0b0ea9c2a190923608df25980
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572365673468&to=1572369099222&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
I think it is the query, not the limit. It didn't help that we were at 50% resources due to maintenance on db1099, but for regular traffic that is more than enough.

I think it is the combination of both.
The query itself with the 500 limit isn't super heavy (it takes around 0.10-0.30 seconds to run), but as I mentioned in T234450#5566351, if we get thousands of queries without any concurrency limit, we are going to hit issues anyway, regardless of how fast or slow the query is :(

I think this change is a really retrograde step, and I hope a way can be found to bring the old limit back. It is only rarely that I need to change the 500 limit to 5000, but my goodness, when I do, it's so valuable. Having to copy and paste entries into Excel before I can work through a moderately large number of edits would be a real pain. I'd even be happy to enter a captcha code or something similar if thousands of automated requests could potentially bring the system to its knees, so we can be sure that big searches are only done for a good reason and by a real person. Please find a way of restoring this valuable feature.

Xaosflux added a comment (edited). Sat, Nov 16, 2:42 PM

Any updates on this? It has been weeks; the userbase is clearly saying that useful functionality has been degraded, and this ticket isn't even assigned to anyone.

Why not build a tool that mimics a 5000 (or larger) request by stitching together 500-result blocks from the API? This tool could be linked at the top of commonly used pages like user contributions. It could handle DoS by using OAuth and not accepting more than one query per user at a time.
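For what it's worth, a rough sketch of such a tool's core loop, using the Action API's list=usercontribs with standard continuation; the tool itself (and the choice of PHP) is hypothetical:

```php
<?php
// Hypothetical external tool: fetch a user's contributions in 500-row
// batches from the Action API and stitch them together client-side.
$api = 'https://en.wikipedia.org/w/api.php';
$params = [
	'action'  => 'query',
	'list'    => 'usercontribs',
	'ucuser'  => 'AxG',   // example target user
	'uclimit' => 500,     // per-request maximum for ordinary accounts
	'format'  => 'json',
];
$all = [];
do {
	$resp = json_decode(
		file_get_contents( $api . '?' . http_build_query( $params ) ),
		true
	);
	$all = array_merge( $all, $resp['query']['usercontribs'] ?? [] );
	// Follow the API's continuation parameters until the list is exhausted.
	$params = array_merge( $params, $resp['continue'] ?? [] );
} while ( isset( $resp['continue'] ) );

echo count( $all ) . " contributions fetched\n";
```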

If a tool is created, it should query the database directly (the replicas have a near-infinite time limit) instead of calling the API.

I think @Green_Cardamom is suggesting a userland "tool" (likely a client-side JavaScript API shunt) to work around this new limitation; such a tool would not have any sort of direct database access?

AlexisJazz added a subscriber: AlexisJazz (edited). Sun, Nov 17, 8:09 AM

I used this on a regular basis on Commons. The tools for searching contributions are frequently inadequate for my purposes (as in: Ctrl+F is way better), and/or I want some statistics, and the number of contributions on Commons is huge compared to Wikipedia. If FlickreviewR messed up 2000 files, I need a damn list. I can see this making little sense on other wikis, though.

I'm no bot btw.

I'm no bot btw.

Why would we think that?

I think @Green_Cardamom is suggesting a userland "tool" (likely a client-side JavaScript API shunt) to work around this new limitation; such a tool would not have any sort of direct database access?

I think they're suggesting an external tool (likely hosted on Toolforge) which can indeed query the replica databases directly.