
Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException
Open, HighPublic

Description

Error message
A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 

Function: IndexPager::buildQueryInfo (contributions page unfiltered)
Error: 0
Impact

~1400 errors an hour in logstash; this is by far the main production error right now.

Notes

A request with a limit of 50000(!) (which MW moderated down to 5000) and a time range of 2005–2017 suggests this is a query from a badly-behaved bot or scraper. The UA used by this client was just an unchanged-from-defaults python-requests/x.y.z. In the future we should find time to heavily ratelimit or block default User-Agents, per our own (unenforced) policy.

Details

Request ID
AW2NGujwx3rdj6D82Ozi
Request URL
en.wikipedia.org/w/index.php?limit=50000&title=Special%3AContributions&contribs=user&target=AxG&namespace=&tagfilter=&start=2005-01-01&end=2017-06-30
Stack Trace
#0 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1573): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(NULL, integer, string, string)
#1 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1152): Wikimedia\Rdbms\Database->reportQueryError(NULL, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1806): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/DBConnRef.php(315): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#5 /srv/mediawiki/php-1.34.0-wmf.24/includes/specials/pagers/ContribsPager.php(205): Wikimedia\Rdbms\DBConnRef->select(array, array, array, string, array, array)
#6 /srv/mediawiki/php-1.34.0-wmf.24/includes/pager/IndexPager.php(263): ContribsPager->reallyDoQuery(string, integer, boolean)
#7 /srv/mediawiki/php-1.34.0-wmf.24/includes/pager/IndexPager.php(609): IndexPager->doQuery()
#8 /srv/mediawiki/php-1.34.0-wmf.24/includes/specials/SpecialContributions.php(226): IndexPager->getNumRows()
#9 /srv/mediawiki/php-1.34.0-wmf.24/includes/specialpage/SpecialPage.php(575): SpecialContributions->execute(NULL)
#10 /srv/mediawiki/php-1.34.0-wmf.24/includes/specialpage/SpecialPageFactory.php(611): SpecialPage->run(NULL)
#11 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(296): MediaWiki\Special\SpecialPageFactory->executePath(Title, RequestContext)
#12 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(896): MediaWiki->performRequest()
#13 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(527): MediaWiki->main()
#14 /srv/mediawiki/php-1.34.0-wmf.24/index.php(44): MediaWiki->run()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}
Related Gerrit Patches:

Event Timeline

TheDJ added a subscriber: TheDJ.Oct 30 2019, 1:30 PM

I think that adding a 'pager query' category to wgRateLimits that is used for these types of requests would be the most ideal.

That assumes that the rate is actually the problem, and not some sort of deeper cause to the specific query as @jcrespo seems to be pondering.

This new limit is a significant pain in the ass. I'd be fine with throttling how often it can be used, but when browsing my own [over one quarter million] contributions, or any other highly-active editor's contributions [like a bot's], the ability to see more than 500 contributions at a time is a great time saver. Sometimes all I know is that I (or a bot) edited something in early 2019, but if I interpret that to mean January–March 2019, that can span 20,000 edits easily. Browsing 4 pages of edits is way better than browsing 40 such pages.

MER-C added a subscriber: MER-C.Oct 31 2019, 10:29 AM

Why not make the higher limits a privileged action, just like it is in the API?

SD0001 added a subscriber: SD0001.Nov 1 2019, 7:41 AM

Why not make the higher limits a privileged action, just like it is in the API?

Why should that be? It's not just admins who would benefit from being able to see >500 revisions per page.

On very large wikis, users may still be able to trigger queries that time out, such as P9520, even if the limit is 50 (or even 1).

Jc86035 added a subscriber: Jc86035.Nov 4 2019, 3:20 PM

CPT: please take a new look, thanks :)

I think it would be interesting to try adding a per-user concurrency limit of say 2, enforced with PoolCounter. I don't think PoolCounter has been used in that way before, but I think it should work. I'm assuming the slow query times were caused by an overload, which was in turn caused by high concurrency, which seems to be the conclusion of the comments above, but I haven't verified that. I'm not sure if also adding rate limits would be useful -- a concurrency limit implicitly limits the rate. The only advantage of a rate limit would be to prevent large numbers of cheap queries, but I'm not sure if it's worth preventing that.

If the user is not logged in, we would use the IP address in the PoolCounter key instead of the username, which brings with it the risk of unintentionally limiting shared IPs. It's better than the alternatives, I just think it affects how we tune the "maxqueue" configuration variable and how we decide what error message to deliver when maxqueue is exceeded.

To get more specific about the implementation: PoolCounterWorkViaCallback can be used. ContribsPager would supply an error callback, which is called with a Status object. If there was an error, it should show an error page, wrapping the message from the Status in some message specific to ContribsPager. For page views, the wrapper message is "Sorry, the servers are overloaded at the moment. Too many users are trying to view this page. Please wait a while before you try to access this page again." For this per-user concurrency limit, the message could instead say that too many requests are being made with your username or from your IP address.
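The per-key limiting described above can be sketched outside MediaWiki. This toy Python limiter is not MediaWiki's actual PoolCounter client (the class, key format, and limit here are illustrative assumptions); it just shows the intended behaviour of a per-user/per-IP "workers" cap that rejects excess callers rather than queueing them:

```python
import threading

class ConcurrencyLimiter:
    """Toy analogue of PoolCounter's per-key worker limit: at most
    `limit` callers may hold a given key at once; excess callers are
    rejected immediately (real PoolCounter can also queue up to
    'maxqueue' waiters before failing)."""

    def __init__(self, limit):
        self.limit = limit
        self.lock = threading.Lock()
        self.active = {}  # key -> current number of holders

    def try_acquire(self, key):
        with self.lock:
            if self.active.get(key, 0) >= self.limit:
                return False
            self.active[key] = self.active.get(key, 0) + 1
            return True

    def release(self, key):
        with self.lock:
            self.active[key] -= 1

limiter = ConcurrencyLimiter(limit=2)
key = "contribs:user:Example"  # per-username key; per-IP when logged out
assert limiter.try_acquire(key)
assert limiter.try_acquire(key)
assert not limiter.try_acquire(key)  # third concurrent request is rejected
limiter.release(key)
assert limiter.try_acquire(key)      # a slot freed up again
```

The rejection path corresponds to the error callback described above: the caller gets an error page instead of waiting indefinitely for a slot.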

Kaartic added a subscriber: Kaartic.Nov 8 2019, 5:25 AM
jcrespo added a comment.EditedNov 8 2019, 8:43 AM

@tstarling: Do you know if poolcounter could return 429 or other custom HTTP error? 429 is what is returned at cache layer for misbehaving robots (at a much larger concurrency).

Thanks for your feedback.

Bawolff added a subscriber: Bawolff.Nov 8 2019, 1:31 PM

I think it would be interesting to try adding a per-user concurrency limit of say 2, enforced with PoolCounter. I don't think PoolCounter has been used in that way before, but I think it should work.

It's sort of similar to how the slots parameter works with image rendering.

Reedy added a comment.Nov 8 2019, 2:32 PM

@tstarling: Do you know if poolcounter could return 429 or other custom HTTP error? 429 is what is returned at cache layer for misbehaving robots (at a much larger concurrency).
Thanks for your feedback.

MW tells PoolCounter what error to return in case of failure (via the Status object from PoolCounter, with the error message inside), and then does the error handling itself. In theory that's possible.

Eg https://github.com/wikimedia/mediawiki/blob/8a4917968baece3fccf9428ea7fe7e71be5b3858/includes/diff/TextSlotDiffRenderer.php#L144-L159

CDanis added a comment.EditedNov 8 2019, 3:47 PM

+1 to @tstarling's proposal.

My impression is also that concurrency limits should be sufficient in this case.

The only thing I have to add is that we should make sure we have good observability of when we're throttling, ideally in both logs and metrics.

Oh, and that the concurrency limit should be a configuration variable, rather than hardcoded :)

It's back: {P9500}
https://logstash.wikimedia.org/goto/a1dee7e0b0ea9c2a190923608df25980
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1572365673468&to=1572369099222&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All
I think it is the query, not the limit. It didn't help that we were at 50% resources due to maintenance on db1099, but for regular traffic that is more than enough.

I think it is the combination of both.
The query itself with the 500 limit isn't super heavy (it takes around 0.10–0.30 s to run), but as I mentioned in T234450#5566351, if we get thousands of queries without any concurrency limit we are going to hit issues anyway, regardless of how fast or slow the query is :(
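As a back-of-envelope check of that point, Little's law (L = λ·W) relates request rate and query time to in-flight concurrency. The 0.2 s service time comes from the 0.10–0.30 s figure quoted above; the request rates are assumed for illustration:

```python
# Little's law: average number of in-flight queries L = lambda * W,
# where lambda is the arrival rate and W the per-query service time.
def in_flight(arrival_rate_per_s, service_time_s):
    return arrival_rate_per_s * service_time_s

# A scraper firing 500 requests/s of a 0.2 s query holds ~100 queries
# in flight at once -- far beyond any sane per-user concurrency cap.
assert abs(in_flight(500, 0.2) - 100) < 1e-6

# A human paging through results (one request every ~5 s) stays well
# under a concurrency of 1, so a cap of 2-3 never touches them.
assert in_flight(0.2, 0.3) < 1
```

This is why a cheap query can still take down a pool: the damage scales with concurrency, not with per-query cost.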

I think this change is a really retrograde step, and I hope a way to restore the old behaviour can be found. It is only rarely that I need to change the 500 limit to 5000, but my goodness, when I do so it's so valuable. Having to copy and paste entries into Excel before I can work through a moderately large number of edits would be a real pain. I'd even be happy to enter a Captcha code or something similar if thousands of automated requests could potentially bring the system to its knees, so we can be sure that big searches are only done for a good reason and by a real person. Please find a way of restoring this valuable feature.

Xaosflux added a comment.EditedNov 16 2019, 2:42 PM

Any updates on this? It has been weeks, the userbase is clearly saying that useful functionality has been degraded and this ticket isn't even assigned to anyone.

Why not a tool that mimics a 5000 (or larger) request using the API, stitching together blocks of 500? This tool could be linked at the top of commonly used pages like User Contributions. It could handle DoS by using OAuth and not accepting more than 1 query per user at a time.

If a tool is created it should directly query the database (the replicas have a near-infinite time limit) instead of calling the API.

I think @Green_Cardamom is suggesting a userland "tool" (likely a client-side JavaScript API shunt) to work around this new limitation, which would not have any sort of direct database access?

AlexisJazz added a subscriber: AlexisJazz.EditedNov 17 2019, 8:09 AM

I used this on a regular basis on Commons. The tools to search contributions are frequently inadequate for my purposes (as in: Ctrl+F is way better) and/or I want some statistics, and the number of contributions on Commons is huge compared to Wikipedia. If FlickreviewR messed up 2000 files, I need a damn list. I can see this making little sense on other wikis though.

I'm no bot btw.

I'm no bot btw.

Why would we think that?

I think @Green_Cardamom is suggesting a userland "tool" (likely a client-side JavaScript API shunt) to work around this new limitation, which would not have any sort of direct database access?

I think they're suggesting an external tool (likely hosted on Toolforge) which can indeed query the replica databases directly.

How up to date are the replicas? When investigating abuse, you need the full history, including what has happened recently. If you need to use one tool to see the current data and another to see the historical record, that's unworkable.

AlexisJazz added a comment.EditedNov 17 2019, 4:32 PM

I'm no bot btw.

Why would we think that?

The task description was suggesting I'm "a badly-behaved bot or scraper", though I never requested 50000 (because I knew the limit was 5000) and don't use python-requests/x.y.z. (I think)

@Jdforrester-WMF I think the use case for viewing >500 contributions is much greater on Commons due to the sheer volume of edits. Combined with cat-a-lot, it is sometimes the only way to do some things. Was there also scraper abuse on Commons? If not, any chance of increasing the maximum for Commons, if not to 5000 perhaps to 2000? I also wouldn't mind if this was limited to autopatrolled. (only power users do this anyway)

(for avoidance of doubt: this reply is written in my staff capacity)

Any updates on this? It has been weeks, the userbase is clearly saying that useful functionality has been degraded

The current plan is to try using PoolCounter to limit concurrency, per T234450#5646597, and raise the limit back to 5000.

I was asked to look at implementing that at the end of last week. Assuming no one beats me to it (and it's pretty likely no one will), I plan to look at this once I finish code reviews and testing for the project I'm currently doing code review on. Hopefully that means early this coming week.

If someone does manage to beat me to it, I'll review their code instead.

and this ticket isn't even assigned to anyone.

The "assigned" field in my experience is more often used to claim work being actively worked on than to actually "assign" future work. This tends to avoid problems with "cookie licking".

About as often it's not used at all until the task is closed, at which point it's finally used after the fact to credit the person who did the work.

How up to date are the replicas?

Very. You can see current status at https://tools.wmflabs.org/replag/.

The task description was suggesting I'm "a badly-behaved bot or scraper", though I never requested 50000 (because I knew the limit was 5000) and don't use python-requests/x.y.z. (I think)

Why do you think this task is referring to your behavior? If you never requested 50000 and don't use python, it's much more likely it was someone else.

Change 551909 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/551909

This patch should make it possible to limit the concurrency by user/IP. A change to wmf-config/PoolCounterSettings.php in operations/mediawiki-config would still be needed to set the appropriate limits. I have no idea what those should be.
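For reference, PoolCounter pools are defined through $wgPoolCounterConf. A hedged sketch of what such a wmf-config fragment might look like follows; the pool name and all numbers here are illustrative placeholders, not the values that were actually deployed:

```php
// Hypothetical wmf-config/PoolCounterSettings.php fragment. The pool name
// and the numbers are placeholders, not the deployed configuration.
$wgPoolCounterConf['SpecialContributions'] = [
    'class'    => 'PoolCounter_Client',
    'timeout'  => 15,  // seconds a request may wait for a free slot
    'workers'  => 3,   // max concurrent executions per user/IP key
    'maxqueue' => 25,  // beyond this many waiters, fail fast instead
];
```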

Thanks @Anomie !

Tim suggested 2 as a concurrency limit. I think we can start less conservative than that, though -- let's say 10? It feels pretty hard for that to hurt normal user traffic, while it should still prevent excessive usage.

And, once this is live we can roll back the clamp-at-500-revisions behavior as well.

Change 551909 merged by jenkins-bot:
[mediawiki/core@master] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/551909

Current status:

Sure, I can create the configuration patch.

I assume that there's no need to wait for the new code to be deployed before merging the update to PoolCounterSettings, correct?

I assume that there's no need to wait for the new code to be deployed before merging the update to PoolCounterSettings, correct?

You assume correctly. The only drawback is that you'll not be able to test it until the train runs.

I assume that there's no need to wait for the new code to be deployed before merging the update to PoolCounterSettings, correct?

You assume correctly. The only drawback is that you'll not be able to test it until the train runs.

Of course, we could backport to get the code live sooner; otherwise it won't be live everywhere until the train completes. Which may be prudent, as per the Deployment calendar:

Week of November 25th
We will branch and deploy to group0 on Tuesday the 26th
November 27–29 - No deploys, US Thanksgiving (28-29)

Change 552224 had a related patch set uploaded (by Reedy; owner: Anomie):
[mediawiki/core@wmf/1.35.0-wmf.5] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/552224

Change 552224 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.5] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/552224

Change 552228 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Add PoolCounter configuration for Special:Contributions

https://gerrit.wikimedia.org/r/552228

Change 552228 merged by jenkins-bot:
[operations/mediawiki-config@master] Add PoolCounter configuration for Special:Contributions

https://gerrit.wikimedia.org/r/552228

Reedy closed this task as Resolved.Nov 22 2019, 10:01 AM
Reedy assigned this task to Anomie.

Deployed

We might want to tune the parameters a bit to make it a bit more strict.
We were hit again by this today on enwiki, (interestingly again when one of the special slaves was depooled for maintenance).
The IP of the request was coming from the same place as the one described at T234514#5543966. We had to ban two IPs of that range for a few days till this was worked out (we also emailed their abuse@ but with no answer).
Same query and same patterns that we were seeing at T234514

Headbomb removed a subscriber: Headbomb.Nov 26 2019, 2:13 PM
Reedy added a comment.Nov 26 2019, 2:26 PM

We might want to tune the parameters a bit to make it a bit more strict.
We were hit again by this today on enwiki, (interestingly again when one of the special slaves was depooled for maintenance).
The IP of the request was coming from the same place as the one described at T234514#5543966. We had to ban two IPs of that range for a few days till this was worked out (we also emailed their abuse@ but with no answer).
Same query and same patterns that we were seeing at T234514

How many were they actually doing/triggering?

From what I can see on my buffer, 13 running at the same time (at least when I captured it)

jcrespo reopened this task as Open.EditedNov 27 2019, 12:04 PM

{P9764}

(see image on comment there for full query)

I can see around 75 running at the same time.

And same IP as yesterday and all the previous cases.

Change 553360 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] SpecialContributions: max concurrency 3 (instead of 10)

https://gerrit.wikimedia.org/r/553360

Marostegui added a subtask: Restricted Task.Nov 27 2019, 4:20 PM

Change 553360 merged by CDanis:
[operations/mediawiki-config@master] SpecialContributions: max concurrency 3 (instead of 10)

https://gerrit.wikimedia.org/r/553360

I've reapplied and deployed the "500 limit patch" from T234450#5595510 to wmf.5 as a (hopefully) very temporary measure while we continue to further troubleshoot this issue and tweak the PoolCounter solution.

I've reapplied and deployed the "500 limit patch" from T234450#5595510 to wmf.5 as a (hopefully) very temporary measure while we continue to further troubleshoot this issue and tweak the PoolCounter solution.

Thanks Scott, seems like a sensible thing to do before the US long weekend.

I've become a bit skeptical of the PoolCounter solution here being sufficient, after learning that there's not a direct connection between a Mediawiki appserver giving up on a request (e.g. due to timeout) and the cancellation of the corresponding MySQL query (which can continue to execute for some time afterwards). At the very least, the concurrency limits we set in PoolCounter will have to be artificially tight as a result. I'm guessing that fixing the underlying issue here is probably difficult, but I think it does seem worthy of some CPT time in the future.

Thanks Scott, seems like a sensible thing to do before the US long weekend.

That was my thinking as well.

I've become a bit skeptical of the PoolCounter solution here being sufficient, after learning that there's not a direct connection between a Mediawiki appserver giving up on a request (e.g. due to timeout) and the cancellation of the corresponding MySQL query (which can continue to execute for some time afterwards). At the very least, the concurrency limits we set in PoolCounter will have to be artificially tight as a result. I'm guessing that fixing the underlying issue here is probably difficult, but I think it does seem worthy of some CPT time in the future.

I'm happy to remove the security patch (or really anybody with deployment can) if there's a need to production-test PoolCounter config or other/additional solutions this week.

Anomie added a comment.Dec 2 2019, 4:55 PM

I'm guessing that fixing the underlying issue here [that there's not a direct connection between a Mediawiki appserver giving up on a request (e.g. due to timeout) and the cancellation of the corresponding MySQL query] is probably difficult, but I think it does seem worthy of some CPT time in the future.

I'd guess this would take a patch in MySQL/MariaDB to kill the query on client disconnect. At a quick glance through the docs, there doesn't seem to be any option we could set for this in PHP-land, and https://dev.mysql.com/worklog/task/?id=1242 being open (and unassigned) suggests the feature doesn't exist yet in MySQL.

T149421 seems relevant as an existing request for similar work. There's also T195792 which requests an ability to pass an app-level timeout through to MySQL/MariaDB, which could somewhat mitigate the problem[1] if MediaWiki actually knows an appropriate timeout to specify.

[1]: It's unlikely MediaWiki would be able to inform the DB of the actual time it has remaining, but even a static limit passed through to the DB would be an improvement.

I know ratelimits are applied to stuff like page previews. Can't you also apply a ratelimit (say, 6 per minute per IP? is that enough to stop things from breaking?) to these 5000-queries?

I've only just now discovered that >500 searches on user contributions is now limited to 500. I find it very useful from time to time to be able to bring up a person's contributions beyond 500 per page.

Please, PLEASE restore this ability. If you want to throttle it to X amount of Y period, fine, but please restore this ability. Thank you.

Reedy added a comment.Wed, Jan 1, 5:24 PM

I've only just now discovered that >500 searches on user contributions is now limited to 500. I find it very useful from time to time to be able to bring up a person's contributions beyond 500 per page.
Please, PLEASE restore this ability. If you want to throttle it to X amount of Y period, fine, but please restore this ability. Thank you.

As above, it was tried and didn't work as expected; hence the task is still open.