
Special:Contributions requests with a high &limit= caused excessive database load
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error message
A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 

Function: IndexPager::buildQueryInfo (contributions page unfiltered)
Error: 0
Impact

~1400 errors an hour in logstash; this is by far the main production error right now.

Notes

A request with a limit of 50000(!) (which MW moderated down to 5000) and a time range of 2005–2017 suggests this is a query from a badly-behaved bot or scraper. The UA used by this client was an unchanged-from-defaults python-requests/x.y.z. In the future we should find time to heavily ratelimit or block default User-Agents, per our own (unenforced) policy.
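To make the policy idea above concrete, a blocking or ratelimiting rule would need a way to recognize unchanged client-library defaults. This is an illustrative sketch only (the prefix list and function name are assumptions, not an existing MediaWiki or edge-layer API):

```python
# Illustrative only: a few well-known default User-Agent prefixes that a
# ratelimit/block rule per the (unenforced) policy above might target.
DEFAULT_UA_PREFIXES = (
    "python-requests/",
    "curl/",
    "Wget/",
    "Go-http-client/",
    "Java/",
)

def is_default_user_agent(ua):
    """Return True if `ua` is empty or looks like an unchanged library default."""
    return not ua or ua.startswith(DEFAULT_UA_PREFIXES)
```

In practice such a check would live at the traffic layer (e.g. in edge cache config), not in application code, so this only sketches the classification step.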

Details

Request ID
AW2NGujwx3rdj6D82Ozi
Request URL
en.wikipedia.org/w/index.php?limit=50000&title=Special%3AContributions&contribs=user&target=AxG&namespace=&tagfilter=&start=2005-01-01&end=2017-06-30
Stack Trace
#0 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1573): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(NULL, integer, string, string)
#1 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1152): Wikimedia\Rdbms\Database->reportQueryError(NULL, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/Database.php(1806): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#4 /srv/mediawiki/php-1.34.0-wmf.24/includes/libs/rdbms/database/DBConnRef.php(315): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#5 /srv/mediawiki/php-1.34.0-wmf.24/includes/specials/pagers/ContribsPager.php(205): Wikimedia\Rdbms\DBConnRef->select(array, array, array, string, array, array)
#6 /srv/mediawiki/php-1.34.0-wmf.24/includes/pager/IndexPager.php(263): ContribsPager->reallyDoQuery(string, integer, boolean)
#7 /srv/mediawiki/php-1.34.0-wmf.24/includes/pager/IndexPager.php(609): IndexPager->doQuery()
#8 /srv/mediawiki/php-1.34.0-wmf.24/includes/specials/SpecialContributions.php(226): IndexPager->getNumRows()
#9 /srv/mediawiki/php-1.34.0-wmf.24/includes/specialpage/SpecialPage.php(575): SpecialContributions->execute(NULL)
#10 /srv/mediawiki/php-1.34.0-wmf.24/includes/specialpage/SpecialPageFactory.php(611): SpecialPage->run(NULL)
#11 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(296): MediaWiki\Special\SpecialPageFactory->executePath(Title, RequestContext)
#12 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(896): MediaWiki->performRequest()
#13 /srv/mediawiki/php-1.34.0-wmf.24/includes/MediaWiki.php(527): MediaWiki->main()
#14 /srv/mediawiki/php-1.34.0-wmf.24/index.php(44): MediaWiki->run()
#15 /srv/mediawiki/w/index.php(3): require(string)
#16 {main}

Event Timeline

Reedy added a comment.Nov 26 2019, 2:26 PM

We might want to tune the parameters a bit to make it a bit more strict.
We were hit again by this today on enwiki (interestingly, again when one of the special slaves was depooled for maintenance).
The IP of the request was coming from the same place as the one described at T234514#5543966. We had to ban two IPs of that range for a few days till this was worked out (we also emailed their abuse@ but got no answer).
Same query and same patterns that we were seeing at T234514.

How many were they actually doing/triggering?

From what I can see on my buffer, 13 running at the same time (at least when I captured it)

jcrespo reopened this task as Open.EditedNov 27 2019, 12:04 PM

{P9764}

(see image on comment there for full query)

I can see around 75 running at the same time.

And same IP as yesterday and all the previous cases.

Change 553360 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] SpecialContributions: max concurrency 3 (instead of 10)

https://gerrit.wikimedia.org/r/553360

Marostegui added a subtask: Restricted Task.Nov 27 2019, 4:20 PM

Change 553360 merged by CDanis:
[operations/mediawiki-config@master] SpecialContributions: max concurrency 3 (instead of 10)

https://gerrit.wikimedia.org/r/553360

I've reapplied and deployed the "500 limit patch" from T234450#5595510 to wmf.5 as a (hopefully) very temporary measure while we continue to further troubleshoot this issue and tweak the PoolCounter solution.

I've reapplied and deployed the "500 limit patch" from T234450#5595510 to wmf.5 as a (hopefully) very temporary measure while we continue to further troubleshoot this issue and tweak the PoolCounter solution.

Thanks Scott, seems like a sensible thing to do before the US long weekend.

I've become a bit skeptical of the PoolCounter solution here being sufficient, after learning that there's not a direct connection between a Mediawiki appserver giving up on a request (e.g. due to timeout) and the cancellation of the corresponding MySQL query (which can continue to execute for some time afterwards). At the very least, the concurrency limits we set in PoolCounter will have to be artificially tight as a result. I'm guessing that fixing the underlying issue here is probably difficult, but I think it does seem worthy of some CPT time in the future.

Thanks Scott, seems like a sensible thing to do before the US long weekend.

That was my thinking as well.

I've become a bit skeptical of the PoolCounter solution here being sufficient, after learning that there's not a direct connection between a Mediawiki appserver giving up on a request (e.g. due to timeout) and the cancellation of the corresponding MySQL query (which can continue to execute for some time afterwards). At the very least, the concurrency limits we set in PoolCounter will have to be artificially tight as a result. I'm guessing that fixing the underlying issue here is probably difficult, but I think it does seem worthy of some CPT time in the future.

I'm happy to remove the security patch (or really anybody with deployment can) if there's a need to production-test PoolCounter config or other/additional solutions this week.

Anomie added a comment.Dec 2 2019, 4:55 PM

I'm guessing that fixing the underlying issue here [that there's not a direct connection between a Mediawiki appserver giving up on a request (e.g. due to timeout) and the cancellation of the corresponding MySQL query] is probably difficult, but I think it does seem worthy of some CPT time in the future.

I'd guess this would take a patch in MySQL/MariaDB to kill the query on client disconnect. At a quick glance through the docs, there doesn't seem to be any option we could set for this in PHP-land, and https://dev.mysql.com/worklog/task/?id=1242 being open (and unassigned) suggests the feature doesn't exist yet in MySQL.

T149421 seems relevant as an existing request for similar work. There's also T195792 which requests an ability to pass an app-level timeout through to MySQL/MariaDB, which could somewhat mitigate the problem[1] if MediaWiki actually knows an appropriate timeout to specify.

[1]: It's unlikely MediaWiki would be able to inform the DB of the actual time it has remaining, but even a static limit passed through to the DB would be an improvement.

I know ratelimits are applied to stuff like page previews. Can't you also apply a ratelimit (say, 6 per minute per IP? is that enough to stop things from breaking?) to these 5000-queries?
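The "6 per minute per IP" suggestion above amounts to a sliding-window counter keyed by IP. A minimal in-process sketch of that accounting follows; this is purely illustrative and is not how MediaWiki's actual ratelimiter ($wgRateLimits) is implemented, which uses a shared object cache rather than in-process state:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Toy sketch of the per-IP ratelimit suggested above
    (e.g. 6 expensive Special:Contributions requests per minute per IP)."""

    def __init__(self, max_requests=6, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # caller serves a throttling error
        q.append(now)
        return True
```

A production version would need shared state across app servers (memcached or similar), since a per-process counter is trivially bypassed by load balancing.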

I've only just now discovered that >500 searches on user contributions is now limited to 500. I find it very useful from time to time to be able to bring up a person's contributions beyond 500 per page.

Please, PLEASE restore this ability. If you want to throttle it to X amount of Y period, fine, but please restore this ability. Thank you.

Reedy added a comment.Jan 1 2020, 5:24 PM

I've only just now discovered that >500 searches on user contributions is now limited to 500. I find it very useful from time to time to be able to bring up a person's contributions beyond 500 per page.

Please, PLEASE restore this ability. If you want to throttle it to X amount of Y period, fine, but please restore this ability. Thank you.

As above, it was tried and didn’t work as expected. Hence the task still being open.

daniel removed Anomie as the assignee of this task.Apr 14 2020, 8:49 PM
daniel moved this task from Backlog to Later on the Platform Team Workboards (Clinic Duty Team) board.
daniel added a subscriber: daniel.Apr 15 2020, 4:30 PM

Is there anything left for CPT to do here? Or can this even be closed?

Is there anything left for CPT to do here? Or can this even be closed?

Sadly, the patch from T234450#5595510 (Oct 22 2019, 9:39 AM) that was supposed to be temporary is still on wmf.28:

* 948afc7ecf (HEAD -> wmf/1.35.0-wmf.28) SECURITY: ContribsPager: Max limit 500, not 5000

From various comments above, I don't believe the underlying issue was ever resolved, so as @Reedy noted:

As above, it was tried and didn’t work as expected. Hence the task still being open.

If Platform Engineering doesn't want to work on this, that's fine, though I'm not certain who else would at this point.

AlexisJazz added a comment.EditedApr 17 2020, 11:45 AM

@Jdforrester-WMF @Marostegui @tstarling
I wonder if it has been considered to create functionality to temporarily return a link to the dumps instead of the actual result to any given combination of IP and user agent?

Because assuming good faith, it's just a scraper that's operated by someone who is unaware of the existence of dumps. If this idea isn't completely stupid I can open a new task.

SD0001 removed a subscriber: SD0001.Apr 17 2020, 11:48 AM

If I understand correctly, the issue here is long running DB queries, in particular queries that continue to run after the app server has already given up on the request. Could we not make use of MariaDB's max_statement_time feature to ensure these queries get aborted? The user would still be hit with an ugly DB level error, but at least we'd be freeing up resources on the DB servers... This feature seems pretty nice, since limits can be set not just per connection, but even per query. I note that T195792: Create Mediawiki DB abstraction for individual query timeouts has been idling around for a while. It used to be waiting for us to upgrade to Maria 10.1. Seems like that was completed about a year ago.

If Core Platform Team doesn't want to work on this, that's fine, though I'm not certain who else would at this point.

The problem is rather that the solution is unclear at this point...

If I understand correctly, the issue here is long running DB queries, in particular queries that continue to run after the app server has already given up on the request. Could we not make use of MariaDB's max_statement_time feature to ensure these queries get aborted? The user would still be hit with an ugly DB level error, but at least we'd be freeing up resources on the DB servers... This feature seems pretty nice, since limits can be set not just per connection, but even per query. I note that T195792: Create Mediawiki DB abstraction for individual query timeouts has been idling around for a while. It used to be waiting for us to upgrade to Maria 10.1. Seems like that was completed about a year ago.

That doesn't address all of the concerns of this ticket, but I do think it would be a big help -- particularly the concern of database servers being overwhelmed because of 'runaway' queries that take disproportionate amounts of DB time, even after hitting timeouts at the traffic or appserver layers. The fact that 'runaway' queries can exist means that ratelimits or even concurrency limits at the traffic or appserver layers are never fully sufficient to shield the DBs.

Ideally, Mediawiki would have an idea of how much time it has remaining to finish serving a request, and always set a max_statement_time that was strictly less than that.
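The idea above — derive a per-query max_statement_time from the request's remaining time budget — can be sketched as a small helper. This is a hypothetical illustration, not MediaWiki code; the function name, the 10% safety margin, and the floor value are all assumptions. Note also that MariaDB's `SET STATEMENT max_statement_time=N` takes seconds, while MySQL's equivalent takes milliseconds, so a real implementation would have to branch on the server type:

```python
import time

def with_statement_timeout(sql, request_deadline, now=None, floor=0.1):
    """Wrap a read query so MariaDB aborts it before the request deadline.

    Hypothetical helper illustrating the idea above. `request_deadline`
    and `now` are monotonic-clock timestamps; `floor` is the minimum
    budget (seconds) below which we fail fast instead of querying.
    """
    now = time.monotonic() if now is None else now
    remaining = request_deadline - now
    if remaining <= floor:
        raise TimeoutError("request time budget already exhausted")
    # Leave a margin so MW regains control before the HTTP timeout fires.
    budget = max(floor, remaining * 0.9)
    return "SET STATEMENT max_statement_time=%.3f FOR %s" % (budget, sql)
```

The key property is that the DB-side limit is always strictly less than the time the app server is willing to wait, so a query can never outlive its request.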

The fact that 'runaway' queries can exist means that ratelimits or even concurrency limits at the traffic or appserver layers are never fully sufficient to shield the DBs.

It would probably already help a lot to simply set the max query execution time to be no more than the max total request execution time. At least for replica connections; we'll want to be more careful about aborting write operations.

The fact that 'runaway' queries can exist means that ratelimits or even concurrency limits at the traffic or appserver layers are never fully sufficient to shield the DBs.

It would probably already help a lot to simply set the max query execution time to be no more than the max total request execution time. At least for replica connections; we'll want to be more careful about aborting write operations.

+100 :)

(To add a bit more context: that would address a lot of the DoS vector issue of this ticket, which is the part I think merits the High priority.)

If I understand correctly, the issue here is long running DB queries, in particular queries that continue to run after the app server has already given up on the request. Could we not make use of MariaDB's max_statement_time feature to ensure these queries get aborted? The user would still be hit with an ugly DB level error, but at least we'd be freeing up resources on the DB servers... This feature seems pretty nice, since limits can be set not just per connection, but even per query. I note that T195792: Create Mediawiki DB abstraction for individual query timeouts has been idling around for a while. It used to be waiting for us to upgrade to Maria 10.1. Seems like that was completed about a year ago.

Cross posting from T250527#6067594

There is nothing server-side that needs enablement, this needs implementation on the MW (or any other abstraction layer).

wikiuser@db1089(enwiki)> SET STATEMENT max_statement_time=1 FOR select count(*) from revision;
ERROR 1969 (70100): Query execution was interrupted (max_statement_time exceeded)
wikiuser@db1089(enwiki)>
wikiadmin@10.64.32.115(enwiki)> SET STATEMENT max_statement_time=1 FOR select count(*) from revision;
ERROR 1969 (70100): Query execution was interrupted (max_statement_time exceeded)
wikiadmin@10.64.32.115(enwiki)>

@Marostegui while per-query time limits would have to be implemented in MW, MAX_STATEMENT_TIME is also supported per user via the GRANT syntax https://mariadb.com/kb/en/grant/, and even system-wide, see https://mariadb.com/kb/en/server-system-variables/#max_statement_time.

Would it be feasible to use the GRANT method to limit the execution time of statements on production DBs to the maximum request time (or maybe twice that)?

Ah, a potential pitfall for implementing this in MysqlDatabase: "The MySQL version of max_statement_time is defined in milliseconds, not seconds". So MariaDB behaves differently here! We currently do not distinguish between MySQL and MariaDB in MW; doing so would potentially require a config change.

@Marostegui while per-query time limits would have to be implemented in MW, MAX_STATEMENT_TIME is also supported per user via the GRANT syntax https://mariadb.com/kb/en/grant/, and even system-wide, see https://mariadb.com/kb/en/server-system-variables/#max_statement_time.

Would it be feasible to use the GRANT method to limit the execution time of statements on production DBs to the maximum request time (or maybe twice that)?

We could, but I am not sure that trying to solve this via GRANTs is the right approach to correct misbehaving users; I believe we need MW to be the one not allowing that in the first place. We already have the query killers there, which sometimes cannot even cope with all the bursts.
I think we need to explore the idea of allowing SET STATEMENT max_statement_time=X sent via MW.

Again, if this is the only way, we can definitely explore it, but we have to be very careful and choose the limits wisely, I guess it would be the same as the current query killers to start with (60 seconds) or maybe 65 seconds.

I think we need to explore the idea of allowing SET STATEMENT max_statement_time=X sent via MW.

I agree that this would be a nice feature to have, so we can increase or decrease the limit for individual queries.

However, there should be a default that is enforced for queries that do not specify a limit explicitly. That default could be set per user via a GRANT, or it could be set per connection. The latter is already possible via site config: in db-eqiad.php and db-codfw.php, in the 'variables' section of the 'serverTemplate', set 'max_statement_time' to a sensible value. I suppose this could even be made to depend on the request, similar to the way we do this in set-time-limit.php, but I don't know how hard it would be to do this with the way the site config works.

@Marostegui what do you think a sensible timeout would be? Looking at set-time-limit.php, it seems like we currently allow 60 seconds for GET and 200 seconds for POST on regular app servers, 1200 seconds on jobrunners, and 86400 on videoscalers. Perhaps max_statement_time should be that plus 10 seconds? Does that sound reasonable?

jcrespo added a subscriber: Joe.EditedApr 22 2020, 10:25 AM

@Marostegui what do you think what a sensible timeout would be?

We already enforce a 60 second limit on all non-admin MySQL queries FYI, no matter the type. That is not enough to avoid overload- if 10000 new queries happen every second (we can do 30000 with a single instance, and have hundreds of them), a 1-second delay on all queries will mean rejecting connections in 1 second.

If you are asking regarding HTTP timeouts (not mysql), in SRE the right person to ask is probably @Joe.

We already enforce a 60 second limit on all non-admin MySQL queries FYI, no matter the type.

Ah, ok. If that is the case, perhaps max_statement_time is not a solution to this issue at all. I was going by @CDanis's comment:

Ideally, Mediawiki would have an idea of how much time it has remaining to finish serving a request, and always set a max_statement_time that was strictly less than that.

In most cases, this would be very close to the full 60 seconds, since most requests complete in under one second.

We could set a more restrictive limit on queries that are expected to be quick, but we can't really do that for a query that is already causing timeouts, like the one for Special:Contributions. So, would the option to enforce stricter timeouts per query be helpful at all for the problem at hand?

If that is the case, perhaps max_statement_time is not a solution to this issue at all.

Don't get me wrong - it was actually a nice suggestion, and if we end up fixing the grant system for all accounts, it may be a better approach than the current implementation, one we couldn't consider until we were fully on 10.1. But it requires grant modification, which is blocked on T249683.

In any case, this is not the key issue in the scope of this ticket: the main issue is that the "60 second general limit on MySQL" is not enough. Dynamically, MediaWiki needs to know when to end its execution, especially under overload; killing things on MySQL is too late or not useful (e.g. a request that does multiple 5-second MySQL queries, or we just end up piling up new connections; memcache being overloaded, MW doing HTTP requests itself, ...) in this and other cases. I wanted @Joe to enter this discussion (starting Monday/Tuesday) because I believe he is working on a related infra-level solution; however, a software solution should also be considered, as a) a first, friendlier approach and b) one that is aware of the infra solution and reacts accordingly.

Sorry I cannot be of more help, but it is mostly related to HTTP request handling rather than MySQL queries, for which we already have a timeout (of course not very useful in many cases, as it is too late/too far down the stack). Service Ops are the right people to sync with. :-D

If a solution is given - e.g. a global timer is made available - we DBAs will again be helpful by suggesting to add that timeout to the queries (most likely with the option you suggested), too.

I think we need to explore the idea of allowing SET STATEMENT max_statement_time=X sent via MW.

I agree that this would be a nice feature to have, so we can increase or decrease the limit for individual queries.

However, there should be a default that is enforced for queries that do not specify a limit explicitly. That default could be set per user via a GRANT, or it could be set per connection. The latter is already possible via site config: in db-eqiad.php and db-codfw.php, in the 'variables' section of the 'serverTemplate', set 'max_statement_time' to a sensible value. I suppose this could even be made to depend on the request, similar to the way we do this in set-time-limit.php, but I don't know how hard it would be to do this with the way the site config works.

@Marostegui what do you think a sensible timeout would be? Looking at set-time-limit.php, it seems like we currently allow 60 seconds for GET and 200 seconds for POST on regular app servers, 1200 seconds on jobrunners, and 86400 on videoscalers. Perhaps max_statement_time should be that plus 10 seconds? Does that sound reasonable?

As I mentioned at T234450#6075056, the current query killers we have are fired after 60 seconds (for wikiuser user only).
It is not the ideal way of killing things because it relies on events, which have proven not super useful when the host is super overloaded (like in this case) and some queries slip through, but they normally work.
Hence why I suggested we should explore SET STATEMENT max_statement_time=X on MW itself.

Thank you!

kaldari added a subscriber: kaldari.May 6 2020, 8:47 PM

As this seems to be stalled (or at least not making timely progress), we should either revert the "temporary" 500 limit or declare the limit to be permanent. Alternatively, could we create a new user right that allows auto-confirmed users to bypass the 500 limit?

As this seems to be stalled (or at least not making timely progress), we should either revert the "temporary" 500 limit or declare the limit to be permanent. Alternatively, could we create a new user right that allows auto-confirmed users to bypass the 500 limit?

Revert the 500 limit, we should go back to 5000. Then:

I am not sure we should revert - this brought us down a few times during a few weeks in a row.
We did inform the users' ISP, which happened to be always the same set of IPs, but got 0 response from them as far as I can remember.
If we just revert without any measure to prevent this from happening again we are essentially hoping to be lucky, and hope is, unfortunately, not a strategy.
We need MW to be able to rate-limit this.

This issue requires manual DBA intervention to fix (killing queries manually and in a sustained way) - luckily these incidents happened while we had enough coverage (ie: not during a weekend), but otherwise it could bring us down for a longer time.

sbassett added a comment.EditedMay 7 2020, 4:08 PM

I am not sure we should revert - this brought us down a few times during a few weeks in a row ...
If we just revert without any measure to prevent this from happening again we are essentially hoping to be lucky, and hope is, unfortunately, not a strategy.

Completely agree from a security perspective.

daniel added a comment.May 7 2020, 5:05 PM

Sorry I cannot be of more help, but it is mostly related to HTTP request handling

HTTP requests also have a timeout already. GET requests are shot down after 60 seconds, see https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wmf-config/set-time-limit.php. If I understand the Excimer docs correctly, this will only be triggered when control is on the PHP process, not while waiting for something external, such as the DB server. @tstarling will know, he wrote this afaik.

Dynamically, Mediawiki needs to know to end its execution - specially under overload- killing things on MySQL is too late or not useful

I'm not sure I understand. MediaWiki is waiting for the DB, that's why the request is taking so long. While the PHP process is blocked on a read from the DB, it can't do anything.

GET requests are shot down after 60 seconds

That is not enough, sadly. :'-( If all requests to the db took just 1 second (which the 60 second timeout on both the db and the app servers would, of course, allow), we would deny service within 1 second on all dbs, all APIs, all requests. Things are not usually that bad, but it is not uncommon to receive thousands of multiple-second queries simultaneously. We cannot put a hard cap of 0.05 seconds on the db - but we could limit the amount of those we get.

I understand there may be a limitation in the PHP execution model (no straightforward way to share data between threads), which is why I suggested talking to the app server SRE maintainers to come up with a rate limiting service/model/architecture. The only thing I know is that, if implemented on the db, it is too late, because of exactly what you say (very correctly): app servers/dbs will block anyway; we need some limits before calling MySQL. By rate limiting certain code endpoints, we may still have to fail a lot/all of those rate-limited requests, but all others may be able to continue unaffected. The end goal is to not saturate external resources, not to make all requests work.

Of course, I'm not saying it is easy (this is all easier said than done), or that it is a MediaWiki-only solution (a combination of a service and code is likely to be needed 0:-D). For example, people mentioned PoolCounter as a potential option; I cannot say if that is possible or whether a different approach would be needed - I am not familiar with it. I am not very helpful here, but some of my app server SRE colleagues will probably come up with ideas :-D. I hope at least the idea I am trying to convey - "the db layer is too late" - is understood now :-).

So if we're not reverting, would anyone object to the idea mentioned in T234450#6114406 (creating a new user right that allows auto-confirmed users to bypass the 500 limit)?

CDanis added a comment.EditedMay 7 2020, 8:53 PM

I'm going to attempt to restate the history of this bug to see if we all have a shared understanding, not least of all because it's been way too long since I've looked at this and I need to retrace my own steps anyway:

  • We had essentially a full outage on 2019-10-03 as a result of aggressive Special:Contributions scraping from a few IPs (incident status doc)
  • Traffic from this scraper continued recurring and eventually (Oct 22) we set a pagination limit of 500 as a stopgap, in a security patch
  • In November, Anomie authored https://gerrit.wikimedia.org/r/551909 to use PoolCounter to restrict the max concurrency against Special:Contributions from any individual user or anon IP
    • This patch was merged Nov 20 and backported by Reedy on Nov 21
  • On Nov 22, Reedy deployed https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/552228 adding said limit to mediawiki-config, initially allowing 10 in-flight requests from any user/anon-IP
  • On Nov 26, after the above should have been in effect, we saw another load increase against the databases due to the same queries T234450#5693438, with database query killer logs that seem unclear as to the success of the PoolCounter-ization
  • On Nov 27, I limited single-user max-concurrency from 10 down to 3: https://gerrit.wikimedia.org/r/553360

Re: the unclear success of these efforts on Nov 26th's issue, ~75 SQL queries were running simultaneously, and it was theorized that this was due to how MediaWiki abandoning a request doesn't result in MySQL abandoning its queries, which is how the conversation got onto things like MAX_STATEMENT_TIME.

The query killer logs from Nov 26th also show evidence of the PoolCounter limits potentially being exceeded -- there are timestamps with many, many queries killed simultaneously, which suggests they were all running simultaneously. However, it's not clear to me whether or not the timestamp is the time of query-killer-execution or the start timestamp of the query-being-killed itself.

Since setting the concurrency limit to 3 while keeping the pagination limit at 500, I don't believe we've seen a recurrence of an outage due to these queries (@Marostegui please correct me if that's wrong), but that doesn't tell us much.

I think it might make sense to temporarily raise the pagination limit back to 5000, and do some manual testing to check that the concurrency limits are working as expected and whether a limit of 3 is sufficient to acceptably limit DB load. During such testing I'd like to have a DBA around, watching how many Special:Contributions queries are running and making sure we don't overload DBs; someone who understands PoolCounter watching keys there; and someone else sending HTTP queries. (I can at least be the person sending queries, and if I do some homework I could probably be the PoolCounter person as well.)

I find it hard to imagine doing a meaningful test outside of production, but I'd like to be wrong about that.

Other options we have if testing makes it look like this won't be sufficient:

  • Use the slots option of PoolCounter to limit the max concurrency of these queries across all users (see https://www.mediawiki.org/wiki/Manual:$wgPoolCounterConf/en#Configuration_options_shared_by_all_client ).
  • Proceed down the user-right path suggested by Kaldari, so autoconfirmed users can bypass the pagination limit
  • Set a lower-than-usual MAX_STATEMENT_TIME on Special:Contributions queries, which would be a further limit to how much database server load can be consumed by them. This would in practice mean we had a lower 'pagination limit' at times of high DB load, as the SQL query would time out and then MW would serve an error, but that's not an awful scenario.
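The first option above combines the existing per-user/per-IP PoolCounter limit with a global cap across all users. The real PoolCounter is a separate network daemon, so the following is only a toy in-process sketch of the accounting semantics (class and method names are assumptions, not PoolCounter's API):

```python
import threading

class ConcurrencyLimiter:
    """Toy sketch of PoolCounter-like semantics: at most `per_key`
    in-flight requests per user/anon-IP, and optionally at most `total`
    across all keys (akin to the 'slots' option mentioned above)."""

    def __init__(self, per_key=3, total=None):
        self.per_key = per_key
        self.total = total
        self.lock = threading.Lock()
        self.counts = {}      # key -> in-flight count for that key
        self.in_flight = 0    # in-flight count across all keys

    def acquire(self, key):
        with self.lock:
            if self.counts.get(key, 0) >= self.per_key:
                return False  # caller serves an error or queues
            if self.total is not None and self.in_flight >= self.total:
                return False
            self.counts[key] = self.counts.get(key, 0) + 1
            self.in_flight += 1
            return True

    def release(self, key):
        with self.lock:
            self.counts[key] -= 1
            self.in_flight -= 1
```

The global cap is what bounds total DB load even when many distinct IPs each stay under their per-key limit, which the per-user limit alone cannot do.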

Thanks for the summary @CDanis, that makes things much clearer!

I'm taking this off the Clinic Duty board, since there isn't anything immediately actionable here. I'm flagging this for further discussion in CPT instead.

Since setting the concurrency limit to 3 while keeping the pagination limit at 500, I don't believe we've seen a recurrence of an outage due to these queries (@Marostegui please correct me if that's wrong), but that doesn't tell us much.

Correct - I haven't seen this happening again. Whether it is pure luck or they actually tried and the measures taken stopped them is impossible to know, I guess.

I think it might make sense to temporarily raise the pagination limit back to 5000, and do some manual testing to make sure the concurrency limits are working as expected. During such testing I'd like to have around a DBA, watching for how many Special:Contributions queries are running and also making sure we don't overload DBs, someone who understands PoolCounter watching keys there, and someone else sending HTTP queries. (I can at least be the person sending queries, and if I do some homework I could probably be the PoolCounter person as well.)

I can be around and help. But once the tests are done, I would like to make sure we revert to the current fix, as that's proven to be effective (or maybe the "attacks" didn't happen again - hard to know).

AlexisJazz added a comment.EditedMay 8 2020, 1:30 PM

I am not sure we should revert - this brought us down a few times during a few weeks in a row.
We did inform the users ISP which happened to be always the same set of IPs, but got 0 response from them as far as I can remember.

No, don't bother contacting the ISP. That is assuming bad faith. Make it so that for that IP range the server returns an information page that explains the dumps and links to them instead of the requested Special:Contributions. I think that's your best chance to reach them. Hopefully it's just a poor soul who is trying to mirror Wikipedia in the worst way possible and they will turn to the dumps once they learn about them.

That doesn't mean the underlying issue won't need fixing, but it may help to make the problem less urgent.

CDanis added a comment.EditedMay 8 2020, 5:27 PM

I think it might make sense to temporarily raise the pagination limit back to 5000 and do some manual testing to make sure the concurrency limits are working as expected. During such testing I'd like to have a DBA around, watching how many Special:Contributions queries are running and making sure we don't overload the DBs; someone who understands PoolCounter watching the keys there; and someone else sending HTTP queries. (I can at least be the person sending queries, and if I do some homework I could probably be the PoolCounter person as well.)

I did my PoolCounter homework, then performed one part of this test: checking that the concurrency limit was enforced. Based on both HTTP response timings and packet dumps of PoolCounter traffic, it is.
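For anyone wanting to repeat the timing side of this check, the idea is simple: fire more simultaneous requests than the configured worker limit and look for a step in the latency distribution, since requests beyond the cap have to queue behind the first batch. A rough sketch follows; the URL, request count, and heuristic threshold are illustrative assumptions, not the actual harness used for this test.

```python
import time
import urllib.request

# Hypothetical target; any Special:Contributions URL would do.
URL = ("https://en.wikipedia.org/w/index.php"
       "?title=Special:Contributions&target=Example&limit=500")

def timed_fetch(url: str) -> float:
    """Fetch the URL once and return the wall-clock latency in seconds."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(url, timeout=60).read()
    except Exception:
        pass  # a 429/5xx response still tells us something about queuing
    return time.monotonic() - start

def looks_throttled(latencies: list[float], limit: int) -> bool:
    """Heuristic: if concurrency is capped at `limit`, the slowest of N
    simultaneous requests should take roughly ceil(N / limit) times as
    long as the fastest, because later batches wait for earlier ones."""
    fastest, slowest = min(latencies), max(latencies)
    expected_batches = -(-len(latencies) // limit)  # ceiling division
    return expected_batches > 1 and slowest >= fastest * (expected_batches - 0.5)

# Example driver (hits the live site, so run manually):
# import concurrent.futures
# with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
#     latencies = list(pool.map(timed_fetch, [URL] * 8))
# print(sorted(round(t, 2) for t in latencies))
# print("throttling observed:", looks_throttled(latencies, limit=3))
```

The heuristic deliberately uses a slack factor (`- 0.5`) so that ordinary latency jitter doesn't produce false positives; packet dumps remain the more authoritative evidence.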

(There was also evidence of some queuing, likely at the traffic layer, but in a way that would decrease expected backend load, not increase it.)

I'd still like to understand the potential impact on production with the pagination limit relaxed to 5000. So maybe @Marostegui and I could (next week?) do the following:

  • Limit per-user Contributions concurrency to 2 (instead of 3)
  • Relax pagination max to 5000
  • Send synthetic traffic for a few minutes and see how bad DB load gets
  • If we can't break things, leave the above as-is?
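For the "send synthetic traffic" step, something like the following minimal load generator could be used: it keeps a fixed number of requests in flight for a set duration and tallies the status codes seen. The URL, concurrency, and duration would be whatever the test window calls for; the `fetch` parameter is injectable so the harness can be dry-run without touching production.

```python
import collections
import concurrent.futures
import time
import urllib.error
import urllib.request

def fetch_status(url: str) -> int:
    """Return the HTTP status code for one request (0 on connection error)."""
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except Exception:
        return 0

def run_load(url: str, concurrency: int, duration_s: float,
             fetch=fetch_status) -> collections.Counter:
    """Keep `concurrency` requests in flight for `duration_s` seconds
    and tally the status codes observed."""
    counts: collections.Counter = collections.Counter()
    deadline = time.monotonic() + duration_s
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(fetch, url) for _ in range(concurrency)}
        while futures:
            done, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
            for f in done:
                counts[f.result()] += 1
            if time.monotonic() < deadline:
                # Top back up to the target concurrency.
                futures |= {pool.submit(fetch, url)
                            for _ in range(concurrency - len(futures))}
    return counts

# Example (against a real target, run only during the agreed test window):
# print(run_load("https://en.wikipedia.org/w/index.php?title=Special:Contributions"
#                "&target=Example&limit=5000", concurrency=8, duration_s=120))
```

A spike of 429s (or whatever PoolCounter's rejection surfaces as) alongside healthy DB graphs would be the desired outcome here.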
Reedy added a comment.May 8 2020, 5:58 PM

I did my PoolCounter homework, then performed one part of this test: checking that the concurrency limit was enforced. Based on both response timings and packet dumps of PoolCounter traffic, it is.

This is obviously good to know.

  • Limit per-user Contributions concurrency to 2 (instead of 3)

A limit of 3 being "too high" does seem a reasonable answer. It reduced the number of problematic requests, just not quite enough to prevent issues.

eprodromou added a subscriber: eprodromou.

All right, looks like this is a job for Tech Planning in CPT. Hopefully the work we're doing on the API Gateway will help with this somewhat, but until then we need to come up with a better solution.

Change 601361 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/mediawiki-config@master] limit per-user Special:Contributions concurrency to 2

https://gerrit.wikimedia.org/r/601361

CDanis added a comment.EditedJun 1 2020, 3:46 PM
  • Limit per-user Contributions concurrency to 2 (instead of 3)
  • Relax pagination max to 5000
  • Send synthetic traffic for a few minutes and see how bad DB load gets
  • If we can't break things, leave the above as-is?

@Marostegui and I will be doing this tomorrow 13:00 UTC, window reserved on deployment calendar

Change 601361 merged by jenkins-bot:
[operations/mediawiki-config@master] limit per-user Special:Contributions concurrency to 2

https://gerrit.wikimedia.org/r/601361

Change 601728 had a related patch set uploaded (by Reedy; owner: Anomie):
[mediawiki/core@REL1_34] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/601728

Change 601729 had a related patch set uploaded (by Reedy; owner: Anomie):
[mediawiki/core@REL1_33] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/601729

Change 601731 had a related patch set uploaded (by Reedy; owner: Anomie):
[mediawiki/core@REL1_31] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/601731

Jc86035 removed a subscriber: Jc86035.Jun 2 2020, 1:31 PM
CDanis lowered the priority of this task from High to Low.Jun 2 2020, 2:03 PM

The clamping to limit=500 on Special:Contributions queries has been removed.

We were unable to reproduce database issues by performing some simplistic synthetic load testing.

Hopefully this doesn't recur.

Change 601731 merged by jenkins-bot:
[mediawiki/core@REL1_31] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/601731

Change 601728 merged by jenkins-bot:
[mediawiki/core@REL1_34] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/601728

Change 601729 merged by jenkins-bot:
[mediawiki/core@REL1_33] SpecialContributions: Use PoolCounter to limit concurrency

https://gerrit.wikimedia.org/r/601729

@Bugreporter I don't think that's a duplicate. This task has a misleading title; it isn't about any WMFTimeoutException on Special:Contributions, but about a particular kind from months ago that was causing widespread database issues because of excessive traffic from a scraper.

CDanis renamed this task from Some Special:Contributions requests cause "Error: 0" from database or WMFTimeoutException to Special:Contributions requests with a high &limit= caused excessive database load.Jul 27 2020, 12:56 PM
CDanis closed this task as Resolved.
CDanis claimed this task.

We haven't seen a recurrence of this, so I am optimistically resolving.

Aklapper removed a subscriber: Anomie.Oct 16 2020, 5:43 PM