
Reduce max execution time of interactive queries or a better detection and killing of bad query patterns
Closed, ResolvedPublic

Description

Right now, interactive queries (those run as a direct result of an HTTP request) have an execution limit of 300 seconds, implemented with a server-side event. However, despite a max_connections limit of 10000 (that is the number of allowed connections; the maximum concurrency is 64 threads), even 30-second to 1-minute queries (T160914) can overload the server if sent fast enough. This affects only interactive queries - either HTTP browser requests or API queries - not terbium, maintenance, dumps, vslow or other queries.

The max execution limit is a protection, but it kills legitimate queries in order to avoid overload, so it is not something to apply freely.
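As a rough illustration of what such a server-side watchdog does (a simplified sketch, not the actual events_coredb_slave.sql code; the wikiuser account and the 300-second threshold come from this task, everything else is assumed):

-- Sketch: list interactive queries that have exceeded the current limit.
-- The real event also records an event_log entry and issues the KILL;
-- here only the candidate list is built.
SELECT ID,
       TIME AS seconds_running,
       CONCAT('KILL ', ID, '; ', LEFT(INFO, 100)) AS kill_log_entry
  FROM information_schema.PROCESSLIST
 WHERE USER = 'wikiuser'      -- interactive (web/API) traffic only
   AND COMMAND = 'Query'
   AND TIME > 300             -- current interactive limit
 ORDER BY TIME DESC;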

Some things that can be done:

  • Reduce interactive execution time to 1 minute or 30 seconds
  • Reduce the interactive execution time in inverse proportion to the number of connected threads
  • Implement pattern matching, e.g. killing queries that are long-running (>1 second) and similar in pattern - for example, sharing the same digest (needs a complex implementation; see the sketch after this list)
  • Increase the maximum number of concurrent threads (this can bring down the server if done incorrectly, and above 64 threads throughput does not really improve)
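A rough sketch of the pattern-matching option (hypothetical, not deployed code): MediaWiki tags every query with a /* Module::method */ comment (visible in the kill log further down this task), which can serve as a cheap stand-in for a digest. The 1-second and 10-copy thresholds are illustrative only.

-- Group long-running interactive queries by their MediaWiki comment and
-- flag patterns with many concurrent copies; a kill event built on this
-- would then terminate the members (or the oldest members) of each group.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(INFO, '*/', 1), '/*', -1) AS pattern,
       COUNT(*)  AS running_copies,
       MAX(TIME) AS slowest_seconds
  FROM information_schema.PROCESSLIST
 WHERE USER = 'wikiuser'
   AND COMMAND = 'Query'
   AND TIME > 1
 GROUP BY pattern
 HAVING COUNT(*) > 10
 ORDER BY running_copies DESC;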

Event Timeline

jcrespo created this task. Mar 21 2017, 10:02 AM
Restricted Application added a subscriber: Aklapper. Mar 21 2017, 10:02 AM
Anomie added a subscriber: Anomie. Mar 21 2017, 3:21 PM

Some data: Assuming I did the hive query correctly, there seem to have been 10419936343 API requests so far this month. 10419765923 (99.9984%) completed in under 30 seconds. 10419899851 (99.9996%) in under 60 seconds. That leaves 36492 that took 60+ seconds. 364 took 300+ seconds.

All these times are wall times spent executing the action and printing the result, both in PHP and waiting for the database (potentially for multiple queries). They don't include time for request setup or shutdown, and probably don't include time spent actually sending the output data to the client.

That is really nice data! I was thinking of maybe implementing a dynamic timeout: if concurrency is low, allow most queries; if it is higher, be stricter. When we reach alarming levels (5000 connections, for example), kill as fast as possible in descending order of execution time. This is not really a fix for the underlying issues, but a patch to ensure database reliability, moving the problem elsewhere.
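A hedged sketch of that dynamic-timeout idea (all numbers are illustrative; only the 30/300-second bounds and the 5000-connection alarm level come from this discussion):

-- Derive the kill threshold from current concurrency, clamped to 30-300s,
-- and collapse to "kill almost everything, longest first" near 5000 connections.
SET @threads := (SELECT COUNT(*) FROM information_schema.PROCESSLIST);
SET @threshold := IF(@threads >= 5000,
                     1,
                     GREATEST(30, LEAST(300, 300 - FLOOR(@threads / 20))));

SELECT ID, TIME, LEFT(INFO, 80)
  FROM information_schema.PROCESSLIST
 WHERE USER = 'wikiuser' AND COMMAND = 'Query' AND TIME > @threshold
 ORDER BY TIME DESC;    -- kill candidates in descending time, as proposed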

On the other hand, if we set a lower threshold, e.g. 30 seconds, maybe we would force people to check logs more frequently and maintain the code better? Not sure about that. Needs thinking.

One thing that could be done is for MediaWiki to log those killed queries clearly. I think right now they are logged as "lost contact with server", which may not be fully understood - it can be read as a mistake rather than a deliberate kill.

That is really nice data! I was thinking of maybe implementing a dynamic timeout: if concurrency is low, allow most queries; if it is higher, be stricter. When we reach alarming levels (5000 connections, for example), kill as fast as possible in descending order of execution time. This is not really a fix for the underlying issues, but a patch to ensure database reliability, moving the problem elsewhere.

I like that, but we'd need to play quite a bit with the thresholds across the different hardware, right? As in, the most powerful hardware would have different timeouts and killing policies. Maybe that can get too complicated.

On the other hand, if we set a lower threshold, e.g. 30 seconds, maybe we would force people to check logs more frequently and maintain the code better? Not sure about that. Needs thinking.

I would go for this option if possible, giving a good heads-up so people are aware that this is coming.

One thing that could be done is for MediaWiki to log those killed queries clearly. I think right now they are logged as "lost contact with server", which may not be fully understood - it can be read as a mistake rather than a deliberate kill.

Agreed!

Maybe that can get too complicated.

Not necessarily, it is just a few extra queries on:

https://phabricator.wikimedia.org/diffusion/OSOF/browse/master/dbtools/events_coredb_slave.sql;85b5637931103ef6d9cffcccdc33e24cd90c6197$80

Alternatively, we get rid of the events and puppetize pt-kill properly.

Alternatively, we get rid of the events and puppetize pt-kill properly.

You already know my opinion about events! :-)

Tgr added a subscriber: Tgr. Mar 21 2017, 8:27 PM

Is it a good idea to attempt the handling of such problems in the DB layer, where not a lot of origin information is available? If the problem is a single non-malicious user sending an unreasonable amount of requests, that can probably be handled better in Varnish or MediaWiki where the requests can still be identified (based on IP/UA) as coming from the user. Basically just throttle requests that take long, e.g. limit the number of parallel connections a single (external) IP is allowed to hold. There are existing mechanisms for that already, such as the ping limiter and the pool counter in MediaWiki (used for things like limiting login attempts and expensive image thumbnailing requests).

Given that we limited API requests to 60s execution time until HHVM broke timeouts temporarily (see T97192), reducing the database query wall time limit to 60s for API requests seems like a good idea in any case. There is little point in continuing to process database queries for a request that was timed out already, and based on @Anomie's data only a tiny fraction of requests are exceeding even 30s.

I also agree very much with @Tgr's point that rate limiting is better handled at the app layer, as more information and better means of responding are available there.

Is it a good idea to attempt the handling of such problems in the DB layer

No, of course it is not. But this is the layer I control and can do something about, and the layers on top have so far made no effort to solve this globally (some of these tasks have been open for 3 years already: T64615 T119919 T118186). As of the writing of these lines I can see 47 interactive queries taking over 300 seconds in the last hour, and the only reason they take only 300-310 seconds and don't overwhelm the db is the existing watchdogs killing queries. When those watchdogs have failed to execute by mistake, the database has been DoSed by normal queries.

I showed you my code here T160984#3118261 and my plans to avoid future outages. Please show me your MediaWiki CRs and/or infrastructure puppet changes for "There are existing mechanisms for that already, such as the ping limiter and the pool counter in MediaWiki (used for things like limiting login attempts and expensive image thumbnailing requests)." Meaning: send a counter-proposal and merge it in production. Once the code guarantees it will not DoS the databases, I will be the first person to disable such watchdogs (which have prevented outages for at least 5 years already, and are probably the reason why service was still up during https://wikitech.wikimedia.org/wiki/Incident_documentation/20170320-Special_AllPages).

I see comments on how things "should" be done differently, here and on T160985#3119182, but I do not see the code backing those "easier and better" alternative proposals. I am tired of waiting for someone else to do it better. Also, this seems like a good incentive to accelerate the work on those tickets.

jcrespo claimed this task. Mar 28 2017, 11:20 AM
jcrespo triaged this task as High priority.
jcrespo moved this task from Triage to In progress on the DBA board.

Another close-to-outage incident with a long-running query at T159319#3144183. This confirms I need to do this ASAP.

364 took 300+ seconds

That is probably biased toward queries that complete successfully. 300-second queries are killed; sometimes they go over 300 seconds, as it takes a bit more than that for the kill to happen. So the real count is probably higher: I can see roughly that number of killed queries (maybe more) per day at https://logstash.wikimedia.org/goto/b04968f2ef9691a6257d01eef2d47440 and similar stats on tendril.

Change 346559 had a related patch set uploaded (by Jcrespo):
[operations/software@master] Kill long running queries longer with shorter terms:

https://gerrit.wikimedia.org/r/346559

Anomie added a comment. Apr 5 2017, 6:20 PM

I ran the query of API request lengths again, for all of March this time. 15682708428 total queries.

Time (seconds)    # of queries    % of queries
1+                99269308        0.632986%
10+               1121374         0.007150%
30+               209900          0.001338%
60+               49612           0.000316%
120+              19850           0.000127%
300+              761             0.000005%

364 took 300+ seconds

That is probably biased toward queries that complete successfully. 300-second queries are killed; sometimes they go over 300 seconds, as it takes a bit more than that for the kill to happen.

It looks like the errors in the dashboard you link to are occurring after less than 300 seconds. For example, https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.04.05/mediawiki?id=AVs9HeoGlL3ktRIGOVY1 is associated with an API request that was logged at only 120.007 seconds total. It also includes requests that aren't coming via the API.

Glancing through the data, there seem to be an unusually large number of requests logged as taking 30–35.999 seconds, a small bump right at 60 seconds, and a very big bump at 120 seconds (13159 queries took between 120 and 120.999 seconds, compared to 218 taking 119–119.999 and 197 taking 122–122.999). There's also a bump at 300 seconds (20, versus single digits before and after).

Anomie added a comment. Apr 5 2017, 6:29 PM

If I had to guess, I'd guess the big bump at 120 seconds is something killing the original query after 60 seconds, then MediaWiki retries once (see here and here) and the retried query is killed at 60 seconds too.

You can double-check my stats here: https://tendril.wikimedia.org/report/slow_queries?host=^db&user=wikiuser&schema=wik&hours=1 (last hour of queries). If the queries are abandoned by the application but are still running on the db without being killed (something I reported as a bug at T149421 and the other linked tasks), all the more reason to kill them at the db layer, while also fixing the HHVM bug. :-P

Change 346559 merged by Jcrespo:
[operations/software@master] Kill long running queries with stricter conditions

https://gerrit.wikimedia.org/r/346559

Mentioned in SAL (#wikimedia-operations) [2017-04-26T11:32:06Z] <jynus> applying new events_coredb_slave.sql on db2055 T160984

It works:

root@db2055.codfw.wmnet[ops]> SELECT * FROM event_log WHERE stamp > '2017-04-26 11:32'\G
*************************** 1. row ***************************
server_id: 180367367
    stamp: 2017-04-26 11:35:03
    event: wmf_slave_wikiuser_slow (>60)
  content: kill 883668901; SELECT /* ApiQueryRevisions::run  */  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user
*************************** 2. row ***************************
server_id: 180367367
    stamp: 2017-04-26 11:35:03
    event: wmf_slave_wikiuser_slow (>60)
  content: kill 883668901; SELECT /* ApiQueryRevisions::run  */  rev_id,rev_page,rev_text_id,rev_timestamp,rev_comment,rev_user
2 rows in set (0.04 sec)

I intend to mass-deploy this once 1.29.0-wmf.21 is everywhere.

Mentioned in SAL (#wikimedia-operations) [2017-04-28T15:40:31Z] <jynus> deploying new events_coredb_slave.sql on codfw T160984

jcrespo closed this task as Resolved. Apr 28 2017, 6:11 PM

This has been slowly deployed to all MediaWiki production slaves. No apparent errors on kibana (or at least no more than usual). This is the typical change that only has issues with weird race conditions. However, for now the only noticeable effect is that queries fail faster (but not much more often). I will continue to monitor for strange patterns, but as a first visible result, long query monitoring (SHOW PROCESSLIST) seems happier.

This is working nicely:

Note that despite the 60-second limit, that is a minimum: events execute every 30 seconds, with a lock preventing multiple instances from running at once. That means a query can take 60 + 30 + processing time seconds before being killed.
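To make that arithmetic concrete, the scheduling shape looks roughly like this (a sketch only, not the deployed events_coredb_slave.sql; the event name borrows from the log output above with a _sketch suffix):

DELIMITER ;;
-- Fires every 30 seconds; GET_LOCK prevents overlapping runs, so a query
-- started just after one pass can survive ~60s (threshold) + up to 30s
-- (waiting for the next pass) + the time the kill itself takes.
CREATE EVENT wmf_slave_wikiuser_slow_sketch
  ON SCHEDULE EVERY 30 SECOND
  DO BEGIN
    IF GET_LOCK('wmf_slave_wikiuser_slow_sketch', 0) = 1 THEN
      -- selection, event_log insert and KILL of >60s wikiuser queries go here
      DO RELEASE_LOCK('wmf_slave_wikiuser_slow_sketch');
    END IF;
  END;;
DELIMITER ;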

However, out of tens of thousands, I just saw one query being killed after 567 seconds of execution time on db2065. I will check for contention or logic failures; it is not important if one query out of many is not killed in time, but it could point to a bug.
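A simple way to spot stragglers like that 567-second case (thresholds and margin are illustrative only):

-- Anything from wikiuser still running well past threshold + event interval
-- + a margin suggests the kill event is blocked, lagging or skipping rows.
SELECT ID, TIME AS seconds_running, LEFT(INFO, 120) AS query_head
  FROM information_schema.PROCESSLIST
 WHERE USER = 'wikiuser'
   AND COMMAND = 'Query'
   AND TIME > 60 + 30 + 30   -- threshold + event interval + generous margin
 ORDER BY TIME DESC;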