Tool Labs queries die
Closed, Resolved, Public

Description

Queries died after 5 hours, which is frustrating when trying to precompute a look-up table. The Toolserver had SLOW_OK to mark queries designed to run longer than 20 minutes, and it sent an email when the query-killer killed them.

mysql> SELECT @@connect_timeout, @@interactive_timeout, @@long_query_time, @@wait_timeout;
+-------------------+-----------------------+-------------------+----------------+
| @@connect_timeout | @@interactive_timeout | @@long_query_time | @@wait_timeout |
+-------------------+-----------------------+-------------------+----------------+
|                 3 |                 28800 |         10.000000 |          28800 |
+-------------------+-----------------------+-------------------+----------------+

T127228 may be related

Event Timeline

Dispenser set the priority of this task to Needs Triage.
Dispenser updated the task description.
Dispenser added projects: DBA, Toolforge.
Dispenser subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

It now dies after 1 hour!
https://commons.wikimedia.org/wiki/User:Dispenser/Wrong_Extension
Sat Feb 20 13:20:32 UTC 2016
ERROR 2013 (HY000) at line 2: Lost connection to MySQL server during query
Sat Feb 20 14:20:32 UTC 2016

Dispenser renamed this task from "Labs queries die after 5 hours" to "Tool Labs queries die". Feb 22 2016, 4:05 AM
Dispenser updated the task description.

Five-hour queries are not allowed on the replica servers. The workaround is to split the query into smaller ones.
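As a rough illustration of splitting a query into smaller ones, here is a minimal Python sketch that walks a table in fixed-size primary-key windows so that no single statement runs for hours. The query text, the window size, and the `run_in_chunks` helper are assumptions made for the example, not something prescribed in this task; `cursor` is assumed to be any DB-API cursor that accepts `%s` placeholders.

```python
from typing import Iterator, List, Tuple


def id_ranges(min_id: int, max_id: int, step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) ID windows that cover min_id..max_id."""
    if step <= 0:
        raise ValueError("step must be positive")
    low = min_id
    while low <= max_id:
        high = min(low + step - 1, max_id)
        yield low, high
        low = high + 1


# Illustrative query only: substitute the real table, columns, and filters.
QUERY = "SELECT page_id, page_title FROM page WHERE page_id BETWEEN %s AND %s"


def run_in_chunks(cursor, min_id: int, max_id: int, step: int = 100_000) -> List[tuple]:
    """Run QUERY once per ID window and collect all rows (hypothetical helper)."""
    rows: List[tuple] = []
    for low, high in id_ranges(min_id, max_id, step):
        cursor.execute(QUERY, (low, high))
        rows.extend(cursor.fetchall())
    return rows
```

Each window is a short, independent statement, so a killed or dropped connection costs one window rather than the whole run.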

I temporarily restricted queries to 1 hour until I could find the tool responsible for the slowness on labsdb1003 (T127228). I've reverted that change now.

I should explain that the "5 hour queries" limit is not definitive; we are temporarily running with reduced redundancy due to a hardware failure, so I am kindly asking you not to overload the current service. Hopefully *that will change* in the near future.

Also, if you have suggestions on how to combine OLTP and OLAP queries for over 5000 users, I would like to hear them. I am open to suggestions, but even more open to suggestions that include patches. :-)

The query-killer daemon (IIRC) was written in Java and is open source in DaB's Toolserver SVN repo. It recognized these flags:

  • /* SLOW_OK */ Indicates the query is not a runaway or butchered query.
  • /* LIMIT:XX */ Kill the query after XX seconds; useful to limit interactive CGI processes. MySQL has since implemented statement timeouts.
  • /* NM */ No Mail. Avoid sending email of the killed SQL to the user account.
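To make the flag syntax concrete, here is a small Python sketch of how such comment flags could be pulled out of a SQL string. The exact matching rules of the original Java daemon are not given in this thread, so the regular expressions, the `QueryFlags` type, and `parse_flags` are assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed patterns: each flag is a SQL comment, with optional inner whitespace.
_SLOW_OK = re.compile(r"/\*\s*SLOW_OK\s*\*/")
_LIMIT = re.compile(r"/\*\s*LIMIT:(\d+)\s*\*/")
_NO_MAIL = re.compile(r"/\*\s*NM\s*\*/")


@dataclass(frozen=True)
class QueryFlags:
    slow_ok: bool                  # query is intentionally long-running
    limit_seconds: Optional[int]   # kill after this many seconds, if set
    no_mail: bool                  # do not email the killed SQL to the user


def parse_flags(sql: str) -> QueryFlags:
    """Extract the SLOW_OK, LIMIT:XX and NM flags from a SQL statement."""
    limit = _LIMIT.search(sql)
    return QueryFlags(
        slow_ok=bool(_SLOW_OK.search(sql)),
        limit_seconds=int(limit.group(1)) if limit else None,
        no_mail=bool(_NO_MAIL.search(sql)),
    )
```

For example, `parse_flags("SELECT /* LIMIT:30 */ /* NM */ 1")` yields a 30-second limit with mail suppressed and `slow_ok` left false.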

Now that ran every 60 seconds, which wasn't fast enough for me. So I reimplemented it in Python, running in screen (for monitoring); it runs every 10 seconds and kills in 1 second. It also has a frenzy mode for when max_user_connections is nearly tapped out (someone thought 15 connections was enough when browsers had 6-connection pipelining).
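For readers unfamiliar with how such a killer picks its targets, the following is a hedged Python sketch of one polling step, not the actual script described above. It assumes the rows come from `SHOW FULL PROCESSLIST` as dictionaries with the standard `Id`, `Command`, `Time`, and `Info` columns, and it borrows the 20-minute figure from the task description as a default limit; the function name and the SLOW_OK-only exemption are assumptions.

```python
from typing import Iterable, List, Mapping

# Assumption: default limit taken from the 20-minute figure in the description.
DEFAULT_LIMIT_SECONDS = 20 * 60


def queries_to_kill(
    processlist: Iterable[Mapping[str, object]],
    limit_seconds: int = DEFAULT_LIMIT_SECONDS,
) -> List[int]:
    """Return thread IDs of running queries that exceeded the limit."""
    victims: List[int] = []
    for row in processlist:
        if row.get("Command") != "Query":
            continue  # sleeping or non-query threads are left alone
        info = str(row.get("Info") or "")
        if "/* SLOW_OK */" in info:
            continue  # author marked the query as intentionally slow
        if int(row.get("Time") or 0) > limit_seconds:
            victims.append(int(row["Id"]))
    return victims
```

A polling loop would call this every few seconds and issue `KILL QUERY <id>` for each returned ID; the frenzy mode and email notification are left out of the sketch.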

I have my own scripts to control wild queries; however, not every user uses that syntax (which has never been documented for our servers), and I was strongly advised not to impose per-user limits.

In fact, I have been heavily criticized in public by another user for imposing limits per user or on very long queries, on the grounds that "we should just buy more servers". Tools users should agree on such limits among themselves, and I will gladly implement them, but I cannot do so without your consensus.

Let me propose something: is this specific issue solved, or do you still have problems?

After that, we can create a new ticket/mail thread to discuss a new tools policy that everybody agrees with.

Change 272965 had a related patch set uploaded (by Jcrespo):
labsdb1003 is a bit overloaded right now, move commonswiki to 1

https://gerrit.wikimedia.org/r/272965

Change 272965 merged by Jcrespo:
labsdb1003 is a bit overloaded right now, move commonswiki to 1

https://gerrit.wikimedia.org/r/272965

jcrespo mentioned this in Unknown Object (Task). Feb 25 2016, 9:51 AM
chasemp triaged this task as Medium priority. Apr 4 2016, 2:28 PM

Queries have been killed after 2 hours since June, leaving data for tools stale.

jcrespo claimed this task.

Dispenser, we just finished setting up 3 new labsdb servers with 2 separate entry points: one for fast web requests, and another for analytics-like queries. On the second set, we will allow longer-running queries as long as they are reasonable. Sadly, long-running queries on the current servers would make them crash due to OOM, so a limit had to be enforced (not something we liked).

Please wait for the announcement, but I believe that will fulfill your needs regarding long-running queries. If you need access now and cannot wait for proper documentation, please contact me.

The Toolserver actually tried this: a "fast" server that would kill queries after 60 seconds. It was abandoned because tool authors wouldn't rewrite their scripts to use it. The servers were converted to secondaries in a DNS round-robin for reliability. Ultimately, the decision was made to break scripts by requiring SLOW_OK to indicate non-runaway queries.

We are flexible: in the future we may be able to assign different users or queries to different services transparently by query inspection. Allow us to try offering the 2 services. Increasing the query time limit was explicitly requested by users, and with the current hardware that wasn't possible. The final mapping of existing users to the analytics or web request service, and how that will be done, is WIP. Even in the worst-case scenario, if we fail to provide such a service, 3 newer servers with high availability and InnoDB, which we can fix easily by copying from production, will be better than 2 old, dying servers with TokuDB, which are very difficult to fix (poor failover mechanisms) and have no maintenance windows.

Support to help developers make code changes may come in the future (or at least we are trying to get it), along with better documentation. I believe it is better to try and fail than to do nothing :-), and we intend the change to be as transparent as possible, so no code changes should be needed by tools (we have to migrate away from the current servers no matter what).