Page MenuHomePhabricator

Long queries not being killed
Closed, ResolvedPublic2 Estimated Story Points

Description

Are long/slow queries being killed correctly (and logged) on XTools servers?

Errors such as the following ("MySQL server has gone away") are being raised regularly, perhaps because the MySQL server is killing queries rather than them being killed by the script.

[2017-07-20 22:27:35] request.CRITICAL: Exception thrown when handling an exception (Twig_Error_Runtime: An exception has been thrown during the rendering of a template ("An exception occurred while executing 'SELECT lag FROM heartbeat_p.heartbeat h RIGHT JOIN meta_p.wiki w ON concat(h.shard, ".labsdb")=w.slice WHERE dbname LIKE :project OR name LIKE :project OR url LIKE :project LIMIT 1' with params ["enwiki"]: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away"). at /var/www/app/Resources/views/base.html.twig line 146) {"exception":"[object] (Twig_Error_Runtime(code: 0): An exception has been thrown during the rendering of a template (\"An exception occurred while executing 'SELECT lag FROM heartbeat_p.heartbeat h\n RIGHT JOIN meta_p.wiki w ON concat(h.shard, \".labsdb\")=w.slice\n WHERE dbname LIKE :project OR name LIKE :project OR url LIKE :project LIMIT 1' with params [\"enwiki\"]:\n\nSQLSTATE[HY000]: General error: 2006 MySQL server has gone away\"). at /var/www/app/Resources/views/base.html.twig:146, Doctrine\\DBAL\\Exception\\DriverException(code: 0): An exception occurred while executing 'SELECT lag FROM heartbeat_p.heartbeat h\n RIGHT JOIN meta_p.wiki w ON concat(h.shard, \".labsdb\")=w.slice\n WHERE dbname LIKE :project OR name LIKE :project OR url LIKE :project LIMIT 1' with params [\"enwiki\"]:\n\nSQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php:115, Doctrine\\DBAL\\Driver\\PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:93, PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:91)"} []

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 21 2017, 4:07 AM
Samwilson triaged this task as Medium priority.Jul 21 2017, 4:07 AM
Samwilson set the point value for this task to 2.
Samwilson updated the task description. (Show Details)Jul 21 2017, 4:12 AM
Matthewrbowker moved this task from Backlog to Working on the XTools board.Jul 21 2017, 4:55 AM

The killer is actually killing now; see https://xtools.wmflabs.org/killed_slow_queries.txt for its victims.

The problem was that it was querying the wrong database server (oops). It was using toolsdb which is only used for tiny writes for usage hits, and so there was never anything killed.

I've also changed it from being launched by cron to be a daemon. It can be killed at any time by creating /tmp/kill-pt-sentinel.

Docs updated.

MusikAnimal added subscribers: DannyH, MusikAnimal.

Yay! Thanks for this. Hopefully this will solve the issue @DannyH ran into the other day, where one of the Apache instances was hanging. Just a guess... I have the same issue on my local, where the server is slow if it's waiting on queries to finish.

To no avail, I did some research to see if there was a way to recover a session. So say I query the Edit Counter for MusikAnimal. It's hanging, waiting to finish queries, I get impatient and hit refresh to try again. I assume at this point it creates a new set of queries, not going off of the cache because the original ones hadn't finished yet. Do you think there's anyway to recover the last request? Maybe the server could be like "Oh, these exact same queries are running right now, let's use those" kind of thing. I don't know if this is possible but it might be worth our time to find out.

@Samwilson: Wow, that log is already enormous. Do we have log rotation in place for it? Maybe we should have it only log the killed queries rather than every single check (which happens every 45 seconds).

MusikAnimal added a comment.EditedJul 21 2017, 6:23 PM

This is apparently killing queries from the legacy XTools, as evidenced by the presence of s51187__xtools_tmp.

I also noticed it killed a query that processed this page, which consists of only 7 revisions. I ran the query manually and it finished in 0.02 seconds. I think something is awry!

Another oddity, some queries on databases other than enwiki are crazy slow. Enwiki I assume has the biggest logging table, but check this out:

SELECT log_timestamp, log_params FROM `enwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (0.86 sec)

And on frwiki (a similar one was killed via our script):

SELECT log_timestamp, log_params FROM `frwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (1 min 24.44 sec)

Maybe this is partly because of the availability zones, having to make the round trip to wherever frwiki is hosted?

@Samwilson: Wow, that log is already enormous. Do we have log rotation in place for it? Maybe we should have it only log the killed queries rather than every single check (which happens every 45 seconds).

Sorry, yes I left the verbosiness in! Gone now. And it is being rotated, according to /var/lib/logrotate/status it's fine, but you're right it wasn't doing it. I've manually rotated it just now, and will try to figure out why it's not doing it on its own.

This is apparently killing queries from the legacy XTools, as evidenced by the presence of s51187__xtools_tmp.

I also noticed it killed a query that processed this page, which consists of only 7 revisions. I ran the query manually and it finished in 0.02 seconds. I think something is awry!

Yeah, it's connecting as the production XTools DB user, which is used for both old and new XTools. There isn't another way of creating a new DB user (other than creating a new Toolforge account). It shouldn't matter though, because it's supposed to be only killing long-running statements, and we probably want to do that on old XTools as well. We can specify a tighter regex than (select|Select|SELECT) though (but that gets hard I think, because of things like leading comments etc.).

The logs are not being rotated. :-(

Samwilson closed this task as Resolved.Jul 24 2017, 2:43 AM
Samwilson moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.

Yes they are. They're being rotated daily, not after 1kB as requested (there's some confusion it seems about the interaction between daily rotation and size limiting... but I think daily is fine for what want isn't it?).

Matthewrbowker moved this task from Working to Complete on the XTools board.Jul 24 2017, 9:35 AM

Another oddity, some queries on databases other than enwiki are crazy slow. Enwiki I assume has the biggest logging table, but check this out:

SELECT log_timestamp, log_params FROM `enwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (0.86 sec)

And on frwiki (a similar one was killed via our script):

SELECT log_timestamp, log_params FROM `frwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (1 min 24.44 sec)

FYI this is because of T71127

DannyH moved this task from New & TBD Tickets to Archive on the Community-Tech board.
jcrespo added a subscriber: jcrespo.Aug 7 2017, 8:18 AM

I assume there is a AND log_namespace = 0 missing on the query, probably those indexes are a bug rather than a feature.

I assume there is a AND log_namespace = 0 missing on the query, probably those indexes are a bug rather than a feature.

Yes that does the trick (with log_namespace = 2)! :D This was overlooked because we're checking a block log, which only exists on users, and log_namespace feels like it was meant to pertain to a page. Anyway thanks for the tip! 1 min 24 secs is now down to 0.00 secs