Long queries not being killed
Closed, ResolvedPublic2 Estimated Story Points
Actions

Assigned To

Authored By

	Samwilson
	Jul 21 2017, 4:07 AM

Description

Are long/slow queries being killed correctly (and logged) on XTools servers?

Errors such as the following ("MySQL server has gone away") are being raised regularly, perhaps because the MySQL server is killing queries rather than them being killed by the script.

[2017-07-20 22:27:35] request.CRITICAL: Exception thrown when handling an exception (Twig_Error_Runtime: An exception has been thrown during the rendering of a template ("An exception occurred while executing 'SELECT lag FROM heartbeat_p.heartbeat h RIGHT JOIN meta_p.wiki w ON concat(h.shard, ".labsdb")=w.slice WHERE dbname LIKE :project OR name LIKE :project OR url LIKE :project LIMIT 1' with params ["enwiki"]: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away"). at /var/www/app/Resources/views/base.html.twig line 146) {"exception":"[object] (Twig_Error_Runtime(code: 0): An exception has been thrown during the rendering of a template (\"An exception occurred while executing 'SELECT lag FROM heartbeat_p.heartbeat h\n RIGHT JOIN meta_p.wiki w ON concat(h.shard, \".labsdb\")=w.slice\n WHERE dbname LIKE :project OR name LIKE :project OR url LIKE :project LIMIT 1' with params [\"enwiki\"]:\n\nSQLSTATE[HY000]: General error: 2006 MySQL server has gone away\"). at /var/www/app/Resources/views/base.html.twig:146, Doctrine\\DBAL\\Exception\\DriverException(code: 0): An exception occurred while executing 'SELECT lag FROM heartbeat_p.heartbeat h\n RIGHT JOIN meta_p.wiki w ON concat(h.shard, \".labsdb\")=w.slice\n WHERE dbname LIKE :project OR name LIKE :project OR url LIKE :project LIMIT 1' with params [\"enwiki\"]:\n\nSQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php:115, Doctrine\\DBAL\\Driver\\PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:93, PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /var/www/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:91)"} []

Related Objects

Mentioned In: T163284: Optimize edit count queries in XTools
T71127: Discrepancies with logging table on different wikis
Mentioned Here: T71127: Discrepancies with logging table on different wikis

Event Timeline

Samwilson created this task.Jul 21 2017, 4:07 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 21 2017, 4:07 AM

Samwilson triaged this task as Medium priority.Jul 21 2017, 4:07 AM

Samwilson added a project: Community-Tech-Sprint.

Samwilson set the point value for this task to 2.

Samwilson updated the task description. (Show Details)Jul 21 2017, 4:12 AM

Matthewrbowker moved this task from Backlog to Working on the XTools board.Jul 21 2017, 4:55 AM

The killer is actually killing now; see https://xtools.wmflabs.org/killed_slow_queries.txt for its victims.

The problem was that it was querying the wrong database server (oops). It was using toolsdb which is only used for tiny writes for usage hits, and so there was never anything killed.

I've also changed it from being launched by cron to be a daemon. It can be killed at any time by creating /tmp/kill-pt-sentinel.

Docs updated.

Yay! Thanks for this. Hopefully this will solve the issue @DannyH ran into the other day, where one of the Apache instances was hanging. Just a guess... I have the same issue on my local, where the server is slow if it's waiting on queries to finish.

To no avail, I did some research to see if there was a way to recover a session. So say I query the Edit Counter for MusikAnimal. It's hanging, waiting to finish queries, I get impatient and hit refresh to try again. I assume at this point it creates a new set of queries, not going off of the cache because the original ones hadn't finished yet. Do you think there's anyway to recover the last request? Maybe the server could be like "Oh, these exact same queries are running right now, let's use those" kind of thing. I don't know if this is possible but it might be worth our time to find out.

@Samwilson: Wow, that log is already enormous. Do we have log rotation in place for it? Maybe we should have it only log the killed queries rather than every single check (which happens every 45 seconds).

This is apparently killing queries from the legacy XTools, as evidenced by the presence of s51187__xtools_tmp.

I also noticed it killed a query that processed this page, which consists of only 7 revisions. I ran the query manually and it finished in 0.02 seconds. I think something is awry!

Another oddity, some queries on databases other than enwiki are crazy slow. Enwiki I assume has the biggest logging table, but check this out:

SELECT log_timestamp, log_params FROM `enwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (0.86 sec)

And on frwiki (a similar one was killed via our script):

SELECT log_timestamp, log_params FROM `frwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (1 min 24.44 sec)

Maybe this is partly because of the availability zones, having to make the round trip to wherever frwiki is hosted?

In T171264#3461402, @kaldari wrote:

@Samwilson: Wow, that log is already enormous. Do we have log rotation in place for it? Maybe we should have it only log the killed queries rather than every single check (which happens every 45 seconds).

Sorry, yes I left the verbosiness in! Gone now. And it is being rotated, according to /var/lib/logrotate/status it's fine, but you're right it wasn't doing it. I've manually rotated it just now, and will try to figure out why it's not doing it on its own.

In T171264#3461461, @MusikAnimal wrote:

This is apparently killing queries from the legacy XTools, as evidenced by the presence of s51187__xtools_tmp.

I also noticed it killed a query that processed this page, which consists of only 7 revisions. I ran the query manually and it finished in 0.02 seconds. I think something is awry!

Yeah, it's connecting as the production XTools DB user, which is used for both old and new XTools. There isn't another way of creating a new DB user (other than creating a new Toolforge account). It shouldn't matter though, because it's supposed to be only killing long-running statements, and we probably want to do that on old XTools as well. We can specify a tighter regex than (select|Select|SELECT) though (but that gets hard I think, because of things like leading comments etc.).

The logs are not being rotated. :-(

Yes they are. They're being rotated daily, not after 1kB as requested (there's some confusion it seems about the interaction between daily rotation and size limiting... but I think daily is fine for what want isn't it?).

Matthewrbowker moved this task from Working to Complete on the XTools board.Jul 24 2017, 9:35 AM

In T171264#3461477, @MusikAnimal wrote:
Another oddity, some queries on databases other than enwiki are crazy slow. Enwiki I assume has the biggest logging table, but check this out:
SELECT log_timestamp, log_params FROM `enwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (0.86 sec)
And on frwiki (a similar one was killed via our script):
SELECT log_timestamp, log_params FROM `frwiki_p`.`logging_logindex`
WHERE log_type = 'block'
AND log_action = 'block'
AND log_timestamp > 0
AND log_title = 'MusikAnimal';

Empty set (1 min 24.44 sec)

FYI this is because of T71127

kaldari mentioned this in T71127: Discrepancies with logging table on different wikis.Jul 25 2017, 12:21 AM

• DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint.Aug 2 2017, 12:29 AM

• DannyH moved this task from New & TBD Tickets to Archive on the Community-Tech board.

MusikAnimal mentioned this in T163284: Optimize edit count queries in XTools.Aug 4 2017, 5:49 PM

I assume there is a AND log_namespace = 0 missing on the query, probably those indexes are a bug rather than a feature.

In T171264#3505181, @jcrespo wrote:

I assume there is a AND log_namespace = 0 missing on the query, probably those indexes are a bug rather than a feature.

Yes that does the trick (with log_namespace = 2)! :D This was overlooked because we're checking a block log, which only exists on users, and log_namespace feels like it was meant to pertain to a page. Anyway thanks for the tip! 1 min 24 secs is now down to 0.00 secs

Long queries not being killedClosed, ResolvedPublic2 Estimated Story PointsActions

Description

Related Objects

Event Timeline

Long queries not being killed
Closed, ResolvedPublic2 Estimated Story Points
Actions