
Large MySQL query to commonswiki.labsdb dies with `ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query`
Closed, Invalid; Public

Description

The query is

select cl_from, page_id, cl_type from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to order by page_id;

It is run against the commonswiki_p database on commonswiki.labsdb (an alias for s4.analytics.db.svc.eqiad.wmflabs).

The expected result size is >300 million rows:

MariaDB [commonswiki_p]> select count(*) from categorylinks,page where cl_type!="page" and page_namespace=14 and page_title=cl_to;
+-----------+
| count(*)  |
+-----------+
| 332109988 |
+-----------+
1 row in set (21 min 14.25 sec)

Event Timeline

This might be related to T208916 or the query killer (per @Krenair)

@Banyek This appears to be affected by the query killer wmf-pt-kill. What is the current limit placed on things? I wonder if this just goes too long or if it should be changed/tuned to allow longer queries?

@dschwen Do you have some idea what time it gets killed at? Can you set a timer in your script (if you haven't already)?

ok, let me add this and manually trigger a run...
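For reference, a minimal sketch of what such a timer could look like. The actual update script is not shown in this ticket, so everything here is an assumption: `timed_run` is a hypothetical helper, the command it wraps is a placeholder, and the log format is only loosely modeled on START/SUCCESS/FAILED lines with a UTC timestamp (the elapsed-seconds suffix is an addition).

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def timed_run(cmd, log=sys.stdout):
    """Run cmd, log a START line and a SUCCESS/FAILED line, return (ok, seconds)."""

    def stamp():
        # UTC wall-clock time for the log line.
        return datetime.now(timezone.utc).strftime("%a %b %d %H:%M:%S UTC %Y")

    print("START", stamp(), file=log)
    began = time.monotonic()  # monotonic clock, unaffected by wall-clock jumps
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - began
    ok = result.returncode == 0
    print("SUCCESS" if ok else "FAILED", stamp(), f"({elapsed:.0f}s)", file=log)
    return ok, elapsed
```

Called as, say, `timed_run(["sh", "update.sh"])` (a hypothetical script name), this records how long each run lasted and whether it exited cleanly, which is enough to tell an early connection failure apart from a query that was killed after many minutes.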

Update: 16 minutes and still running...

Dang, that test run succeeded (took 27min 3sec). Let me get back to this ticket tomorrow after a few updates ran. Let's see if I get any failures.

Ok, here is the log from the past 30h

START Thu Nov 8 22:27:10 UTC 2018
SUCCESS Thu Nov 8 22:54:13 UTC 2018
START Fri Nov 9 00:13:02 UTC 2018
SUCCESS Fri Nov 9 00:50:41 UTC 2018
START Fri Nov 9 02:13:01 UTC 2018
SUCCESS Fri Nov 9 02:43:59 UTC 2018
START Fri Nov 9 04:13:01 UTC 2018
SUCCESS Fri Nov 9 05:28:25 UTC 2018
START Fri Nov 9 06:13:01 UTC 2018
SUCCESS Fri Nov 9 07:20:39 UTC 2018
START Fri Nov 9 10:13:01 UTC 2018
FAILED Fri Nov 9 10:14:08 UTC 2018
START Fri Nov 9 12:13:01 UTC 2018
FAILED Fri Nov 9 12:14:07 UTC 2018
START Fri Nov 9 14:13:01 UTC 2018
FAILED Fri Nov 9 14:14:07 UTC 2018
START Fri Nov 9 16:13:01 UTC 2018
FAILED Fri Nov 9 16:14:06 UTC 2018
START Fri Nov 9 18:13:01 UTC 2018
FAILED Fri Nov 9 18:14:06 UTC 2018
START Fri Nov 9 20:13:01 UTC 2018
SUCCESS Fri Nov 9 20:38:38 UTC 2018
START Fri Nov 9 22:13:01 UTC 2018
SUCCESS Fri Nov 9 23:40:50 UTC 2018
START Sat Nov 10 00:13:01 UTC 2018
SUCCESS Sat Nov 10 00:52:30 UTC 2018
START Sat Nov 10 02:13:01 UTC 2018
SUCCESS Sat Nov 10 02:44:03 UTC 2018
START Sat Nov 10 04:13:01 UTC 2018
SUCCESS Sat Nov 10 04:53:33 UTC 2018

Looks like all failures happen about a minute after the start (65 to 67 seconds in), as if a connection to the SQL server could not be established, maybe? This does not match the failure mode I saw on Thursday, where the execution died ~20 min into the query (70% of the results had already been fetched, and it resulted in an incomplete data set).
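The run durations can be computed directly from the log. The sketch below pairs each START line with the line that follows it; the sample is a subset of the log entries above, and `run_durations` is a hypothetical helper written for this illustration, not part of the actual script.

```python
from datetime import datetime

# A few entries copied from the log above.
LOG = """\
START Thu Nov 8 22:27:10 UTC 2018
SUCCESS Thu Nov 8 22:54:13 UTC 2018
START Fri Nov 9 04:13:01 UTC 2018
SUCCESS Fri Nov 9 05:28:25 UTC 2018
START Fri Nov 9 10:13:01 UTC 2018
FAILED Fri Nov 9 10:14:08 UTC 2018
START Fri Nov 9 16:13:01 UTC 2018
FAILED Fri Nov 9 16:14:06 UTC 2018
"""

# "UTC" is matched literally; every timestamp in the log uses it.
FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def run_durations(log_text):
    """Pair each START with the next SUCCESS/FAILED line; return (status, seconds)."""
    runs, started = [], None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        status, _, stamp = line.partition(" ")
        when = datetime.strptime(stamp, FORMAT)
        if status == "START":
            started = when
        elif started is not None:
            runs.append((status, int((when - started).total_seconds())))
            started = None
    return runs


for status, seconds in run_durations(LOG):
    print(f"{status:8s}{seconds // 60:3d} min {seconds % 60:2d} s")
```

On this sample the two failed runs come out at 67 and 65 seconds, while the successful runs take 27 min 3 s and 75 min 24 s.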

I'll observe this for a bit longer.

Hm, my log now shows only successful DB queries. Now that I've made my code a bit more failure-resistant, I think we can just close the ticket for now. If this becomes a problem again, I'll open a new ticket. It seems obvious to me that the query killer is not the issue here: the queries that succeed are all in the 25-40 min range. Sorry for the disturbance.
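The ticket does not say what the failure-resistance changes were. One plausible approach, sketched here purely as an assumption, is to retry a failed run after a delay and to replace the previous result file only once a run has finished, so that a dropped connection never leaves an incomplete data set behind. `run_with_retries` and `write_atomically` are hypothetical names.

```python
import os
import tempfile
import time


def run_with_retries(task, attempts=3, delay=60.0, sleep=time.sleep):
    """Call task() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay * attempt)  # wait a little longer after each failure


def write_atomically(path, rows):
    """Write rows to a temp file and rename it over path only when complete."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            for row in rows:
                handle.write("\t".join(map(str, row)) + "\n")
        os.replace(tmp, path)  # atomic when both paths are on one filesystem
    except BaseException:
        os.unlink(tmp)  # discard the partial file, keep the old result
        raise
```

If the row source raises partway through (for example on a lost connection), the existing file at `path` is left untouched and the retry wrapper can start the whole run again.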