Page MenuHomePhabricator

ERROR 2013 (HY000): Lost connection to MySQL server during query on replicas
Closed, ResolvedPublic

Description

After a few minutes, the connection to the MySQL server is lost. This happens on large querys (since a few weeks) and since two(?) days on small query's as well.

Example (large query):

MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
+----------------+
| CURRENT_TIME() |
+----------------+
| 07:34:14       |
+----------------+
1 row in set (0.00 sec)

MariaDB [commonswiki_p]> SET SESSION optimizer_switch='materialization=off';
Query OK, 0 rows affected (0.06 sec)

MariaDB [commonswiki_p]> SELECT concat('File:', p.page_title) AS fqp
    ->     FROM page p
    -> LEFT JOIN templatelinks
    ->     ON tl_title IN ('Uncategorized', 'Uncategorized-Geograph', 'Check_categories', 'Uncategorized-PIB', 'Uncategorized-Polona')
    ->     AND tl_namespace = 10
    ->     AND page_id = tl_from
    -> LEFT JOIN categorylinks
    ->     ON page_id = cl_from
    ->     AND (
    ->         SELECT page_id AS id
    ->         FROM page ppp
    ->         WHERE ppp.page_namespace = 14
    ->             AND ppp.page_title = cl_to
    ->     ) NOT IN (
    ->         SELECT DISTINCT pp_page
    ->         FROM page_props
    ->         WHERE pp_propname IN ('hiddencat')
    ->     )
    -> WHERE p.page_namespace = 6
    ->     AND cl_from IS NULL
    ->     AND tl_from IS NULL
    ->     AND page_is_redirect = 0
    ->     -- group by p.page_id
    ->     -- order by p.page_latest asc
    -> LIMIT 100;
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    35319303
Current database: commonswiki_p

+----------------+
| CURRENT_TIME() |
+----------------+
| 07:39:26       |
+----------------+
1 row in set (4.00 sec)

Example (smaller query):

MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
+----------------+
| CURRENT_TIME() |
+----------------+
| 07:57:40       |
+----------------+
1 row in set (0.00 sec)

MariaDB [commonswiki_p]> SELECT
    ->  actor_name AS user,
    ->  COUNT(log_timestamp) AS ucount
    -> FROM logging_userindex
    -> INNER JOIN actor_logging
    ->  ON log_actor = actor_id
    -> WHERE log_type = 'delete'
    -> AND log_action = 'delete'
    -> GROUP BY actor_name
    -> ORDER by ucount DESC;
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    35396331
Current database: commonswiki_p

+----------------+
| CURRENT_TIME() |
+----------------+
| 08:02:55       |
+----------------+
1 row in set (5.20 sec)

Event Timeline

I'm affected by the same problem:

2019-06-22 06:54:04,359 ERROR Subprocess error for itwikisource/Utenti/Utenti_per_numero_di_pagine_create_nel_namespace_principale
Traceback (most recent call last):
  File "/data/project/lists/scheduler.py", line 78, in process_list
    '.analytics.db.svc.eqiad.wmflabs -BN < ' + sql_path + ' > ' + tmp_path, shell=True, check=True)
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'mysql --defaults-file=~/replica.my.cnf -h itwikisource.analytics.db.svc.eqiad.wmflabs -BN < /data/project/lists/query/itwikisource/Utenti/Utenti_per_numero_di_pagine_create_nel_namespace_principale.sql > /data/project/lists/output/itwikisource/Utenti/Utenti_per_numero_di_pagine_create_nel_namespace_principale.tmp' returned non-zero exit status 1

It is because labsdb1010 is serving (temporarily while we do some maintenance on 1011) analytics but still has the query killer set to 300 seconds instead of 14400 (14400 is the one we use for the long query hosts). I have changed it and it is now set to 14400 so it should not be killing those small queries anymore.

It is because labsdb1010 is serving (temporarily while we do some maintenance on 1011) analytics but still has the query killer set to 300 seconds instead of 14400 (14400 is the one we use for the long query hosts). I have changed it and it is now set to 14400 so it should not be killing those small queries anymore.

Thank you :-).

I also noticed a decrease of the database speed in general on the replicas (on all tables). Is this related to the maintenance as well?

Yeah, essentially we have 3 hosts. Usually only one of them is dedicated to the long queries (analytics) and 2 of the to the web service (fast queries), but due to the maintenance (T222978) we have now 1 host serving analytics which also serves a portion of web, and hence it is more loaded than normal.
This is the change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518029/

bd808 assigned this task to Marostegui.
bd808 subscribed.