ERROR 2013 (HY000): Lost connection to MySQL server during query on replicas
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Steinsplitter
	Jun 22 2019, 8:03 AM

Description

After a few minutes, the connection to the MySQL server is lost. This happens on large querys (since a few weeks) and since two(?) days on small query's as well.

Example (large query):

MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
+----------------+
| CURRENT_TIME() |
+----------------+
| 07:34:14       |
+----------------+
1 row in set (0.00 sec)

MariaDB [commonswiki_p]> SET SESSION optimizer_switch='materialization=off';
Query OK, 0 rows affected (0.06 sec)

MariaDB [commonswiki_p]> SELECT concat('File:', p.page_title) AS fqp
    ->     FROM page p
    -> LEFT JOIN templatelinks
    ->     ON tl_title IN ('Uncategorized', 'Uncategorized-Geograph', 'Check_categories', 'Uncategorized-PIB', 'Uncategorized-Polona')
    ->     AND tl_namespace = 10
    ->     AND page_id = tl_from
    -> LEFT JOIN categorylinks
    ->     ON page_id = cl_from
    ->     AND (
    ->         SELECT page_id AS id
    ->         FROM page ppp
    ->         WHERE ppp.page_namespace = 14
    ->             AND ppp.page_title = cl_to
    ->     ) NOT IN (
    ->         SELECT DISTINCT pp_page
    ->         FROM page_props
    ->         WHERE pp_propname IN ('hiddencat')
    ->     )
    -> WHERE p.page_namespace = 6
    ->     AND cl_from IS NULL
    ->     AND tl_from IS NULL
    ->     AND page_is_redirect = 0
    ->     -- group by p.page_id
    ->     -- order by p.page_latest asc
    -> LIMIT 100;
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    35319303
Current database: commonswiki_p

+----------------+
| CURRENT_TIME() |
+----------------+
| 07:39:26       |
+----------------+
1 row in set (4.00 sec)

Example (smaller query):

MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
+----------------+
| CURRENT_TIME() |
+----------------+
| 07:57:40       |
+----------------+
1 row in set (0.00 sec)

MariaDB [commonswiki_p]> SELECT
    ->  actor_name AS user,
    ->  COUNT(log_timestamp) AS ucount
    -> FROM logging_userindex
    -> INNER JOIN actor_logging
    ->  ON log_actor = actor_id
    -> WHERE log_type = 'delete'
    -> AND log_action = 'delete'
    -> GROUP BY actor_name
    -> ORDER by ucount DESC;
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [commonswiki_p]> SELECT CURRENT_TIME();
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    35396331
Current database: commonswiki_p

+----------------+
| CURRENT_TIME() |
+----------------+
| 08:02:55       |
+----------------+
1 row in set (5.20 sec)

Related Objects

Mentioned In: T226050: Wiki Replicas are very slow and timing out
Mentioned Here: T222978: Compress and defragment tables on labsdb hosts

Event Timeline

Steinsplitter created this task.Jun 22 2019, 8:03 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 22 2019, 8:03 AM

I'm affected by the same problem:

2019-06-22 06:54:04,359 ERROR Subprocess error for itwikisource/Utenti/Utenti_per_numero_di_pagine_create_nel_namespace_principale
Traceback (most recent call last):
  File "/data/project/lists/scheduler.py", line 78, in process_list
    '.analytics.db.svc.eqiad.wmflabs -BN < ' + sql_path + ' > ' + tmp_path, shell=True, check=True)
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'mysql --defaults-file=~/replica.my.cnf -h itwikisource.analytics.db.svc.eqiad.wmflabs -BN < /data/project/lists/query/itwikisource/Utenti/Utenti_per_numero_di_pagine_create_nel_namespace_principale.sql > /data/project/lists/output/itwikisource/Utenti/Utenti_per_numero_di_pagine_create_nel_namespace_principale.tmp' returned non-zero exit status 1

JJMC89 removed a project: Toolforge.Jun 22 2019, 9:04 AM

It is because labsdb1010 is serving (temporarily while we do some maintenance on 1011) analytics but still has the query killer set to 300 seconds instead of 14400 (14400 is the one we use for the long query hosts). I have changed it and it is now set to 14400 so it should not be killing those small queries anymore.

Framawiki subscribed.Jun 22 2019, 8:35 PM

• MZMcBride subscribed.Jun 23 2019, 5:43 AM

In T226297#5275652, @Marostegui wrote:

It is because labsdb1010 is serving (temporarily while we do some maintenance on 1011) analytics but still has the query killer set to 300 seconds instead of 14400 (14400 is the one we use for the long query hosts). I have changed it and it is now set to 14400 so it should not be killing those small queries anymore.

Thank you :-).

I also noticed a decrease of the database speed in general on the replicas (on all tables). Is this related to the maintenance as well?

Yeah, essentially we have 3 hosts. Usually only one of them is dedicated to the long queries (analytics) and 2 of the to the web service (fast queries), but due to the maintenance (T222978) we have now 1 host serving analytics which also serves a portion of web, and hence it is more loaded than normal.
This is the change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518029/

Edgars2007 subscribed.Jun 28 2019, 12:57 AM

Per T226297#5275652

bd808 mentioned this in T226050: Wiki Replicas are very slow and timing out.Jul 1 2019, 4:01 AM

ERROR 2013 (HY000): Lost connection to MySQL server during query on replicasClosed, ResolvedPublicActions

Description

Related Objects

Event Timeline

ERROR 2013 (HY000): Lost connection to MySQL server during query on replicas
Closed, ResolvedPublic
Actions