"ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers
Closed, Resolved · Public

Description

As I (T76699) and others (T76956) have found over months and years, the DB replicas are sometimes prone to dropping connections.

  • The problem here is not the "understandable" disconnections (after a few minutes or so), but immediate/spontaneous ones
  • Connections are dropped within seconds, or even sub-second, of the original connection or of the last SQL command in a program
  • I have seen this in both PHP and the sql command-line tool (the mysql command wrapper)
  • These things seem to happen "seasonally", as if a single server gets into a bad state
  • An option in php.ini could mitigate the issue, but is apparently not used on Toolforge
  • My workaround is to do ping/reconnect in many places in my code, which is horrible and still no guarantee (I have seen the connection dropped between a successful ping() and the SQL query run on literally the next code line); see the sketch below
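
For illustration, the ping/reconnect workaround looks roughly like this (a minimal sketch, assuming PHP with mysqli; the host and database are the ones from the example below, USER/PASSWORD are placeholders):

<?php
// Minimal sketch of the ping/reconnect workaround (PHP, mysqli).
// Host and database are from this task; USER/PASSWORD are placeholders.
function get_connection($db = null) {
    // Reuse the handle if the server still answers a ping.
    if ($db instanceof mysqli && @$db->ping()) {
        return $db;
    }
    // Otherwise (re)connect from scratch. Note the race described above:
    // the server can still drop the connection between this ping/connect
    // and the very next query.
    $db = new mysqli('commonswiki.labsdb', 'USER', 'PASSWORD', 'commonswiki_p');
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }
    return $db;
}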

Example (from just now; commands were entered as fast as I could type):

tools.mix-n-match@tools-bastion-03:~$ sql commonswiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 131076513
Server version: 10.0.22-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [commonswiki_p]> select * from page limit 1;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    131076733
Current database: commonswiki_p

+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+-----------------------+--------------------+-----------+
| page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_no_title_convert | page_content_model | page_lang |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+-----------------------+--------------------+-----------+
|       1 |              0 | Main_Page  |                   |            0 |                0 |           0 | 0.428510002587 | 20171113130155 | 20171113174937     |   255023016 |     6051 |                     0 | wikitext           | NULL      |
+---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+-----------------------+--------------------+-----------+
1 row in set (0.46 sec)

I've been shrugged off before, but now the connections start to drop while the query is running. Example query for Commons:

select distinct gil_wiki from globalimagelinks

That query runs in the sql tool in 15 seconds, but has repeatedly failed in a script of mine. It is not a long query, and the result is not exactly gigantic (a list of, essentially, all Wikimedia wikis). Right now, GLAM stats have been held up by this (the general issue, not just this query) for days. (The 10-connection limit, hundreds of GLAM groups to check, and the lack of job arrays are another issue, but that's a different battle front.)
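
For reference, a stripped-down version of such a failing script is sketched below (an illustration only, assuming PHP with mysqli; USER/PASSWORD are placeholders, not the tool's actual credentials):

<?php
// Stripped-down sketch of the failing case: one connection, one ~15-second query.
// USER/PASSWORD are placeholders; host, database and query are the ones above.
$db = new mysqli('commonswiki.labsdb', 'USER', 'PASSWORD', 'commonswiki_p');
if ($db->connect_errno) {
    die('Connect failed: ' . $db->connect_error . "\n");
}
$result = $db->query('SELECT DISTINCT gil_wiki FROM globalimagelinks');
if ($result === false) {
    // When the connection is dropped mid-query, this reports
    // errno 2006, "MySQL server has gone away".
    die('Query failed (' . $db->errno . '): ' . $db->error . "\n");
}
echo $result->num_rows . " wikis\n";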

I have tried both the "fast" and "slow" new replicas, and have seen the issue on both (and on the previous servers, months/years ago), so it seems to be something systematic.

I am aware that the symptoms are vague at best, and do not lend themselves to easy debugging, which is why I mostly tried to hack around them, but if I can't run small-result fifteen-second queries, my art has reached an end...

Magnus created this task. Nov 13 2017, 5:59 PM
Restricted Application added a subscriber: Aklapper. Nov 13 2017, 5:59 PM
bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.
bd808 renamed this task from Toolforge DB replicas timeout [again] to "ERROR 2006 (HY000): MySQL server has gone away" failures for a variety of queries against Wiki Replica servers.

Before we debug this: there is (potentially) a misunderstanding; check the line marked with the ***:

tools.mix-n-match@tools-bastion-03:~$ sql commonswiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 131076513
Server version: 10.0.22-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [commonswiki_p]> select * from page limit 1; <-- *** the query didn't fail here; it got disconnected beforehand, probably due to inactivity ***
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    131076733
Current database: commonswiki_p

It is common to get disconnected if idle on a terminal (that is controlled by server parameters, and we disconnect idle sessions quite frequently to avoid a pileup of idle connections). Sorry if that was already understood; I just wanted to start from common ground.
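
For what it's worth, a tool can read the timeouts that apply to its own session, so it knows how long it may sit idle (a sketch assuming PHP with mysqli and placeholder credentials; @@wait_timeout and @@interactive_timeout are the standard MariaDB variables, and the replica servers may be configured with lower values than the defaults):

<?php
// Read the idle-timeout settings that apply to the current session.
// USER/PASSWORD are placeholders; the replicas may override the defaults.
$db = new mysqli('commonswiki.labsdb', 'USER', 'PASSWORD', 'commonswiki_p');
$row = $db->query('SELECT @@wait_timeout AS wt, @@interactive_timeout AS it')->fetch_assoc();
echo "wait_timeout={$row['wt']}s, interactive_timeout={$row['it']}s\n";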

but now the connections start to drop while the query is running

That is weird. I have manually killed queries of yours (see T133705#3755257), but those were 2-day queries that looked like they were not intended to run for so long, not 15-second queries.

Sometimes the server crashes, as in T179244, but obviously that is not a common occurrence. Could you be suffering from an outdated connector (unlikely, if it is happening on the old server too)?

The view is simple:

[commonswiki_p]> SHOW CREATE VIEW globalimagelinks\G
*************************** 1. row ***************************
                View: globalimagelinks
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `globalimagelinks` AS select `commonswiki`.`globalimagelinks`.`gil_wiki` AS `gil_wiki`,`commonswiki`.`globalimagelinks`.`gil_page` AS `gil_page`,`commonswiki`.`globalimagelinks`.`gil_page_namespace_id` AS `gil_page_namespace_id`,`commonswiki`.`globalimagelinks`.`gil_page_namespace` AS `gil_page_namespace`,`commonswiki`.`globalimagelinks`.`gil_page_title` AS `gil_page_title`,`commonswiki`.`globalimagelinks`.`gil_to` AS `gil_to` from `commonswiki`.`globalimagelinks`
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0.00 sec)

But the execution plan (EXPLAIN) is not:

+----+-------------+------------------+-------+---------------+-----------------------+---------+-----+-----------+-------------+
| id | select_type | table            | type  | possible_keys | key                   | key_len | ref | rows      | Extra       |
+----+-------------+------------------+-------+---------------+-----------------------+---------+-----+-----------+-------------+
|  1 | SIMPLE      | globalimagelinks | index |               | globalimagelinks_wiki | 40      |     | 343949660 | Using index |
+----+-------------+------------------+-------+---------------+-----------------------+---------+-----+-----------+-------------+

Are you sure your client can handle 300 million rows?

I am sure my client can handle ~800 rows, which is what I get when I use DISTINCT, as shown...

And yes, I know the disconnect happened before the *** line, within the ~2 seconds it took me to type the command.

@madhuvishy @aborrero can you reproduce this (I do not have access to Toolforge)? Disconnecting so fast is very weird, and as far as I have seen, no one has ever reported such issues (aside from the relatively small wait_timeout/interactive_timeout, but not this low).

bd808 added a subscriber: bd808. Nov 13 2017, 9:59 PM

Not a definitive answer, but I seem to be able to run the example query using the sql wrapper (which is just a Python script that calls execvp to run /usr/bin/mysql) from a normal Toolforge account. @Magnus's report is that it works for him via that method as well, but fails from a scripted run related to a tool.

$ sql commonswiki
(u3518@commonswiki.labsdb) [commonswiki_p]> select distinct gil_wiki from globalimagelinks;
+----------------------+
| gil_wiki             |
+----------------------+
|                      |
| aawikibooks          |
...
| zuwiktionary         |
+----------------------+
805 rows in set (21.19 sec)

@Magnus Can you point me to a particular script where you commonly see this drop behavior? I'd like to see if I can find a way to "reliably" reproduce this intermittent failure. It seems like it could be on the client, the server, or even somewhere in between, so knowing these things may help us narrow the problem down:

  • Implementation language (PHP, Python, tcl, ...)
  • Is the failing process running on the Job Grid or in Kubernetes?
  • Is the failing process running interactively, via a webservice, or as a cron task?
  • Do we know which node in the Job Grid or Kubernetes is running the process when it fails?
  • Do we know timestamps for any/all of the failures?

I will list some source below; just let me emphasize that this is not limited to a specific query or type of query, programming language, tool, or webservice/shell. I even see this in PetScan, which is C++ and runs on its own VM, though it uses the DB replicas as well. Also, it does not seem to be reproducible in general, but tends to "cluster" time-wise: there are some periods where this happens a lot, and others where it is a rare occurrence. I have kept no log of those periods; I just thought of it as the cost of doing business, especially after my initial bug report was declined.

Update: I have added some more "re-run if failed" and "check connection/reconnect" code, and the GLAM data now seem to process more smoothly. It is unknown whether that is because of the new code or because something was fixed upstream.
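
The "re-run if failed" logic is, in spirit, something like the following (a sketch assuming PHP with mysqli, not the tool's actual code; USER/PASSWORD and the retry count are placeholders, and 2006/2013 are the standard client error codes for "server has gone away" / "lost connection during query"):

<?php
// Sketch of a "check connection / reconnect / re-run if failed" helper.
// USER/PASSWORD and the retry count are placeholders.
function query_with_retry(mysqli &$db, $sql, $max_attempts = 3) {
    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        // Reconnect if the handle no longer answers a ping.
        if (!@$db->ping()) {
            $db = @new mysqli('commonswiki.labsdb', 'USER', 'PASSWORD', 'commonswiki_p');
            if ($db->connect_errno) {
                sleep($attempt);
                continue;
            }
        }
        $result = $db->query($sql);
        if ($result !== false) {
            return $result;
        }
        // Only retry on "gone away" (2006) / "lost connection" (2013).
        if (!in_array($db->errno, array(2006, 2013))) {
            break;
        }
        sleep($attempt); // small back-off before re-running
    }
    throw new RuntimeException('Query failed after retries: ' . $db->error);
}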

So the current problem seems to be solved, but this keeps happening and points to an issue somewhere else, most likely the databases or the network. It would be great if it could be fixed, so I don't have to code against a hostile environment all the time...

1978Gage2001 moved this task from Triage to In progress on the DBA board. Dec 11 2017, 9:45 AM
Marostegui moved this task from In progress to Triage on the DBA board. Dec 11 2017, 11:06 AM
jcrespo closed this task as Resolved. Feb 7 2018, 3:36 PM
jcrespo claimed this task.

Sadly, labsdb IS a hostile environment, and the way to solve this, which is killing users who abuse resources, also causes complaints, so there is no easy fix. I will consider this fixed, and we can continue talking (on the mailing list, the wikis, and other tickets) about how to improve the architecture with a solution that works for all users.

jcrespo reopened this task as Open. Edited Mar 16 2018, 1:12 PM

I believe I found the original issue: we recently did a DNS failover, and when we do that, we do not expect tools to pick up the new proxy IPs immediately, so we wait a few hours and then, if there are still some connections left, we kill those manually so they can reconnect to the right IP.

Unlike most other applications, application s51184 seems to have cached the DNS response and not changed its entry point, despite the DNS having been changed hours ago and the TTL being set to 3600. We are not going to cause downtime on purpose, but applications that do not correctly follow RFC 1035 will likely suffer around 30 minutes of problems, and more issues in the future, as services need to be restarted regularly to get the latest security patches. s51184 is not the only one that does this, but I think it is the most important one/the one with the most traffic.

I want to contact all developers who are not respecting the TTLs so they can avoid those issues and have zero downtime.
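
One way for a long-running tool to respect the TTL is to never hold a single connection for longer than the TTL and to always reconnect by hostname, so the resolver is consulted again after a failover (a sketch assuming PHP with mysqli; the 3600-second TTL is the value quoted above, USER/PASSWORD are placeholders):

<?php
// Sketch: cap a connection's lifetime at the DNS TTL so a failover behind
// the same hostname is picked up within one TTL. USER/PASSWORD are placeholders.
class ReplicaConnection {
    const TTL = 3600;            // TTL quoted above, in seconds
    private $db = null;
    private $connected_at = 0;

    public function handle() {
        $expired = (time() - $this->connected_at) > self::TTL;
        if ($this->db === null || $expired || !@$this->db->ping()) {
            if ($this->db !== null) {
                @$this->db->close();
            }
            // Connecting by hostname (never by a cached IP) makes the
            // system resolver look the name up again here.
            $this->db = new mysqli('commonswiki.labsdb', 'USER', 'PASSWORD', 'commonswiki_p');
            if ($this->db->connect_errno) {
                throw new RuntimeException('Connect failed: ' . $this->db->connect_error);
            }
            $this->connected_at = time();
        }
        return $this->db;
    }
}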

I, for one, have no idea what s51184 is. It doesn't show up in the Toolforge "local" database list:

echo 'show databases;' | sql local | grep 51184

If there is an easy way to find the name of the tool from that, please let me know.

Either way, all my tools (AFAICT) use the default Toolforge systems (PHP and node.js). I am not aware of any of my tools requiring me to implement RFC 1035 myself. That would imply it is still a Toolforge bug, though maybe in a different department. Which one?

It is the numeric id of tools.catfood; I only have visibility of DB accounts at my layer.

I use https://tools.wmflabs.org/contact/ to find the relation between accounts and owners.

jcrespo closed this task as Resolved. Mar 16 2018, 1:35 PM

That would imply it is still a Toolforge bug

Ok, I will ask someone else.