As I (T76699) and others (T76956) have found over month and years, DB replicas are sometimes prone to disconnect.
- The problem here are not "understandable" disconnections (after a few minutes or so), but immediate/spontaneous ones
- Connections are dropped in second/sub-second times, after the original connection, or after the last SQL command in a program
- I found this in both PHP and the sql command line tool (the mysql command wrapper)
- These things do seem to happen "seasonal", as if a single server gets into a bad state
- An option in php.ini could mitigate the issue, but is apparently not used on Toolforge
- My workaround is to do ping/reconnect at many places in my code, which is horrible, and no guarantee (I've seen the connection being dropped between the successful ping() and the SQL query run on literally the next code line)
Example (just now; commands were entered as fast as I can type):
tools.mix-n-match@tools-bastion-03:~$ sql commonswiki Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 131076513 Server version: 10.0.22-MariaDB MariaDB Server Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [commonswiki_p]> select * from page limit 1; ERROR 2006 (HY000): MySQL server has gone away No connection. Trying to reconnect... Connection id: 131076733 Current database: commonswiki_p +---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+-----------------------+--------------------+-----------+ | page_id | page_namespace | page_title | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random | page_touched | page_links_updated | page_latest | page_len | page_no_title_convert | page_content_model | page_lang | +---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+-----------------------+--------------------+-----------+ | 1 | 0 | Main_Page | | 0 | 0 | 0 | 0.428510002587 | 20171113130155 | 20171113174937 | 255023016 | 6051 | 0 | wikitext | NULL | +---------+----------------+------------+-------------------+--------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+-----------------------+--------------------+-----------+ 1 row in set (0.46 sec)
Now, I've been shrugged off before, but now the connections start to drop while the query is running. Example query for Commons:
select distinct gil_wiki from globalimagelinks
That query runs in the sql tool in 15 seconds, but has repeatedly failed in a script of mine. It is not a long query, and the result is not exactly gigantic (a list of, essentially, all Wikimedia wikis). Right now, GLAM stats have been held up by this (the general issue, not just the query) for days. (The 10 connection limit, hundreds of GLAM groups to check, and no job arrays being another, but that's a different battle front).
I have tried both the "fast" and "slow" new replicas, and have seen the issue on both (and on the previous servers, months/years ago), so it seems to be something systematic.
I am aware that the symptoms are vague at best, and do not lend themselves to easy debugging, which is why I mostly tried to hack around them, but if I can't run small-result fifteen-second queries, my art has reached an end...