Reports of global connection limit exhaustion for connections to labsdb1003
Closed, Declined · Public

Description

guc: "Error: Database error: Unable to connect to meta_p"
Steps to reproduce: browse to https://tools.wmflabs.org/guc/?user=119.160.118.126

Jeff_G created this task.Nov 7 2017, 1:10 PM
Restricted Application added a subscriber: Aklapper.Nov 7 2017, 1:10 PM
Wurgl added a subscriber: Wurgl.Nov 7 2017, 2:43 PM

Same here:

MariaDB [dewiki_p]> show processlist;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 1040 (HY000): Too many connections
ERROR: Can't connect to the server

or later:

tools.persondata@tools-bastion-03:~$ mysql --user=s51412 --database=dewiki_p --host=dewiki.labsdb --password=<secret>
ERROR 1040 (08004): Too many connections
tools.persondata@tools-bastion-03:~$

As far as I can tell (with my limited knowledge), the maximum number of connections is 1000:

MariaDB [dewiki_p]> show variables like "max_connections";
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 1000  |
+-----------------+-------+

but this value seems to be too small.

Krinkle added a subscriber: Krinkle.

I can't reproduce this issue, but I assume this is a general infrastructure issue with the Wiki Replicas and/or Toolforge, and not GUC. @Cloud-Services: Feel free to close if the issue has been resolved since.

Wurgl added a comment.Nov 8 2017, 9:20 AM

In the web server log file of the tool wikihistory I see 2208 entries similar to the following:

2017-11-07 11:42:14: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Warning:  mysql_connect(): Too many connections in /mnt/nfs/labstore-secondary-tools-project/wikihistory/db.inc.php on line 7
2017-11-07 14:32:16: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Warning:  mysql_connect(): Too many connections in /mnt/nfs/labstore-secondary-tools-project/wikihistory/db.inc.php on line 7

About 140 of these 2208 entries have a date/time in the range 2017-10-12 15:30:25 … 2017-10-12 15:31:34; all other entries are from 7 November (2017-11-07).

I am sure other tool maintainers see a similar number of these errors, and I hope this offers a way to 'reproduce' the problem.
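
For other maintainers who want to compare numbers, the per-day tally described above can be sketched roughly as follows. This is only an illustration: it assumes log lines start with a `YYYY-MM-DD HH:MM:SS:` timestamp as in the two entries quoted above, and the names `count_too_many_connections` and `SAMPLE_LINES` are made up for this example.

```python
import re
from collections import Counter

# Timestamp prefix as seen in the quoted log entries, e.g.
# "2017-11-07 11:42:14: (mod_fastcgi.c.2673) FastCGI-stderr: ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}: ")

# The two entries quoted in this comment, used as sample input.
SAMPLE_LINES = [
    "2017-11-07 11:42:14: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Warning:  "
    "mysql_connect(): Too many connections in "
    "/mnt/nfs/labstore-secondary-tools-project/wikihistory/db.inc.php on line 7",
    "2017-11-07 14:32:16: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Warning:  "
    "mysql_connect(): Too many connections in "
    "/mnt/nfs/labstore-secondary-tools-project/wikihistory/db.inc.php on line 7",
]


def count_too_many_connections(lines):
    """Return a Counter mapping each date (YYYY-MM-DD) to the number of
    log lines on that date that mention "Too many connections"."""
    per_day = Counter()
    for line in lines:
        if "Too many connections" not in line:
            continue  # unrelated log entry
        match = TIMESTAMP_RE.match(line)
        if match:
            per_day[match.group(1)] += 1
    return per_day


if __name__ == "__main__":
    for day, count in sorted(count_too_many_connections(SAMPLE_LINES).items()):
        print(day, count)
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`; the function only iterates over lines, so it does not load the whole file into memory.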

bd808 added a subscriber: bd808.Nov 12 2017, 11:25 PM

We may be seeing issues on labsdb1003 due to all users of the *.labsdb service names currently being pinned to that host following the death of labsdb1001. The first fix for any tool would be to switch to the *.{analytics,web}.db.svc.eqiad.wmflabs service names which point to the new Wiki Replica cluster. See https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown for more information.
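
As a rough illustration of the rename, a tool that builds its host name from a wiki database name could map the old form to the new one along these lines. This is a sketch only: the function name `new_service_name` is hypothetical, it assumes the old name has the shape `<dbname>.labsdb` (as in `dewiki.labsdb` earlier in this task), and it leaves the choice between the `analytics` and `web` clusters to the caller. Names such as `tools.labsdb`, which is listed on a different host in this task, are outside its scope.

```python
OLD_SUFFIX = ".labsdb"
NEW_CLUSTERS = ("analytics", "web")


def new_service_name(old_host, cluster):
    """Map '<dbname>.labsdb' to '<dbname>.<cluster>.db.svc.eqiad.wmflabs'.

    Hypothetical helper: it only rewrites the string and does not check
    that the resulting name resolves.
    """
    if cluster not in NEW_CLUSTERS:
        raise ValueError(f"cluster must be one of {NEW_CLUSTERS}, got {cluster!r}")
    if not old_host.endswith(OLD_SUFFIX) or old_host == OLD_SUFFIX:
        raise ValueError(f"not a '<dbname>{OLD_SUFFIX}' name: {old_host!r}")
    dbname = old_host[: -len(OLD_SUFFIX)]
    return f"{dbname}.{cluster}.db.svc.eqiad.wmflabs"


if __name__ == "__main__":
    print(new_service_name("dewiki.labsdb", "web"))
```

The wikitech page linked above remains the authoritative description of the new names; this helper just spells out the pattern.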

For GUC specifically, @Krinkle is blocked on T176886: Update meta_p database for new service names for the conversion to the new servers.

bd808 renamed this task from guc: "Error: Database error: Unable to connect to meta_p" to Reports of global connection limit exhaustion for connections to labsdb1003.Nov 12 2017, 11:26 PM
bd808 updated the task description.
Dispenser added a subscriber: Dispenser.EditedNov 13 2017, 12:12 AM
SELECT @@hostname, @@version, @@max_user_connections, @@max_connections, "Notes       ";
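+------------+-----------------+----------------------+-----------------+--------------+
| hostname   | version         | max_user_connections | max_connections | Notes        |
+------------+-----------------+----------------------+-----------------+--------------+
| labsdb1001 | 10.0.22-MariaDB | 0                    | 1000            | Disk failure |
| labsdb1003 | 10.0.22-MariaDB | 0                    | 1000            | Labsdb       |
| labsdb1005 | 10.0.31-MariaDB | 0                    | 1024            | tools.labsdb |
| labsdb1009 | 10.1.28-MariaDB | 10                   | 1024            | analytics    |
| labsdb1011 | 10.1.28-MariaDB | 10                   | 1024            | web          |
+------------+-----------------+----------------------+-----------------+--------------+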

Toolserver had max_user_connections set to 15. Until June 2017, Firefox had 8-connection pipelining, so two users could saturate all of a tool's database connections. The new HTTP/2 standard uses no fewer than 100 streams, so eleven users could saturate all connections on the database server if there were no per-user DB limit.
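
The arithmetic behind those two figures can be checked with a few lines. This is a worst-case sketch, assuming every client opens its full number of parallel requests and each request holds one database connection; the function name `clients_to_exceed` is made up for this example.

```python
def clients_to_exceed(limit, connections_per_client):
    """Smallest number of clients whose combined connections exceed `limit`,
    assuming each client holds `connections_per_client` connections."""
    if limit < 0 or connections_per_client <= 0:
        raise ValueError("limit must be >= 0 and connections_per_client > 0")
    return limit // connections_per_client + 1


if __name__ == "__main__":
    # Per-user limit of 15, 8 connections per browser.
    print(clients_to_exceed(15, 8))      # 2
    # Server-wide limit of 1000, 100 streams per client.
    print(clients_to_exceed(1000, 100))  # 11
```

Ten clients at 100 streams each would fill the 1000 slots exactly; the eleventh is the first to be refused.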

At this exact point in time, the connections on various servers seem to be well below the absolute limits:

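+------------+-------------------+
| hostname   | threads_connected |
+------------+-------------------+
| labsdb1001 | 5                 |
| labsdb1003 | 77                |
| labsdb1009 | 12                |
| labsdb1010 | 4                 |
| labsdb1011 | 14                |
+------------+-------------------+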

labsdb1003, however, does show a max_used_connections value of 1001 (one more than its max_connections of 1000) at some point since the last restart.

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.Nov 13 2017, 12:47 AM
Krinkle removed a subscriber: Krinkle.Nov 13 2017, 8:23 PM
bd808 closed this task as Declined.Jan 3 2018, 10:19 PM

Server is being decommissioned.