Page MenuHomePhabricator

Increase max_connections for MariaDB on an-coord hosts
Closed, ResolvedPublic

Description

We have recently observed a number of failures from Airflow jobs indicating that the maximum number of MariaDB client connections on the analytics_meta server had been reached.

E.g.

Dec 14 17:05:19 an-launcher1002 airflow-scheduler@analytics[20370]: MySQLdb._exceptions.OperationalError: (1040, 'Too many connections')

These seem to have occurred during particularly busy periods, where either Presto or Spark or Hive had been querying the hive metastore at a greater than normal rate.

The following graph shows that over the past 24 hours the number of open connections to the MariaDB server is very close to the maximum.

image.png (1×3 px, 291 KB)

Red arrows correlate with when the errors were generated.

Sitching the Y-axis to linear (instead of log10) highlights the recent growth in this value, but does not point to a specific cause for this increase in connections.

image.png (1×3 px, 339 KB)

I propose that we increase the max_connecions parameter for MariaDB, from 250 to 350. This should act as a mitigation against furthe errors of this type, whilst we continue to investigat the precise cause of the increase in open connections.

Event Timeline

Change 868373 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase max_connections on analytics_meta MariaDB

https://gerrit.wikimedia.org/r/868373

Increased the value at runtime:

btullis@an-coord1001:~$ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 253447356
Server version: 10.4.18-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SELECT @@GLOBAL.MAX_CONNECTIONS;
+--------------------------+
| @@GLOBAL.MAX_CONNECTIONS |
+--------------------------+
|                      250 |
+--------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> SET GLOBAL max_connections=350;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> SELECT @@GLOBAL.MAX_CONNECTIONS;
+--------------------------+
| @@GLOBAL.MAX_CONNECTIONS |
+--------------------------+
|                      350 |
+--------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]>

Confirmed that the increase in max_connections is visible in Grafana too.

image.png (1×3 px, 258 KB)

I will follow up with the same runtime change on an-coord1002 and merge the Puppet CR: https://gerrit.wikimedia.org/r/868373

Changed the value on an-coord1002

btullis@an-coord1002:~$ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 5814861
Server version: 10.4.18-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SELECT @@GLOBAL.MAX_CONNECTIONS;
+--------------------------+
| @@GLOBAL.MAX_CONNECTIONS |
+--------------------------+
|                      250 |
+--------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> SET GLOBAL max_connections=350;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> SELECT @@GLOBAL.MAX_CONNECTIONS;
+--------------------------+
| @@GLOBAL.MAX_CONNECTIONS |
+--------------------------+
|                      350 |
+--------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]>

Change 868373 merged by Btullis:

[operations/puppet@production] Increase max_connections on analytics_meta MariaDB

https://gerrit.wikimedia.org/r/868373

Merged and applied the puppet config change on both an-coord100[1-2]

Notice: /Stage[main]/Mariadb::Config/File[/etc/my.cnf]/content: 
--- /etc/my.cnf	2022-10-04 10:57:48.163740272 +0000
+++ /tmp/puppet-file20221215-5559-5wc41	2022-12-15 12:20:55.922750256 +0000
@@ -58,7 +58,7 @@
 # If running as replica, do not replicate the following databases
 replicate_wild_ignore_table=superset\_staging.%
 
-max_connections = 250
+max_connections = 350
 
 [mysqldump]
 

Info: Computing checksum on file /etc/my.cnf
Info: /Stage[main]/Mariadb::Config/File[/etc/my.cnf]: Filebucketed /etc/my.cnf to puppet with sum 3c4b92bf46568590a56b1c0e31d0ee0b

Now monitoring for any unusual behaviour.