We have recently observed a number of failures from Airflow jobs indicating that the maximum number of MariaDB client connections on the analytics_meta server had been reached.
E.g.
Dec 14 17:05:19 an-launcher1002 airflow-scheduler@analytics[20370]: MySQLdb._exceptions.OperationalError: (1040, 'Too many connections')
These seem to have occurred during particularly busy periods, when Presto, Spark, or Hive had been querying the Hive metastore at a higher-than-normal rate.
The following graph shows that over the past 24 hours the number of open connections to the MariaDB server is very close to the maximum.
The red arrows mark the times at which the errors were generated.
Switching the Y-axis to linear (instead of log10) highlights the recent growth in this value, but does not point to a specific cause for the increase in connections.
I propose that we increase the max_connections parameter for MariaDB from 250 to 350. This should act as a mitigation against further errors of this type, whilst we continue to investigate the precise cause of the increase in open connections.
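
To support the investigation, here is a rough Python sketch for checking how much connection headroom is left and which clients hold the open connections. It is only an illustration, not an existing tool: the function names (`connection_headroom`, `connections_by_client`, `fetch_live_values`) are hypothetical, and the connection parameters are left to the caller because the host and credentials are not stated here. It assumes the `MySQLdb` client library (the one that appears in the Airflow traceback above) and uses the standard `SHOW GLOBAL VARIABLES`, `SHOW GLOBAL STATUS` and `information_schema.PROCESSLIST` interfaces.

```python
from collections import Counter


def connection_headroom(threads_connected, max_connections):
    """Return (free_slots, utilisation) for the given counter values."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    free_slots = max_connections - threads_connected
    return free_slots, threads_connected / max_connections


def connections_by_client(processlist_rows):
    """Count open connections per (user, client host), busiest first.

    Each row is a (user, host) pair as read from
    information_schema.PROCESSLIST, where host looks like 'hostname:port'.
    """
    counts = Counter()
    for user, host in processlist_rows:
        # Drop the ephemeral source port so connections group by client host.
        client = (host or "").rsplit(":", 1)[0]
        counts[(user, client)] += 1
    return counts.most_common()


def fetch_live_values(**connect_kwargs):
    """Query a live server; connect_kwargs are passed to MySQLdb.connect()."""
    import MySQLdb  # imported lazily so the helpers above work without it

    conn = MySQLdb.connect(**connect_kwargs)
    try:
        cur = conn.cursor()
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
        max_connections = int(cur.fetchone()[1])
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        threads_connected = int(cur.fetchone()[1])
        cur.execute("SELECT USER, HOST FROM information_schema.PROCESSLIST")
        rows = cur.fetchall()
    finally:
        conn.close()
    return threads_connected, max_connections, rows
```

Feeding the rows from `fetch_live_values()` into `connections_by_client()` should show whether one client (for example a particular Presto, Spark or Hive host) accounts for most of the growth. Note that `max_connections` is a dynamic variable, so `SET GLOBAL max_connections = 350` would apply the new limit without a restart, but the value would also need to be set in the server configuration to persist.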