
Increase m5 database connection limit for 'nova' database
Open, Needs Triage, Public

Description

For quite some time nova has been pushing up against the connection limit on m5. And we're about to add a few things:

We're about to add a new (required) service, which may increase the number of worker connections yet again. If we could get the limit boosted to 1.5 or 2x the current number, we'd have room to breathe and wouldn't have to spend time balancing worker count against connection limits.

(And I concede that it's weird that nova needs so many connections, but short of redesigning nova entirely I think we need to live with it. It sounds like most other nova deployments use 10x the connection limit we're running with.)

Event Timeline

Andrew created this task. Mon, Nov 25, 9:45 PM

We might be limiting user accounts in a different way, but the grant assigned to the nova user does not enforce any connection limit:

GRANT ALL PRIVILEGES ON `nova`.* TO 'nova'@'%'
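
The GRANT above only controls privileges; the per-account connection cap lives in the account's resource limits. A quick way to check them (a sketch, assuming the standard mysql.user columns):

-- 0 in max_user_connections means the account has no per-account cap
SELECT User, Host, max_user_connections
FROM mysql.user
WHERE User = 'nova';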

It does look like we're close to reaching max_connections though:

SHOW VARIABLES LIKE "max_connections";
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 500   |
+-----------------+-------+
1 row in set (0.00 sec)
select count(*) from information_schema.processlist;
+----------+
| count(*) |
+----------+
|      427 |
+----------+
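
To see where those connections come from, the processlist can be grouped by user and client host (a sketch; the host column includes the client port, hence the SUBSTRING_INDEX):

-- Count current connections per user and client address
SELECT user, SUBSTRING_INDEX(host, ':', 1) AS client, COUNT(*) AS conns
FROM information_schema.processlist
GROUP BY user, client
ORDER BY conns DESC;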

The 30-day average number of connections is 421, up slightly from the 90-day average of 379. The increase is likely due to the high-availability work, where we configured the OpenStack services as active/active.

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1133&var-port=9104&from=1572130803792&to=1574722803792
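
Besides Grafana, the server's own high-water mark is a useful cross-check against those averages (a sketch using the standard status and system variables):

-- Highest number of simultaneous connections since the last restart
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
-- The configured ceiling it is measured against
SHOW GLOBAL VARIABLES LIKE 'max_connections';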

I believe we are only limiting connections for nova@208.80.154.132, nova@208.80.154.23 and nova@208.80.154.92 (I don't know whether that was intentional or a mistake, or whether we should limit nova connections from every host).
We should probably also limit the following accounts:

+------+------------------------+----------------------+
| user | host                   | max_user_connections |
+------+------------------------+----------------------+
| nova | labnet1001.eqiad.wmnet |                    0 |
| nova | labnet1002.eqiad.wmnet |                    0 |
| nova | 10.64.20.13            |                    0 |
| nova | 10.64.20.25            |                    0 |
| nova | %                      |                    0 |
+------+------------------------+----------------------+
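
If we do decide to cap those accounts as well, the usual mechanism is a resource-limit clause on the account; a minimal sketch (the value 100 is illustrative, not a decision on the actual number):

-- GRANT USAGE adds no privileges; it only attaches the resource limit to the account
GRANT USAGE ON *.* TO 'nova'@'labnet1001.eqiad.wmnet' WITH MAX_USER_CONNECTIONS 100;
GRANT USAGE ON *.* TO 'nova'@'10.64.20.13' WITH MAX_USER_CONNECTIONS 100;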

OK, we're limiting the nova user to 100 connections from each of the cloud controllers 208.80.154.23, 208.80.154.132.

We're very close to hitting this connection limit today, and the new version of OpenStack introduces a new placement service that will open additional connections as the nova user.

select count(host) from information_schema.processlist where host like '208.80.154.23%' and user = 'nova';
+-------------+
| count(host) |
+-------------+
|          94 |
+-------------+
select count(host) from information_schema.processlist where host like '208.80.154.132%' and user = 'nova';
+-------------+
| count(host) |
+-------------+
|          91 |
+-------------+

I'm not sure about 208.80.154.92, I think it might be an old controller address that is no longer used.
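
One way to sanity-check whether 208.80.154.92 still opens connections at all (this only sees current sessions, so an empty result is suggestive rather than conclusive):

select count(*) from information_schema.processlist where user = 'nova' and host like '208.80.154.92%';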

Was that close call something temporary? i.e., an unexpected spike?

We do have more powerful servers and we can throw some more connections at this. But my impression from the last conversations we had about nova is that it keeps requesting more and more connections with every new version, and I'm afraid that is not a scalable model.

I know that Andrew has been tuning things (thanks!) to reduce the number of connections needed (or at least opened and then left idle), but we really need to think about a longer-term solution, as continually increasing connections isn't sustainable.
For now, I could increase the limit from 100 to 120 for those two hosts, and we should probably do the same for the other IPs as well. How does that sound?
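
If that sounds good, the bump itself is one statement per account; a sketch with the proposed value of 120 for the two controller addresses:

-- Only affects new connections; existing sessions are left alone
GRANT USAGE ON *.* TO 'nova'@'208.80.154.23' WITH MAX_USER_CONNECTIONS 120;
GRANT USAGE ON *.* TO 'nova'@'208.80.154.132' WITH MAX_USER_CONNECTIONS 120;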

Regarding 208.80.154.92 possibly being an old controller address that is no longer used: if you can confirm that the IP isn't in use on your side, we can delete that grant as well.

Mentioned in SAL (#wikimedia-operations) [2019-12-01T21:39:30Z] <andrewbogott> restarted nova conductor and api on cloudcontrol1003 and 1004 to free up db connections (T239168)