
nova-conductor running out of mysql connections
Closed, Resolved · Public

Description

In responding to an alert from the nova-fullstack agent, I see that nova-conductor has been failing:

2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/strategies.py", line 97, in connect
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server     return dialect.connect(*cargs, **cparams)
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 385, in connect
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server     return self.dbapi.connect(*cargs, **cparams)
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/MySQLdb/__init__.py", line 81, in Connect
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server     return Connection(*args, **kwargs)
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 204, in __init__
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server     super(Connection, self).__init__(*args, **kwargs2)
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server OperationalError: (_mysql_exceptions.OperationalError) (1226, "User 'nova' has exceeded the 'max_user_connections' resource (current value: 100)")
2019-10-08 02:53:39.852 14301 ERROR oslo_messaging.rpc.server

This happened earlier today when we upgraded to Newton; we cleared all the connections but now it's run out again. We need to reduce the number of conductor workers, increase the number of allowed connections, or find a leak.
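The trade-off between worker count and the connection limit can be estimated with some rough arithmetic. The sketch below assumes each conductor worker process holds its own SQLAlchemy connection pool, which can grow to the pool size plus the overflow allowance. Only the limit of 100 comes from the error above; the host, worker, and pool numbers are hypothetical placeholders, not the real deployment values.

```python
def worst_case_connections(hosts, services_per_host, workers_per_service,
                           pool_size, max_overflow):
    """Upper bound on simultaneous DB connections opened as one MySQL user.

    Assumes every worker process owns a separate pool that can grow to
    pool_size + max_overflow connections.
    """
    per_worker = pool_size + max_overflow
    return hosts * services_per_host * workers_per_service * per_worker


# From the error message: max_user_connections (current value: 100).
LIMIT = 100

# Hypothetical settings, for illustration only.
before = worst_case_connections(hosts=2, services_per_host=1,
                                workers_per_service=8,
                                pool_size=5, max_overflow=10)
after = worst_case_connections(hosts=2, services_per_host=1,
                               workers_per_service=4,
                               pool_size=5, max_overflow=5)

print("before:", before, "exceeds limit" if before > LIMIT else "fits")
print("after:", after, "exceeds limit" if after > LIMIT else "fits")
```

This is only an upper bound: idle workers may hold far fewer connections, so hitting the limit in practice depends on load as well as configuration.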

Event Timeline

Andrew created this task. · Tue, Oct 8, 3:02 AM
Restricted Application added a subscriber: Aklapper. · Tue, Oct 8, 3:02 AM
Andrew triaged this task as High priority. · Tue, Oct 8, 3:02 AM

Mentioned in SAL (#wikimedia-operations) [2019-10-08T03:03:59Z] <andrewbogott> restarted nova-conductor on cloudcontrol1003 and cloudcontrol1004 — experimental band-aid for T234876

Andrew added a comment. · Tue, Oct 8, 3:08 AM

heh, the first forum post I found about this topic suggests raising the connection limit to 2000

Change 541407 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: try to reduce the number of db connections

https://gerrit.wikimedia.org/r/541407

Andrew added a comment. · Tue, Oct 8, 3:18 AM

I'm merging an experimental patch to reduce the number of connections needed. It's possible that this issue was caused by the Newton upgrade (and some change in behavior), but it could also be a result of us switching to an HA setup (if the connection limit on the db side is per user/database and not per host/user/database).
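To illustrate the HA hypothesis: MySQL counts `max_user_connections` per account (a user plus host pattern row), so connections from several hosts that match one wildcard account all count against the same limit. The sketch below is a simplified model of that accounting, not MySQL's actual matching code. The host names are the two cloudcontrol hosts mentioned above; the account definitions and the connection counts are hypothetical.

```python
from collections import Counter
from fnmatch import fnmatchcase


def match_account(user, client_host, accounts):
    """Return the first (user, host_pattern) account matching a connection.

    Simplified: exact host entries are tried before wildcard entries,
    which only approximates MySQL's most-specific-first ordering.
    """
    ordered = sorted(accounts, key=lambda acc: "%" in acc[1])
    for acc_user, pattern in ordered:
        # Translate the MySQL '%' wildcard into the fnmatch '*' wildcard.
        if acc_user == user and fnmatchcase(client_host, pattern.replace("%", "*")):
            return (acc_user, pattern)
    return None


def connections_per_account(connections, accounts):
    """Count open connections against the account each one is charged to."""
    return Counter(match_account(u, h, accounts) for u, h in connections)


# Hypothetical load: 60 connections from each of the two HA hosts.
connections = ([("nova", "cloudcontrol1003")] * 60
               + [("nova", "cloudcontrol1004")] * 60)

# One wildcard account: both hosts share a single limit.
shared = connections_per_account(connections, [("nova", "%")])

# One account per host: each host gets its own limit.
split = connections_per_account(
    connections,
    [("nova", "cloudcontrol1003"), ("nova", "cloudcontrol1004")],
)

print(dict(shared))
print(dict(split))
```

Under these assumptions, a single wildcard account would see 120 connections against a limit of 100, while per-host accounts would each see 60.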

Change 541407 merged by Andrew Bogott:
[operations/puppet@production] nova: try to reduce the number of db connections

https://gerrit.wikimedia.org/r/541407

I checked nova-fullstack this morning. Everything looks good. No leaks so far.

Andrew closed this task as Resolved. · Thu, Oct 10, 1:48 PM
Andrew claimed this task.