This morning I responded to a full-stack alert: every few hours, nova has been failing to schedule new VM creations. This appears to be caused by nova running out of DB connections. From the nova-conductor logs:
2019-11-03 14:40:17.386 27023 ERROR oslo_messaging.rpc.server OperationalError: (_mysql_exceptions.OperationalError) (1226, "User 'nova' has exceeded the 'max_user_connections' resource (current value: 100)")
I addressed this issue last time by reducing the number of worker nodes (for several services) with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541407/
That improved things quite a bit, but apparently we still don't have enough headroom. We might be leaking connections, but more likely we just need to raise the limit.
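
To judge whether raising the limit is enough, a rough worst-case estimate may help. The sketch below assumes each worker process holds its own SQLAlchemy connection pool, so the ceiling for one service is hosts x workers x (pool size + overflow). Only the limit of 100 comes from the log above; the function names, service names, and all other numbers are hypothetical placeholders, not our actual configuration.

```python
# Rough sketch, not a measurement: estimates the worst-case number of DB
# connections opened by the 'nova' user, assuming one connection pool per
# worker process. All service figures below are hypothetical.

def worst_case_connections(hosts, workers_per_host, pool_size, max_overflow):
    """Upper bound on connections one service can open across all hosts."""
    return hosts * workers_per_host * (pool_size + max_overflow)


def headroom(limit, services):
    """Remaining connections under `limit` if every pool fills completely."""
    total = sum(worst_case_connections(**cfg) for cfg in services.values())
    return limit - total


# Hypothetical layout; substitute the real host and worker counts.
example_services = {
    "nova-conductor": dict(hosts=2, workers_per_host=4, pool_size=5, max_overflow=5),
    "nova-api": dict(hosts=2, workers_per_host=4, pool_size=5, max_overflow=5),
}

# max_user_connections is 100 per the log line above.
print(headroom(100, example_services))
```

A negative result would mean the configured pools alone can exceed the limit without any leak, which would point toward raising max_user_connections (or shrinking the pools) rather than hunting for leaked connections.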