
m5 ran out of connections after openstack upgrade to "Pike"
Closed, Resolved (Public)

Description

At about 23:30 UTC on Jan 14, reports came in that wikitech was having database problems. m5 had run out of database connections again because of new OpenStack configuration following the upgrade to Pike.

It appears to be some new configuration around the neutron workers/sessions that is causing the problem.

Event Timeline

Bstorm created this task.

In the course of this, @JHedden restarted several services, which brought current connection usage back down to sane levels. I also set max_connections on the m5 master to 600 to give more breathing room for troubleshooting (note to @Marostegui and @jcrespo: I did that and don't intend to keep it that way).
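
For reference, bumping the limit was a plain runtime change on the master; a minimal sketch of the MariaDB statements involved (not a transcript of the actual session) looks like this:

-- On the m5 master (db1133): raise the connection ceiling at runtime.
-- SET GLOBAL does not persist across a restart, which fits the
-- "don't intend to keep it that way" intent above.
SET GLOBAL max_connections = 600;

-- Sanity-check the new limit and current usage.
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';

The same statement with a value of 500 is what reverts it later in this task.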

Previously it took around 10 hours to fill up the connections, so people who are not in convenient timezones should be able to get some sleep until we can get this resolved.

Connections are currently at 340 after the above actions, so we have some wiggle room.

Just after services died (reducing connections a bit), I saw this, so we know it is neutron that is the problem:

+-----------------+----------------+-------------+
| user            | host           | Connections |
+-----------------+----------------+-------------+
| designate       | 208.80.154.11  |          10 |
| designate       | 208.80.154.135 |          11 |
| designate       | All Hosts      |          21 |
| event_scheduler |                |           1 |
| event_scheduler | All Hosts      |           1 |
| glance          | 208.80.154.23  |           2 |
| glance          | All Hosts      |           2 |
| keystone        | 208.80.154.132 |          10 |
| keystone        | 208.80.154.23  |          10 |
| keystone        | All Hosts      |          20 |
| neutron         | 208.80.154.132 |         128 |
| neutron         | 208.80.154.23  |         153 |
| neutron         | All Hosts      |         281 |
| nova            | 208.80.154.132 |          72 |
| nova            | 208.80.154.23  |          13 |
| nova            | All Hosts      |          85 |
| repl            | 10.192.32.187  |           1 |
| repl            | 10.64.0.15     |           1 |
| repl            | All Hosts      |           2 |
| testreduce      | 10.64.48.94    |           2 |
| testreduce      | All Hosts      |           2 |
| watchdog        | 10.64.0.122    |          11 |
| watchdog        | All Hosts      |          11 |
| All Users       | All Hosts      |         425 |
+-----------------+----------------+-------------+
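
A breakdown like the one above can be pulled straight from information_schema. The following is only a sketch (not necessarily the exact query or tool used here); it groups connections by user and host and relies on WITH ROLLUP to produce the per-user and grand-total rows, which IFNULL then relabels as "All Hosts" / "All Users":

-- Connections per user and host, with ROLLUP subtotals.
-- PROCESSLIST stores host as "address:port", so the port is stripped before grouping.
SELECT
    IFNULL(p.user, 'All Users')                           AS user,
    IFNULL(SUBSTRING_INDEX(p.host, ':', 1), 'All Hosts')  AS host,
    COUNT(*)                                              AS Connections
FROM information_schema.PROCESSLIST AS p
GROUP BY p.user, SUBSTRING_INDEX(p.host, ':', 1) WITH ROLLUP;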

Thanks for the heads up, Brooke!

It is a bit scary that with each OpenStack upgrade we seem to be seeing these issues, along with the need for more open connections. Is that something OpenStack assumes (that with every new version we need to keep increasing connections on the database)?


Not exactly. We needed to increase it for the introduction of cells during a previous upgrade, but in this case it seems to mostly be configuration and code changes making some workers more aggressive than they were previously. I would not expect the next upgrade to require more connections unless we expand the services we offer...or some change in the OpenStack code requires us to restrict the API workers more aggressively, like this one did :)

After @Andrew merged that last change, it's looking a bit better.

+-----------------+----------------+-------------+
| user            | host           | Connections |
+-----------------+----------------+-------------+
| designate       | 208.80.154.11  |          10 |
| designate       | 208.80.154.135 |          11 |
| designate       | All Hosts      |          21 |
| event_scheduler |                |           1 |
| event_scheduler | All Hosts      |           1 |
| glance          | 208.80.154.23  |           3 |
| glance          | All Hosts      |           3 |
| keystone        | 208.80.154.132 |          10 |
| keystone        | 208.80.154.23  |          10 |
| keystone        | All Hosts      |          20 |
| neutron         | 208.80.154.132 |          77 |
| neutron         | 208.80.154.23  |          82 |
| neutron         | All Hosts      |         159 |
| nova            | 208.80.154.132 |          57 |
| nova            | 208.80.154.23  |          61 |
| nova            | All Hosts      |         118 |
| repl            | 10.192.32.187  |           1 |
| repl            | 10.64.0.15     |           1 |
| repl            | All Hosts      |           2 |
| testreduce      | 10.64.48.94    |           2 |
| testreduce      | All Hosts      |           2 |
| watchdog        | 10.64.0.122    |          11 |
| watchdog        | All Hosts      |          11 |
| wikiuser        | 208.80.155.109 |           1 |
| wikiuser        | All Hosts      |           1 |
| All Users       | All Hosts      |         338 |
+-----------------+----------------+-------------+

However, it took hours to ramp up yesterday. I'll see how it is doing at the end of my work day.

Still holding at 154 total neutron connections.
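
A count like that can be spot-checked with a simple query; this is a sketch, not necessarily how it was actually being watched:

-- Current number of open connections from the neutron database user.
SELECT COUNT(*) AS neutron_connections
FROM information_schema.PROCESSLIST
WHERE user = 'neutron';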

Neutron is now at 160, and things seem fairly stable. I'm going to reduce the max_connections again.

Mentioned in SAL (#wikimedia-operations) [2020-01-16T00:40:40Z] <bstorm_> set max_connections on db1133 (m5-master) back to 500 since the neutron connections seem fairly stable now T242817

Bstorm claimed this task.

Neutron has 159 open connections now. I think this is fixed for the time being.