
m5-master overloaded by idle connections to the nova database
Closed, Resolved · Public

Description

At 05:07, dbproxy1005 marked db1009 as down as a result of a huge load spike:

https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519865294031&to=1519886894031&panelId=7&fullscreen

05:07 < icinga-wm> PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1

As db1009 looked to be up again, I reloaded dbproxy1005, and db1009 immediately started to get heavily pounded.

db1009 (the m5 master) has been overloaded since 06:10 AM, i.e. as soon as HAProxy was reloaded (it had marked db1009 as down, probably because of the same overload).

CPU:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864556811&to=1519886156812&panelId=7&fullscreen

Connections:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864572622&to=1519886172622&panelId=15&fullscreen

Disk utilization:
https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db1009&var-network=eth0&from=1519864586113&to=1519886186113&panelId=19&fullscreen

This is causing the server to reach "Too many connections". I am not sure if this is a cause or a consequence.
Most of the connections are to the nova database, and they are sleeping:

root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show processlist;" | grep nova | grep -i sleep | wc -l
233
root@neodymium:~# mysql --skip-ssl -hdb1009 -e "show global variables like 'max_connections'"
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 500   |
+-----------------+-------+
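A slightly more informative variant of the check above breaks the sleepers down per user. This is a sketch run on sample `show processlist` output (the ids and hosts below are made up, not real db1009 rows); against db1009 the input would come from the same `mysql --skip-ssl -hdb1009 -e "show processlist"` invocation.

```shell
# Count sleeping connections per user from whitespace-separated processlist
# output (columns: Id User Host db Command ...). Sample data for illustration.
printf '12 nova 10.64.20.13:5501 nova Sleep\n13 nova 10.64.20.13:5502 nova Sleep\n14 keystone 10.64.20.13:5503 keystone Query\n' \
  | awk '$5 == "Sleep" {n[$2]++} END {for (u in n) print u, n[u]}'
```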

Most of the queries are coming from:

  • labcontrol1001
  • labnet1001

HW doesn't look broken:

root@db1009:~# megacli  -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 1.633 TB
Sector Size         : 512
Mirror Data         : 1.633 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives per span:2
Span Depth          : 6
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU



root@db1009:~# megacli -PDList -aALL |grep "Firmware state:"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

There are some disks with errors, but those counters could be old:

root@db1009:~# megacli -PDList -aALL |grep "Media"
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 16
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 85
Media Type: Hard Disk Device
Media Error Count: 3
Media Type: Hard Disk Device
Media Error Count: 18
Media Type: Hard Disk Device
Media Error Count: 92
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 111
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device
Media Error Count: 120
Media Type: Hard Disk Device
Media Error Count: 0
Media Type: Hard Disk Device

Page that @madhuvishy and then @chasemp responded to:

https://lists.wikimedia.org/pipermail/cloud-admin-feed/2018-March/000007.html
https://lists.wikimedia.org/pipermail/cloud-admin-feed/2018-March/000008.html

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 1 2018, 6:40 AM

I am killing sleeping connections to the nova database in a screen on db1009 for now.
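The kill loop amounts to turning sleeping nova connections into KILL statements. The sketch below is an assumption about the approach, not the actual commands run in the screen session; on sample rows it just prints the statements, and against db1009 the input would be real processlist output with the result piped back into `mysql --skip-ssl -hdb1009`.

```shell
# Generate KILL statements for nova's sleeping connections from a
# processlist dump (columns: Id User Command). Sample ids for illustration.
printf '101 nova Sleep\n102 nova Sleep\n103 glance Query\n' \
  | awk '$2 == "nova" && $3 == "Sleep" {print "KILL " $1 ";"}'
```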

Marostegui triaged this task as High priority. Mar 1 2018, 6:44 AM
Marostegui added a project: Operations.
Marostegui updated the task description.
Marostegui updated the task description. Mar 1 2018, 6:48 AM
root@labcontrol1001:~# OS_TENANT_NAME=admin-monitoring openstack server list
+--------------------------------------+-----------------------+--------+---------------------+
| ID                                   | Name                  | Status | Networks            |
+--------------------------------------+-----------------------+--------+---------------------+
| 9327727c-01f7-4155-b6e1-9ceffb364d68 | fullstackd-1519884580 | BUILD  |                     |
| 7977cc27-df85-4436-9e5a-594633862b79 | fullstackd-1519883375 | ERROR  | public=10.68.19.84  |
| 01b857fd-14d9-45ae-874a-9c1873541e0e | fullstackd-1519882649 | ERROR  |                     |
| 36844d38-a06e-487f-932a-49f67a7a7521 | fullstackd-1519881445 | ERROR  | public=10.68.23.32  |
| dfeaf78e-b6bb-4ae9-b0c5-b1c5310bc29d | fullstackd-1519877813 | ERROR  | public=10.68.17.153 |
| 6412c7ef-5018-44e0-bd36-becea59b9d49 | fullstackd-1519875797 | ERROR  |                     |
| 752b68d7-a267-40e0-844c-46899e3fb81b | fullstackd-1519846097 | ERROR  |                     |
+--------------------------------------+-----------------------+--------+---------------------+

Things seem a lot better now since:

06:57 madhuvishy: Restart nova-conductor on labcontrol1001
06:59 chasemp: restart nova-api on labnet1001

I am killing sleeping connections to nova database in a screen on db1009 for now.

This was stopped at 06:56AM as @madhuvishy started to restart services.

@madhuvishy restarted nova-conductor and I restarted nova-api shortly thereafter. The nova-conductor restart seems to have calmed things down. I restarted nova-api as it has a tendency to "get stuck" in weird states when other things cause failures.

It took some patience, but post-restart I cleaned up nova-fullstack's mess:

OS_TENANT_NAME=admin-monitoring openstack server list
OS_TENANT_NAME=admin-monitoring openstack server show 752b68d7-a267-40e0-844c-46899e3fb81b
OS_TENANT_NAME=admin-monitoring openstack server list
OS_TENANT_NAME=admin-monitoring openstack server show 752b68d7-a267-40e0-844c-46899e3fb81b
OS_TENANT_NAME=admin-monitoring openstack server show 7977cc27-df85-4436-9e5a-594633862b79
OS_TENANT_NAME=admin-monitoring openstack server list
nova delete --force 7977cc27-df85-4436-9e5a-594633862b79
nova delete 7977cc27-df85-4436-9e5a-594633862b79
OS_TENANT_NAME=admin-monitoring openstack server list
nova delete dfeaf78e-b6bb-4ae9-b0c5-b1c5310bc29d
nova delete 36844d38-a06e-487f-932a-49f67a7a7521
OS_TENANT_NAME=admin-monitoring openstack server lis
OS_TENANT_NAME=admin-monitoring openstack server list
nova delete 9327727c-01f7-4155-b6e1-9ceffb364d68
OS_TENANT_NAME=admin-monitoring openstack server list
nova --reset-state 01b857fd-14d9-45ae-874a-9c1873541e0e
OS_TENANT_NAME=admin-monitoring openstack server list
nova reset-state --active 01b857fd-14d9-45ae-874a-9c1873541e0e
OS_TENANT_NAME=admin-monitoring openstack server list
nova delete 01b857fd-14d9-45ae-874a-9c1873541e0e
nova reset-state --active 6412c7ef-5018-44e0-bd36-becea59b9d49
OS_TENANT_NAME=admin-monitoring openstack server list
nova reset-state --active 752b68d7-a267-40e0-844c-46899e3fb81b
OS_TENANT_NAME=admin-monitoring openstack server list
nova delete 6412c7ef-5018-44e0-bd36-becea59b9d49
nova delete 752b68d7-a267-40e0-844c-46899e3fb81b
OS_TENANT_NAME=admin-monitoring openstack server list
Marostegui lowered the priority of this task from High to Normal. Mar 1 2018, 7:18 AM

Decreasing the task back to Normal priority as things look stable and leaving it open as per:

˜/chasemp 8:16> marostegui: no worries and let's leave it open till monday because something has to be found or changed, want to talk to andrew
Marostegui moved this task from Triage to In progress on the DBA board. Mar 1 2018, 7:19 AM
root@labcontrol1001:~# OS_TENANT_NAME=contintcloud openstack server list
+--------------------------------------+----------------------------+--------+---------------------+
| ID                                   | Name                       | Status | Networks            |
+--------------------------------------+----------------------------+--------+---------------------+
| a3242124-c5c0-407c-a517-6a074e53727a | ci-jessie-wikimedia-981681 | BUILD  |                     |
| 19147c8b-f062-4d97-8df1-376189fd09cc | ci-jessie-wikimedia-981679 | ACTIVE | public=10.68.21.158 |
| 763bfe0b-3e07-4930-a380-904f8159df50 | ci-jessie-wikimedia-981677 | ACTIVE | public=10.68.18.255 |
| bb9f2a5a-1abc-42fd-8808-2b0ac809c85c | ci-jessie-wikimedia-981673 | ACTIVE | public=10.68.23.224 |
| a9924d02-99bd-4eab-bcd9-9b53ab43bb73 | ci-jessie-wikimedia-981558 | BUILD  |                     |
| 36bf1019-1cd0-4c07-afd2-f69b70116072 | ci-jessie-wikimedia-981552 | BUILD  |                     |
| 01e5e1d9-efa7-4a15-9ab2-b562164434a2 | ci-jessie-wikimedia-981547 | BUILD  |                     |
| b2e14820-4aed-479b-ae88-e6c325602df1 | ci-jessie-wikimedia-981546 | BUILD  |                     |
| e51d2261-9566-4093-afa1-ba83c45095da | ci-jessie-wikimedia-981545 | BUILD  |                     |
| c66fc2ae-82c4-40a1-9200-792bc7ed6604 | ci-jessie-wikimedia-981544 | BUILD  |                     |
| 5f293702-9ba8-4474-91ae-b299abd72fe4 | ci-jessie-wikimedia-981542 | ERROR  | public=10.68.18.24  |
| b0e32df9-b60c-44b6-89f3-74e6977b8f9f | ci-jessie-wikimedia-981539 | ERROR  |                     |
| 26ddb60d-a928-498e-8f45-920c28e4afe4 | ci-jessie-wikimedia-981531 | ERROR  | public=10.68.22.217 |
| 5545f305-2f8c-4ed0-8b79-16b34039132f | ci-jessie-wikimedia-981528 | ERROR  | public=10.68.21.226 |
| f0048bf6-2755-4d18-b5e7-abdd2c5e0362 | ci-jessie-wikimedia-981527 | ERROR  | public=10.68.23.119 |
| 0d10c249-f8a5-4e44-b1da-20e83b791b3c | ci-jessie-wikimedia-981526 | ERROR  | public=10.68.19.175 |
| 9ce195fc-636d-4b12-a5c2-286a318d68fd | ci-jessie-wikimedia-981525 | BUILD  |                     |
| 3c0f9886-289a-44dd-9e2f-e26ec9210fd4 | ci-jessie-wikimedia-981519 | ERROR  | public=10.68.20.30  |
| 43886ab5-8cc9-44b4-930c-213c0b2be68a | ci-jessie-wikimedia-981518 | ERROR  | public=10.68.18.121 |
| c19770d3-40a3-4252-a7aa-eddbc94cb753 | ci-jessie-wikimedia-981512 | ERROR  | public=10.68.21.21  |
| 44b86f6e-be09-472a-bb05-93bc8f312a47 | ci-jessie-wikimedia-981511 | ERROR  | public=10.68.23.26  |
| d29a03fd-2245-40cf-8a3f-16f2bf343cc6 | ci-jessie-wikimedia-981510 | ERROR  | public=10.68.18.127 |
| 284ae1d1-7254-48ca-a665-0b5b6c78e502 | ci-jessie-wikimedia-981509 | ERROR  | public=10.68.18.62  |
| c189f21c-fde9-45b2-98ec-dbb0ff03515b | ci-jessie-wikimedia-981504 | ERROR  | public=10.68.20.36  |
+--------------------------------------+----------------------------+--------+---------------------+

Some logs from nova-conductor corresponding to the time of the incident; it doesn't seem like the root cause, but it correlates with the db spike: https://phabricator.wikimedia.org/P6770

This was warned about in advance in T188210.

The immediate solution is T183469

Andrew added a subscriber: Andrew. Mar 1 2018, 1:33 PM

Sorry I slept through this last night! I'm catching up. A few facts:

nova-api seems to connect directly to the database. Other than nova-api, nova-conductor is the service that marshals all nova database calls. So any nova service (other than the api) could've been misbehaving and it would manifest as nova-conductor misbehaving.

The db behavior of these two services can be configured in a few ways... here are our options:

[api_database]
connection = None 	(StrOpt) The SQLAlchemy connection string to use to connect to the Nova API database.
connection_debug = 0 	(IntOpt) Verbosity of SQL debugging information: 0=None, 100=Everything.
connection_trace = False 	(BoolOpt) Add Python stack traces to SQL as comment strings.
idle_timeout = 3600 	(IntOpt) Timeout before idle SQL connections are reaped.
max_overflow = None 	(IntOpt) If set, use this value for max_overflow with SQLAlchemy.
max_pool_size = None 	(IntOpt) Maximum number of SQL connections to keep open in a pool.
max_retries = 10 	(IntOpt) Maximum number of database connection retries during startup. Set to -1 to specify an infinite retry count.
mysql_sql_mode = TRADITIONAL 	(StrOpt) The SQL mode to be used for MySQL sessions. This option, including the default, overrides any server-set SQL mode. To use whatever SQL mode is set by the server configuration, set this to no value. Example: mysql_sql_mode=
pool_timeout = None 	(IntOpt) If set, use this value for pool_timeout with SQLAlchemy.
retry_interval = 10 	(IntOpt) Interval between retries of opening a SQL connection.
slave_connection = None 	(StrOpt) The SQLAlchemy connection string to use to connect to the slave database.
sqlite_synchronous = True 	(BoolOpt) If True, SQLite uses synchronous mode.


[database]
backend = sqlalchemy 	(StrOpt) The back end to use for the database.
connection = None 	(StrOpt) The SQLAlchemy connection string to use to connect to the database.
connection_debug = 0 	(IntOpt) Verbosity of SQL debugging information: 0=None, 100=Everything.
connection_trace = False 	(BoolOpt) Add Python stack traces to SQL as comment strings.
db_inc_retry_interval = True 	(BoolOpt) If True, increases the interval between retries of a database operation up to db_max_retry_interval.
db_max_retries = 20 	(IntOpt) Maximum retries in case of connection error or deadlock error before error is raised. Set to -1 to specify an infinite retry count.
db_max_retry_interval = 10 	(IntOpt) If db_inc_retry_interval is set, the maximum seconds between retries of a database operation.
db_retry_interval = 1 	(IntOpt) Seconds between retries of a database transaction.
idle_timeout = 3600 	(IntOpt) Timeout before idle SQL connections are reaped.
max_overflow = None 	(IntOpt) If set, use this value for max_overflow with SQLAlchemy.
max_pool_size = None 	(IntOpt) Maximum number of SQL connections to keep open in a pool.
max_retries = 10 	(IntOpt) Maximum number of database connection retries during startup. Set to -1 to specify an infinite retry count.
min_pool_size = 1 	(IntOpt) Minimum number of SQL connections to keep open in a pool.
mysql_sql_mode = TRADITIONAL 	(StrOpt) The SQL mode to be used for MySQL sessions. This option, including the default, overrides any server-set SQL mode. To use whatever SQL mode is set by the server configuration, set this to no value. Example: mysql_sql_mode=
pool_timeout = None 	(IntOpt) If set, use this value for pool_timeout with SQLAlchemy.
retry_interval = 10 	(IntOpt) Interval between retries of opening a SQL connection.
slave_connection = None 	(StrOpt) The SQLAlchemy connection string to use to connect to the slave database.
sqlite_db = oslo.sqlite 	(StrOpt) The file name to use with SQLite.
sqlite_synchronous = True 	(BoolOpt) If True, SQLite uses synchronous mode.
use_db_reconnect = False 	(BoolOpt) Enable the experimental use of database reconnect on connection lost.
use_tpool = False 	(BoolOpt) Enable the experimental use of thread pooling for all DB API calls

If the issue is too many connections, I suggest that we reduce idle_timeout, increase db_inc_retry_interval, and set max_pool_size to something other than 'None'. @jcrespo, what do you think? What would you suggest as a value for max_pool_size?
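For discussion, a sketch of what those knobs could look like in nova.conf. The values below are illustrative placeholders, not settings taken from the actual patch:

```ini
[database]
# Reap idle SQL connections after 5 minutes instead of the default hour.
idle_timeout = 300
# Cap the per-service pool instead of leaving it unset (unbounded growth).
max_pool_size = 10
# Allow short bursts above the pool cap.
max_overflow = 5
```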

The problem I see is that each OpenStack application (I think it is OpenStack-wide behavior) has its own pool of connections, which has some issues for our infrastructure: first, it "reserves" resources that cannot be used by other applications (fine on a dedicated database, but not on one shared with other services); and the pools add up: db1009 is mostly idle, yet there are 300 open connections.

Because we also do connection pooling on the server side, new connections may wait in a queue before being executed.
Also, if the server fails, connections are normally not killed immediately, even if the service is automatically failed over through a proxy.

All of this is secondary, I think; the main issue is to identify the causes of the overload I started seeing on T188210, for which I pinged you some days ago. Also, we should speed up the failover of db1009 to a newer host.

Later we can see what the best way of scaling this is, or whether we should separate some services onto their own MariaDB instance (if m5 starts to be highly loaded).

Adding @bd808 so he can give this the right priority for his team.

This was in my email twice from last night. I am suspicious of this cron, but unsure whether it is (part of) the cause or really an effect.

12:40 AM (7 hours ago) (my local time)

Cron <root@labcontrol1001> /usr/bin/mysql keystone -hm5-master.eqiad.wmnet -ukeystone -p<redacted> -e 'DELETE FROM token WHERE user_id="novaadmin" AND NOW() + INTERVAL 7 day > expires LIMIT 10000;'

ERROR 1040 (08004): Too many connections

Is this something that would be more safely done with keystone-manage token_flush or is that unrelated?

chasemp updated the task description. Mar 1 2018, 3:01 PM

I agree with Jaime here: it is key to find what is causing this overload.
Even though we have to replace this old host, we really need to find out what is causing the overload; otherwise a host replacement will certainly help, but eventually we will hit the same problem again.
Once the cause has been found, we can evaluate whether this needs its own set of resources or can live on m5 (if we manage to keep it under control).

bd808 renamed this task from "db1009 overloaded" to "db1009 overloaded by idle connections to the nova database". Mar 1 2018, 4:40 PM
jcrespo added a comment. Edited Mar 1 2018, 4:46 PM

"by idle connections to the nova database"

I don't think that is accurate: it is making things worse, but it is probably not the root cause. max_connections is 500, which is easily reached given the ~300 idle connections, but something else had to open the extra ~200 connections.

Mitigations could be put in place by establishing stricter per-user limits.
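On MariaDB, stricter per-user limits can be expressed as account resource limits. The statements below are an illustrative sketch (the account name and the value 100 are assumptions, not something that was applied here):

```sql
-- Cap concurrent connections for one account:
GRANT USAGE ON *.* TO 'nova'@'%' WITH MAX_USER_CONNECTIONS 100;
-- Or set a blanket per-account cap for accounts without an explicit limit:
SET GLOBAL max_user_connections = 100;
```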

Andrew added a comment. Mar 1 2018, 4:55 PM

Is this something that would be more safely done with keystone-manage token_flush or is that unrelated?

These are two different things. token_flush flushes expired tokens; the DELETE FROM command clears up tokens which have not yet expired but were known to be used as one-offs, so they are no longer useful.

Change 415619 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova.conf: adjust db pool settings for all services

https://gerrit.wikimedia.org/r/415619

bd808 added a comment. Mar 3 2018, 2:12 AM

Traffic to db1009 spiked like crazy again around 2018-03-03T00:30Z. This time things backed up so far that OpenStack started failing (first noticed via nodepool). Eventually @Andrew and @chasemp figured out that all three of silver, labweb1001, and labweb1002 were running a database backup job from cron.

I stopped puppet on the labwebs and cleaned out the cron there.

Marostegui closed this task as Resolved. Mar 14 2018, 6:52 AM

I am going to consider this resolved for now, as it hasn't happened again.
Thanks to everyone involved in getting this fixed!

Change 415619 merged by Andrew Bogott:
[operations/puppet@production] nova.conf: adjust db pool settings for all services

https://gerrit.wikimedia.org/r/415619

chasemp reopened this task as Open. Mar 16 2018, 2:24 PM

@Andrew tried to merge the change to allow nova to be more graceful and it didn't work out.

https://phabricator.wikimedia.org/P6853

To be fair, I had my doubts before about nova being the cause, and pileups due to backups seemed a more reasonable explanation. Also, no other instances that I could see have happened since a few weeks ago (and we now have a larger server backing m5). My suggestion is to close this without touching nova and reevaluate only if it happens again.

Andrew closed this task as Resolved. Mar 16 2018, 3:14 PM
Andrew claimed this task.

My suggestion is to close this without touching nova

Works for me!

Don't celebrate too hard yet, as it will increase the chances of the issue happening again :-D

jcrespo reopened this task as Open. Aug 20 2018, 12:44 PM
jcrespo added a comment. Edited Aug 20 2018, 12:47 PM
MariaDB [(none)]> select user, host, count(*) FROM information_schema.processlist GROUP BY USER, HOST;
+-----------------+----------------------+----------+
| user            | host                 | count(*) |
+-----------------+----------------------+----------+
| designate       | 208.80.154.135:37438 |        1 |
| designate       | 208.80.154.135:37440 |        1 |
| designate       | 208.80.155.117:33209 |        1 |
| designate       | 208.80.155.117:33210 |        1 |
| designate       | 208.80.155.117:33211 |        1 |
| designate       | 208.80.155.117:33212 |        1 |
| designate       | 208.80.155.117:33213 |        1 |
| designate       | 208.80.155.117:33339 |        1 |
| designate       | 208.80.155.117:33380 |        1 |
| designate       | 208.80.155.117:33408 |        1 |
| designate       | 208.80.155.117:33411 |        1 |
| designate       | 208.80.155.117:33428 |        1 |
| event_scheduler | localhost            |        1 |
| glance          | 208.80.154.23:33264  |        1 |
| glance          | 208.80.154.23:33630  |        1 |
| glance          | 208.80.154.23:33638  |        1 |
| glance          | 208.80.154.23:33650  |        1 |
| glance          | 208.80.154.23:33814  |        1 |
| glance          | 208.80.154.23:33880  |        1 |
| keystone        | 208.80.154.23:33070  |        1 |
| keystone        | 208.80.154.23:33072  |        1 |
| keystone        | 208.80.154.23:33074  |        1 |
| keystone        | 208.80.154.23:33076  |        1 |
| keystone        | 208.80.154.23:33078  |        1 |
| keystone        | 208.80.154.23:33080  |        1 |
| keystone        | 208.80.154.23:33082  |        1 |
| keystone        | 208.80.154.23:33084  |        1 |
| keystone        | 208.80.154.23:33086  |        1 |
| keystone        | 208.80.154.23:33088  |        1 |
| keystone        | 208.80.154.23:33090  |        1 |
| keystone        | 208.80.154.23:33092  |        1 |
| keystone        | 208.80.154.23:33094  |        1 |
| keystone        | 208.80.154.23:33098  |        1 |
| keystone        | 208.80.154.23:33100  |        1 |
| keystone        | 208.80.154.23:33102  |        1 |
| keystone        | 208.80.154.23:33104  |        1 |
| keystone        | 208.80.154.23:33106  |        1 |
| keystone        | 208.80.154.23:33108  |        1 |
| keystone        | 208.80.154.23:33110  |        1 |
| keystone        | 208.80.154.23:33112  |        1 |
| keystone        | 208.80.154.23:33114  |        1 |
| keystone        | 208.80.154.23:33116  |        1 |
| keystone        | 208.80.154.23:33118  |        1 |
| keystone        | 208.80.154.23:33120  |        1 |
| keystone        | 208.80.154.23:33122  |        1 |
| keystone        | 208.80.154.23:33124  |        1 |
| keystone        | 208.80.154.23:33126  |        1 |
| keystone        | 208.80.154.23:33128  |        1 |
| keystone        | 208.80.154.23:33130  |        1 |
| keystone        | 208.80.154.23:33134  |        1 |
| keystone        | 208.80.154.23:33136  |        1 |
| keystone        | 208.80.154.23:33138  |        1 |
| keystone        | 208.80.154.23:33140  |        1 |
| keystone        | 208.80.154.23:33142  |        1 |
| keystone        | 208.80.154.23:33144  |        1 |
| keystone        | 208.80.154.23:33146  |        1 |
| keystone        | 208.80.154.23:33148  |        1 |
| keystone        | 208.80.154.23:33150  |        1 |
| keystone        | 208.80.154.23:33152  |        1 |
| keystone        | 208.80.154.23:33154  |        1 |
| keystone        | 208.80.154.23:33156  |        1 |
| keystone        | 208.80.154.23:33158  |        1 |
| keystone        | 208.80.154.23:33160  |        1 |
| keystone        | 208.80.154.23:33162  |        1 |
| keystone        | 208.80.154.23:33164  |        1 |
| keystone        | 208.80.154.23:33166  |        1 |
| keystone        | 208.80.154.23:33168  |        1 |
| keystone        | 208.80.154.23:33170  |        1 |
| keystone        | 208.80.154.23:33172  |        1 |
| keystone        | 208.80.154.23:33174  |        1 |
| keystone        | 208.80.154.23:33176  |        1 |
| keystone        | 208.80.154.23:33178  |        1 |
| keystone        | 208.80.154.23:33180  |        1 |
| keystone        | 208.80.154.23:33182  |        1 |
| keystone        | 208.80.154.23:33184  |        1 |
| keystone        | 208.80.154.23:33186  |        1 |
| keystone        | 208.80.154.23:33188  |        1 |
| keystone        | 208.80.154.23:33190  |        1 |
| keystone        | 208.80.154.23:33192  |        1 |
| keystone        | 208.80.154.23:33198  |        1 |
| keystone        | 208.80.154.23:33200  |        1 |
| keystone        | 208.80.154.23:33202  |        1 |
| keystone        | 208.80.154.23:33204  |        1 |
| keystone        | 208.80.154.23:33206  |        1 |
| keystone        | 208.80.154.23:33208  |        1 |
| keystone        | 208.80.154.23:33210  |        1 |
| keystone        | 208.80.154.23:33212  |        1 |
| keystone        | 208.80.154.23:33214  |        1 |
| keystone        | 208.80.154.23:33218  |        1 |
| keystone        | 208.80.154.23:33220  |        1 |
| keystone        | 208.80.154.23:33244  |        1 |
| keystone        | 208.80.154.23:33246  |        1 |
| keystone        | 208.80.154.23:33248  |        1 |
| keystone        | 208.80.154.23:33250  |        1 |
| keystone        | 208.80.154.23:33254  |        1 |
| keystone        | 208.80.154.23:33256  |        1 |
| keystone        | 208.80.154.23:33258  |        1 |
| keystone        | 208.80.154.23:33270  |        1 |
| keystone        | 208.80.154.23:33272  |        1 |
| keystone        | 208.80.154.23:33288  |        1 |
| keystone        | 208.80.154.23:33290  |        1 |
| keystone        | 208.80.154.23:33292  |        1 |
| keystone        | 208.80.154.23:33294  |        1 |
| keystone        | 208.80.154.23:33300  |        1 |
| keystone        | 208.80.154.23:33302  |        1 |
| keystone        | 208.80.154.23:33306  |        1 |
| keystone        | 208.80.154.23:33308  |        1 |
| keystone        | 208.80.154.23:33314  |        1 |
| keystone        | 208.80.154.23:33316  |        1 |
| keystone        | 208.80.154.23:33318  |        1 |
| keystone        | 208.80.154.23:33320  |        1 |
| keystone        | 208.80.154.23:33322  |        1 |
| keystone        | 208.80.154.23:33326  |        1 |
| keystone        | 208.80.154.23:33328  |        1 |
| keystone        | 208.80.154.23:33330  |        1 |
| keystone        | 208.80.154.23:33332  |        1 |
| keystone        | 208.80.154.23:33342  |        1 |
| keystone        | 208.80.154.23:33344  |        1 |
| keystone        | 208.80.154.23:33346  |        1 |
| keystone        | 208.80.154.23:33348  |        1 |
| keystone        | 208.80.154.23:33354  |        1 |
| keystone        | 208.80.154.23:33356  |        1 |
| keystone        | 208.80.154.23:33360  |        1 |
| keystone        | 208.80.154.23:33362  |        1 |
| keystone        | 208.80.154.23:33364  |        1 |
| keystone        | 208.80.154.23:33366  |        1 |
| keystone        | 208.80.154.23:33368  |        1 |
| keystone        | 208.80.154.23:33372  |        1 |
| keystone        | 208.80.154.23:33378  |        1 |
| keystone        | 208.80.154.23:33380  |        1 |
| keystone        | 208.80.154.23:33386  |        1 |
| keystone        | 208.80.154.23:33388  |        1 |
| keystone        | 208.80.154.23:33392  |        1 |
| keystone        | 208.80.154.23:33394  |        1 |
| keystone        | 208.80.154.23:33396  |        1 |
| keystone        | 208.80.154.23:33398  |        1 |
| keystone        | 208.80.154.23:33400  |        1 |
| keystone        | 208.80.154.23:33406  |        1 |
| keystone        | 208.80.154.23:33408  |        1 |
| keystone        | 208.80.154.23:33410  |        1 |
| keystone        | 208.80.154.23:33412  |        1 |
| keystone        | 208.80.154.23:33414  |        1 |
| keystone        | 208.80.154.23:33418  |        1 |
| keystone        | 208.80.154.23:33420  |        1 |
| keystone        | 208.80.154.23:33426  |        1 |
| keystone        | 208.80.154.23:33428  |        1 |
| keystone        | 208.80.154.23:33432  |        1 |
| keystone        | 208.80.154.23:33434  |        1 |
| keystone        | 208.80.154.23:33440  |        1 |
| keystone        | 208.80.154.23:33442  |        1 |
| keystone        | 208.80.154.23:33444  |        1 |
| keystone        | 208.80.154.23:33446  |        1 |
| keystone        | 208.80.154.23:33454  |        1 |
| keystone        | 208.80.154.23:33456  |        1 |
| keystone        | 208.80.154.23:33458  |        1 |
| keystone        | 208.80.154.23:33460  |        1 |
| keystone        | 208.80.154.23:33466  |        1 |
| keystone        | 208.80.154.23:33468  |        1 |
| keystone        | 208.80.154.23:33472  |        1 |
| keystone        | 208.80.154.23:33474  |        1 |
| keystone        | 208.80.154.23:33482  |        1 |
| keystone        | 208.80.154.23:33484  |        1 |
| keystone        | 208.80.154.23:33506  |        1 |
| keystone        | 208.80.154.23:33508  |        1 |
| keystone        | 208.80.154.23:33514  |        1 |
| keystone        | 208.80.154.23:33516  |        1 |
| keystone        | 208.80.154.23:33532  |        1 |
| keystone        | 208.80.154.23:33534  |        1 |
| keystone        | 208.80.154.23:33546  |        1 |
| keystone        | 208.80.154.23:33548  |        1 |
| keystone        | 208.80.154.23:33558  |        1 |
| keystone        | 208.80.154.23:33560  |        1 |
| keystone        | 208.80.154.23:33562  |        1 |
| keystone        | 208.80.154.23:33566  |        1 |
| keystone        | 208.80.154.23:33568  |        1 |
| keystone        | 208.80.154.23:33580  |        1 |
| keystone        | 208.80.154.23:33582  |        1 |
| keystone        | 208.80.154.23:33590  |        1 |
| keystone        | 208.80.154.23:33592  |        1 |
| keystone        | 208.80.154.23:33614  |        1 |
| keystone        | 208.80.154.23:33616  |        1 |
| keystone        | 208.80.154.23:33662  |        1 |
| keystone        | 208.80.154.23:33664  |        1 |
| keystone        | 208.80.154.23:33694  |        1 |
| keystone        | 208.80.154.23:33696  |        1 |
| keystone        | 208.80.154.23:33726  |        1 |
| keystone        | 208.80.154.23:33728  |        1 |
| keystone        | 208.80.154.23:33756  |        1 |
| keystone        | 208.80.154.23:33758  |        1 |
| keystone        | 208.80.154.23:33768  |        1 |
| keystone        | 208.80.154.23:33770  |        1 |
| keystone        | 208.80.154.23:33782  |        1 |
| keystone        | 208.80.154.23:33784  |        1 |
| keystone        | 208.80.154.23:33816  |        1 |
| keystone        | 208.80.154.23:33818  |        1 |
| keystone        | 208.80.154.23:33866  |        1 |
| keystone        | 208.80.154.23:33868  |        1 |
| keystone        | 208.80.154.23:33884  |        1 |
| keystone        | 208.80.154.23:33886  |        1 |
| keystone        | 208.80.154.23:33904  |        1 |
| keystone        | 208.80.154.23:33906  |        1 |
| keystone        | 208.80.154.23:33908  |        1 |
| keystone        | 208.80.154.23:33922  |        1 |
| keystone        | 208.80.154.23:33924  |        1 |
| keystone        | 208.80.154.23:33928  |        1 |
| keystone        | 208.80.154.23:33930  |        1 |
| neutron         | 208.80.154.132:47544 |        1 |
| neutron         | 208.80.154.23:33194  |        1 |
| neutron         | 208.80.154.23:33222  |        1 |
| neutron         | 208.80.154.23:33224  |        1 |
| neutron         | 208.80.154.23:33226  |        1 |
| neutron         | 208.80.154.23:33228  |        1 |
| neutron         | 208.80.154.23:33230  |        1 |
| neutron         | 208.80.154.23:33232  |        1 |
| neutron         | 208.80.154.23:33234  |        1 |
| neutron         | 208.80.154.23:33236  |        1 |
| neutron         | 208.80.154.23:33238  |        1 |
| neutron         | 208.80.154.23:33240  |        1 |
| neutron         | 208.80.154.23:33242  |        1 |
| neutron         | 208.80.154.23:33266  |        1 |
| neutron         | 208.80.154.23:33268  |        1 |
| neutron         | 208.80.154.23:33276  |        1 |
| neutron         | 208.80.154.23:33888  |        1 |
| neutron         | 208.80.154.23:33890  |        1 |
| neutron         | 208.80.154.23:33894  |        1 |
| neutron         | 208.80.154.23:33896  |        1 |
| neutron         | 208.80.154.23:33898  |        1 |
| neutron         | 208.80.154.23:33900  |        1 |
| neutron         | 208.80.154.23:33918  |        1 |
| neutron         | 208.80.154.23:33920  |        1 |
| neutron         | 208.80.154.23:33926  |        1 |
| neutron         | 208.80.154.23:33932  |        1 |
| neutron         | 208.80.154.23:33934  |        1 |
| neutron         | 208.80.154.23:33938  |        1 |
| nodepool        | 10.64.20.18:47378    |        1 |
| nodepool        | 10.64.20.18:47382    |        1 |
| nodepool        | 10.64.20.18:47384    |        1 |
| nodepool        | 10.64.20.18:47388    |        1 |
| nodepool        | 10.64.20.18:47390    |        1 |
| nova            | 10.64.20.13:47184    |        1 |
| nova            | 10.64.20.13:47186    |        1 |
| nova            | 10.64.20.13:47188    |        1 |
| nova            | 10.64.20.13:47190    |        1 |
| nova            | 10.64.20.13:47192    |        1 |
| nova            | 10.64.20.13:47194    |        1 |
| nova            | 10.64.20.13:47196    |        1 |
| nova            | 10.64.20.13:47198    |        1 |
| nova            | 10.64.20.13:47200    |        1 |
| nova            | 10.64.20.13:47202    |        1 |
| nova            | 10.64.20.13:47206    |        1 |
| nova            | 10.64.20.13:47208    |        1 |
| nova            | 10.64.20.13:47210    |        1 |
| nova            | 10.64.20.13:47212    |        1 |
| nova            | 10.64.20.13:47214    |        1 |
| nova            | 10.64.20.13:47216    |        1 |
| nova            | 10.64.20.13:47218    |        1 |
| nova            | 10.64.20.13:47220    |        1 |
| nova            | 10.64.20.13:47222    |        1 |
| nova            | 10.64.20.13:47224    |        1 |
| nova            | 10.64.20.13:47226    |        1 |
| nova            | 10.64.20.13:47228    |        1 |
| nova            | 10.64.20.13:47230    |        1 |
| nova            | 10.64.20.13:47232    |        1 |
| nova            | 10.64.20.13:47234    |        1 |
| nova            | 10.64.20.13:47236    |        1 |
| nova            | 10.64.20.13:47238    |        1 |
| nova            | 10.64.20.13:47240    |        1 |
| nova            | 10.64.20.13:47242    |        1 |
| nova            | 10.64.20.13:47244    |        1 |
| nova            | 10.64.20.13:47246    |        1 |
| nova            | 10.64.20.13:47248    |        1 |
| nova            | 10.64.20.13:47250    |        1 |
| nova            | 10.64.20.13:47252    |        1 |
| nova            | 10.64.20.13:47254    |        1 |
| nova            | 10.64.20.13:47256    |        1 |
| nova            | 10.64.20.13:47258    |        1 |
| nova            | 10.64.20.13:47260    |        1 |
| nova            | 10.64.20.13:47262    |        1 |
| nova            | 10.64.20.13:47264    |        1 |
| nova            | 10.64.20.13:47266    |        1 |
| nova            | 10.64.20.13:47268    |        1 |
| nova            | 10.64.20.13:47270    |        1 |
| nova            | 10.64.20.13:47272    |        1 |
| nova            | 10.64.20.13:47274    |        1 |
| nova            | 10.64.20.13:47276    |        1 |
| nova            | 10.64.20.13:47278    |        1 |
| nova            | 10.64.20.13:47280    |        1 |
| nova            | 10.64.20.13:47282    |        1 |
| nova            | 10.64.20.13:47284    |        1 |
| nova            | 10.64.20.13:47286    |        1 |
| nova            | 10.64.20.13:47288    |        1 |
| nova            | 10.64.20.13:47290    |        1 |
| nova            | 10.64.20.13:47292    |        1 |
| nova            | 10.64.20.13:47294    |        1 |
| nova            | 10.64.20.13:47298    |        1 |
| nova            | 10.64.20.13:47302    |        1 |
| nova            | 10.64.20.13:47304    |        1 |
| nova            | 10.64.20.13:47306    |        1 |
| nova            | 10.64.20.13:47308    |        1 |
| nova            | 10.64.20.13:47310    |        1 |
| nova            | 10.64.20.13:47312    |        1 |
| nova            | 10.64.20.13:47314    |        1 |
| nova            | 10.64.20.13:47316    |        1 |
| nova            | 10.64.20.13:47318    |        1 |
| nova            | 10.64.20.13:47320    |        1 |
| nova            | 10.64.20.13:47322    |        1 |
| nova            | 10.64.20.13:47324    |        1 |
| nova            | 10.64.20.13:47326    |        1 |
| nova            | 10.64.20.13:47328    |        1 |
| nova            | 10.64.20.13:47330    |        1 |
| nova            | 10.64.20.13:47332    |        1 |
| nova            | 10.64.20.13:47346    |        1 |
| nova            | 208.80.154.92:33273  |        1 |
| nova            | 208.80.154.92:33274  |        1 |
| nova            | 208.80.154.92:33275  |        1 |
| nova            | 208.80.154.92:33276  |        1 |
| nova            | 208.80.154.92:33277  |        1 |
| nova            | 208.80.154.92:33278  |        1 |
| nova            | 208.80.154.92:33279  |        1 |
| nova            | 208.80.154.92:33280  |        1 |
| nova            | 208.80.154.92:33281  |        1 |
| nova            | 208.80.154.92:33282  |        1 |
| nova            | 208.80.154.92:33283  |        1 |
| nova            | 208.80.154.92:33284  |        1 |
| nova            | 208.80.154.92:33285  |        1 |
| nova            | 208.80.154.92:33286  |        1 |
| nova            | 208.80.154.92:33287  |        1 |
| nova            | 208.80.154.92:33288  |        1 |
| nova            | 208.80.154.92:33289  |        1 |
| nova            | 208.80.154.92:33290  |        1 |
| nova            | 208.80.154.92:33291  |        1 |
| nova            | 208.80.154.92:33292  |        1 |
| nova            | 208.80.154.92:33293  |        1 |
| nova            | 208.80.154.92:33294  |        1 |
| nova            | 208.80.154.92:33295  |        1 |
| nova            | 208.80.154.92:33296  |        1 |
| nova            | 208.80.154.92:33297  |        1 |
| nova            | 208.80.154.92:33298  |        1 |
| nova            | 208.80.154.92:33299  |        1 |
| nova            | 208.80.154.92:33300  |        1 |
| nova            | 208.80.154.92:33301  |        1 |
| nova            | 208.80.154.92:33302  |        1 |
| nova            | 208.80.154.92:33303  |        1 |
| nova            | 208.80.154.92:33304  |        1 |
| nova            | 208.80.154.92:33305  |        1 |
| nova            | 208.80.154.92:33306  |        1 |
| nova            | 208.80.154.92:33307  |        1 |
| nova            | 208.80.154.92:33308  |        1 |
| nova            | 208.80.154.92:33309  |        1 |
| nova            | 208.80.154.92:33310  |        1 |
| nova            | 208.80.154.92:33311  |        1 |
| nova            | 208.80.154.92:33312  |        1 |
| nova            | 208.80.154.92:33313  |        1 |
| nova            | 208.80.154.92:33314  |        1 |
| nova            | 208.80.154.92:33315  |        1 |
| nova            | 208.80.154.92:33316  |        1 |
| nova            | 208.80.154.92:33317  |        1 |
| nova            | 208.80.154.92:33318  |        1 |
| nova            | 208.80.154.92:33319  |        1 |
| nova            | 208.80.154.92:33320  |        1 |
| nova            | 208.80.154.92:33321  |        1 |
| nova            | 208.80.154.92:33322  |        1 |
| nova            | 208.80.154.92:33323  |        1 |
| nova            | 208.80.154.92:33324  |        1 |
| nova            | 208.80.154.92:33325  |        1 |
| nova            | 208.80.154.92:33326  |        1 |
| nova            | 208.80.154.92:33327  |        1 |
| nova            | 208.80.154.92:33328  |        1 |
| nova            | 208.80.154.92:33329  |        1 |
| nova            | 208.80.154.92:33330  |        1 |
| nova            | 208.80.154.92:33331  |        1 |
| nova            | 208.80.154.92:33332  |        1 |
| nova            | 208.80.154.92:33333  |        1 |
| nova            | 208.80.154.92:33334  |        1 |
| nova            | 208.80.154.92:33335  |        1 |
| nova            | 208.80.154.92:33336  |        1 |
| nova            | 208.80.154.92:33345  |        1 |
| repl            | 10.192.32.8:50918    |        1 |
| repl            | 10.64.0.15:38196     |        1 |
| root            | localhost            |        2 |
| testreduce      | 10.64.16.151:48118   |        1 |
| testreduce      | 10.64.16.151:48120   |        1 |
| watchdog        | 10.64.0.122:36246    |        1 |
| watchdog        | 10.64.0.122:36256    |        1 |
| watchdog        | 10.64.0.122:36268    |        1 |
| watchdog        | 10.64.0.122:36282    |        1 |
| watchdog        | 10.64.0.122:36288    |        1 |
| watchdog        | 10.64.0.122:37054    |        1 |
| watchdog        | 10.64.0.122:40374    |        1 |
| watchdog        | 10.64.0.122:43122    |        1 |
| watchdog        | 10.64.0.122:43164    |        1 |
| watchdog        | 10.64.0.122:43192    |        1 |
| watchdog        | 10.64.0.122:43216    |        1 |
+-----------------+----------------------+----------+
393 rows in set (0.00 sec)

This is causing ongoing issues for nova and wikitech (labswiki).
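The per-user totals in the table above can be produced from raw `SHOW PROCESSLIST` output with a short script. This is a hypothetical helper for illustration, not part of the incident tooling; it parses pipe-delimited rows of the shape shown above:

```python
from collections import Counter

def count_by_user(processlist_text):
    """Tally connections per (user, client-host) from pipe-delimited
    `mysql -e "SHOW PROCESSLIST"`-style rows like the table above."""
    counts = Counter()
    for line in processlist_text.splitlines():
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) < 2 or fields[0] in ("User", ""):
            continue  # skip headers, separators, and blank lines
        host = fields[1].split(":")[0]  # drop the ephemeral port
        counts[(fields[0], host)] += 1
    return counts

# A tiny sample in the same shape as the output above
sample = """\
| nova     | 10.64.20.13:47184   |        1 |
| nova     | 10.64.20.13:47186   |        1 |
| keystone | 208.80.154.23:33460 |        1 |
"""
print(count_by_user(sample).most_common())
```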

Marostegui raised the priority of this task from Normal to High. Aug 20 2018, 12:47 PM
jcrespo renamed this task from db1009 overloaded by idle connections to the nova database to m5-master overloaded by idle connections to the nova database. Aug 20 2018, 12:48 PM

Mentioned in SAL (#wikimedia-operations) [2018-08-20T13:00:32Z] <marostegui> Increase max_connections from 500 to 800 on db1073 to triage issues - T188589

I asked the DBA team to raise limits for now to avoid contention. We should work on a long term solution to avoid saturating the DBs.

For now I have done:

root@db1073.eqiad.wmnet[(none)]>  show global variables like 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 800   |
+-----------------+-------+
1 row in set (0.00 sec)

This is just a temporary solution; we will need to revert it once the application issue has been fixed.
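For reference, a bump like this is typically applied at runtime so no restart is needed; a sketch of the standard MySQL/MariaDB procedure (not necessarily the exact commands that were run):

```sql
-- Takes effect immediately, but is lost on restart unless
-- also persisted in my.cnf
SET GLOBAL max_connections = 800;
SHOW GLOBAL VARIABLES LIKE 'max_connections';
```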

Change 454020 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] keystone: limit the size of the connection pool so it reuses connections

https://gerrit.wikimedia.org/r/454020

Change 454020 merged by Bstorm:
[operations/puppet@production] keystone: limit the size of the connection pool so it reuses connections

https://gerrit.wikimedia.org/r/454020

Change 454042 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] keystone: Limiting worker process numbers

https://gerrit.wikimedia.org/r/454042

Change 454042 merged by Bstorm:
[operations/puppet@production] keystone: Limiting worker process numbers

https://gerrit.wikimedia.org/r/454042

Change 454055 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nova: Limit number of worker processes

https://gerrit.wikimedia.org/r/454055

Change 454055 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] nova: Limit number of worker processes

https://gerrit.wikimedia.org/r/454055

Change 454075 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nova: limit metadata workers

https://gerrit.wikimedia.org/r/454075

Change 454075 merged by Bstorm:
[operations/puppet@production] nova: limit metadata workers

https://gerrit.wikimedia.org/r/454075

Bstorm added a subscriber: Bstorm. Aug 20 2018, 6:39 PM

I've more than halved the number of nova workers. I didn't see a big drop in the usage on grafana this time. One thing I haven't done is limit the db connection pool for these workers. That might be a good idea.

The biggest issue overall was that cloudcontrol1003 has so many CPU cores, and worker counts in OpenStack Mitaka default to the number of cores. Since this machine has 24 cores with hyperthreading, we were spinning up scads of worker processes. Each one has its own DB connection pool (unlimited in size by default in Mitaka, but limited to 5 by default in Newton). That's why none of my patches above had any effect until we reduced the number of workers; I just wanted to explain what I was doing.
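Pool limits of the kind described here live in `nova.conf` under `[database]` (oslo.db option names; the values below are illustrative, not the ones deployed):

```ini
[database]
# cap on pooled SQLAlchemy connections per worker process
max_pool_size = 5
# extra connections allowed beyond the pool under burst load
max_overflow = 10
# seconds before an idle pooled connection is recycled (default 3600)
idle_timeout = 1800
```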

Bstorm added a comment (edited). Aug 20 2018, 7:01 PM

The fact that the idle timeout for api database connections is set at an hour by default might be why it didn't drop right away...

These settings are already changed for the main db service, but not for the api-db set of settings.

Actually, nova-api db connections are down to 11 :) It looks like the only remaining problem is the nova db itself (nova-conductor). That is already limited to 8 workers on the new server, but the max pool size is set to 10, so there are 80 possible connections to the main database without even using the overflow. However, there are also 8 conductor workers on the old server; between them they can eat 160 connections and not even be considered in overflow. Additionally, while we configured pool_timeout at 60 seconds (which applies during pool connection inits), idle connections are still not reaped until an hour goes by. I can cut that in half so idle connections linger for less time. I'll leave any other tweaks until morning, when more people are around to stop me. I think this should help things.
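The arithmetic above can be made explicit; a minimal sketch using the worker and pool numbers quoted in this comment (SQLAlchemy keeps up to the pool size in open connections per process even when idle):

```python
def idle_ceiling(workers, pool_size):
    """Connections the workers can hold open without touching overflow:
    each worker process keeps up to pool_size pooled connections."""
    return workers * pool_size

# 8 conductor workers on the new server, max pool size 10
assert idle_ceiling(8, 10) == 80
# plus 8 more conductor workers on the old server
assert idle_ceiling(16, 10) == 160
```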

Change 454082 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nova: shorten idle timeout for sqlalchemy to reap db connections

https://gerrit.wikimedia.org/r/454082

Change 454082 merged by Bstorm:
[operations/puppet@production] nova: shorten idle timeout for sqlalchemy to reap db connections

https://gerrit.wikimedia.org/r/454082

Thanks a lot Brooke for getting this fixed.
I will set max_connections back to 500 tomorrow morning, as it looks fine now.

Mentioned in SAL (#wikimedia-operations) [2018-08-21T04:54:40Z] <marostegui> Set max_connections back from 800 to 500 on db1073 - T188589

nova is now using 107 connections.
nova_api is using 5 connections.
The general health of the connection pool is a lot better with your changes though: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1534740780129&to=1534827180129

I have switched back max_connections from 800 to 500

@Bstorm are you planning to apply the final tweaks to nova, as mentioned at T188589#4516087, to reduce nova's connection count? It is currently at 139.
Thanks!

I've tried some already! I think there's somewhere else I might need to look.

Ah right! Thanks for the heads up, I wasn't aware :-)

Change 454830 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nova: restrict worker numbers further

https://gerrit.wikimedia.org/r/454830

Change 454830 abandoned by Bstorm:
nova: restrict worker numbers further

Reason:
Found something else

https://gerrit.wikimedia.org/r/454830

Looking closer at the nova connections right now, they are all for two older servers. That said, we have 8 workers and one master for the conductor on labcontrol1001 (pool size of 10 + possible overflow of 25 connections), so they could easily keep 90 connections open if they used them. They are keeping 57 open, which suggests we are not constraining them at all at this time. With a pool size of 5, we'd have over 50 connections available without having to touch the overflow (or ever actually open a connection to serve a request). With a pool size of 4, we might actually see the overflow used and find out how many connections we actually use.

Does this sound sane, @chasemp @Andrew and @aborrero ?

Change 454843 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nova: reduce the pool size for database connections a lot

https://gerrit.wikimedia.org/r/454843

Change 454843 merged by Bstorm:
[operations/puppet@production] nova: reduce the pool size for database connections a lot

https://gerrit.wikimedia.org/r/454843


Does this sound sane, @chasemp @Andrew and @aborrero ?

+1

Mentioned in SAL (#wikimedia-cloud) [2018-08-23T16:17:42Z] <arturo> T188589 bstorm_ merged patch to reduce nova DB connection usage

Apart from our efforts to reduce DB usage by OpenStack, we should also consider increasing capacity on the DB side in case our Cloud VPS usage requires reverting those patches.

I already suggested having our own database:

  • our openstack deployments and usage will just continue growing
  • we don't want to affect other services (like wikitech the other day) because we are sharing the DB

What do you think?

aborrero closed this task as Resolved. Aug 24 2018, 7:49 AM

Closing this task now since we are all happy with the number of connections OpenStack is currently using on m5-master.

Change 463789 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Nova: reduce number of worker nodes

https://gerrit.wikimedia.org/r/463789

Change 463789 merged by Andrew Bogott:
[operations/puppet@production] Nova: reduce number of worker nodes

https://gerrit.wikimedia.org/r/463789