
nova-api service flaps
Closed, ResolvedPublic

Description

The haproxy logs show a lot of things like

Server nova_api_backend/cloudcontrol1006.wikimedia.org is DOWN, reason: Layer7 timeout, check duration: 3001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

It seems to happen intermittently on all backends -- sometimes all three are down at once, and then the outage is visible to users.
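For context on the log line above: haproxy runs a Layer7 (HTTP) health check against each nova-api backend and marks a server DOWN when the check doesn't answer within the check timeout (the ~3001ms durations above). A minimal sketch of what such a backend stanza looks like -- the hostname is from the log, but the port and check options are illustrative, not copied from our puppetized config:

# haproxy.cfg (sketch) -- Layer7 health checks for the nova-api backends.
# A server is marked DOWN after 'fall' consecutive checks that fail or exceed
# 'timeout check', and comes back UP after 'rise' successful checks.
backend nova_api_backend
    balance roundrobin
    option httpchk GET /
    timeout check 3s
    server cloudcontrol1006 cloudcontrol1006.wikimedia.org:8774 check inter 5s fall 3 rise 2
    # ... plus the other two cloudcontrol backends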

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper.

So far I can't tell cause from effect from red herring:

[None req-2dde1ba0-35b2-4fe5-b32e-e02a7e7acba3 novaadmin admin - - default default] Exception during reset or similar: pymysql.err.OperationalError: (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 756, in _write_bytes
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     self._sock.sendall(data)
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool TimeoutError: [Errno 110] Connection timed out
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool During handling of the above exception, another exception occurred:
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 739, in _finalize_fairy
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     fairy._reset(pool)
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 988, in _reset
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     pool._dialect.do_rollback(self)
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 682, in do_rollback
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     dbapi_connection.rollback()
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 479, in rollback
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     self._execute_command(COMMAND.COM_QUERY, "ROLLBACK")
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 814, in _execute_command
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     self._write_bytes(packet)
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 759, in _write_bytes
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool     raise err.OperationalError(
2023-01-30 03:03:24.060 3815588 ERROR sqlalchemy.pool.impl.QueuePool pymysql.err.OperationalError: (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")

a couple of minutes later...

-] [99ca1723-354c-4c2a-86ab-c9f28e7be619] AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 is unreachable: [Errno 104] Connection reset by peer. Trying again in 0 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer

And here's the rabbit connection going down and coming back:

2023-01-30 02:44:40.362 [info] <0.30266.6672> Connection <0.30266.6672> (208.80.154.85:48636 -> 208.80.154.147:5671) has a client-provided name: uwsgi:3973491:734b14fd-7f16-4ea4-82e6-b48e292dd125
2023-01-30 02:44:40.364 [info] <0.30266.6672> connection <0.30266.6672> (208.80.154.85:48636 -> 208.80.154.147:5671 - uwsgi:3973491:734b14fd-7f16-4ea4-82e6-b48e292dd125): user 'nova' authenticated and granted access to vhost '/'
2023-01-30 03:08:12.062 [error] <0.22216.6372> closing AMQP connection <0.22216.6372> (208.80.155.104:49024 -> 208.80.154.147:5671 - uwsgi:3866805:99ca1723-354c-4c2a-86ab-c9f28e7be619):
missed heartbeats from client, timeout: 60s
2023-01-30 03:08:28.723 [info] <0.18072.6693> accepting AMQP connection <0.18072.6693> (208.80.155.104:57414 -> 208.80.154.147:5671)
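The "missed heartbeats from client, timeout: 60s" line means rabbitmq gave up on the connection because the nova-api/uWSGI process stopped answering AMQP heartbeats for a full minute -- usually a sign of a blocked or overloaded process rather than a network fault. The client side of that handshake is controlled by oslo.messaging; roughly (a sketch using the library defaults, not necessarily our deployed values):

# nova.conf (sketch) -- oslo.messaging heartbeat settings on the client side.
[oslo_messaging_rabbit]
# Consider the connection dead if no heartbeat is acknowledged within this many seconds.
heartbeat_timeout_threshold = 60
# Send heartbeats this many times per timeout window (i.e. every 30s here).
heartbeat_rate = 2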

Change 884922 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Rabbitmq: use OpenStack bpo packages for rabbit

https://gerrit.wikimedia.org/r/884922

Change 884922 merged by Andrew Bogott:

[operations/puppet@production] Rabbitmq: use OpenStack bpo packages for rabbit

https://gerrit.wikimedia.org/r/884922

The rabbit update doesn't seem to have resolved things.

I also see these keystone logs which look ominous. This happens around 4 times/hour:

[None req-5f826db9-873e-4d1f-b202-b8136507a41b novaadmin admin - - default default] Exception during reset or similar: pymysql.err.OperationalError: (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 756, in _write_bytes
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     self._sock.sendall(data)
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool TimeoutError: [Errno 110] Connection timed out
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool During handling of the above exception, another exception occurred:
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool Traceback (most recent call last):
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 739, in _finalize_fairy
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     fairy._reset(pool)
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/sqlalchemy/pool/base.py", line 988, in _reset
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     pool._dialect.do_rollback(self)
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 682, in do_rollback
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     dbapi_connection.rollback()
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 479, in rollback
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     self._execute_command(COMMAND.COM_QUERY, "ROLLBACK")
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 814, in _execute_command
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     self._write_bytes(packet)
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 759, in _write_bytes
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool     raise err.OperationalError(
2023-01-31 22:51:43.387 2975644 ERROR sqlalchemy.pool.impl.QueuePool pymysql.err.OperationalError: (2006, "MySQL server has gone away (TimeoutError(110, 'Connection timed out'))")

MySQL says wait_timeout is 28800 (seconds, i.e. 8 hours).

haproxy says its client and server timeouts for mysql are 90m (= 5400 seconds).

keystone says connection_recycle_time is 3600 (seconds).

So how is keystone ever getting a timeout?

haproxy doesn't show any mysql activity, so it isn't breaking connections.
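To make the comparison concrete, the three layers stack like this (values as quoted above; the point being that the pool should always recycle a connection before anything upstream kills it):

# keystone/oslo.db:  connection_recycle_time = 3600   (1h)    -- pool discards and reopens idle connections
# haproxy:           timeout client/server   = 90m    (5400s) -- proxy cuts idle TCP sessions
# mysql/galera:      wait_timeout            = 28800  (8h)    -- server drops idle clients
#
# 3600 < 5400 < 28800, so a pooled connection should never live long enough to be
# cut by haproxy or MySQL -- which is why the 2006 'gone away' errors are confusing.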

I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/888107 in an attempt to standardize db config. I don't really expect it to fix nova but I'm hopeful that it will get the 'has gone away' warnings to stop, or at least provide us with a single lever to debug with.

Change 888220 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack database config: lower timeouts quite a bit

https://gerrit.wikimedia.org/r/888220

Change 888220 merged by Andrew Bogott:

[operations/puppet@production] Openstack database config: lower timeouts quite a bit

https://gerrit.wikimedia.org/r/888220

Change 888242 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack database config: enable 'use_db_reconnect'

https://gerrit.wikimedia.org/r/888242

Change 888242 merged by Andrew Bogott:

[operations/puppet@production] Openstack database config: enable 'use_db_reconnect'

https://gerrit.wikimedia.org/r/888242
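For reference, use_db_reconnect turns on oslo.db's retry wrapper, so a query that hits a dead connection is transparently retried on a fresh one instead of bubbling up as a 2006 error. Roughly (a sketch; these are oslo.db's knobs with illustrative values, not necessarily our puppetized ones):

# nova.conf / keystone.conf (sketch) -- oslo.db reconnect/retry behaviour.
[database]
connection_recycle_time = 3600   # recycle pooled connections after an hour, as above
use_db_reconnect = true          # wrap DB API calls with reconnect-and-retry on disconnect
db_retry_interval = 1            # seconds to wait before the first retry
db_inc_retry_interval = true     # back off between retries...
db_max_retry_interval = 10       # ...up to this many seconds
db_max_retries = 20              # raise the error after this many failed attempts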

Change 888703 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps/galera: standardize timeouts across haprox/mysql

https://gerrit.wikimedia.org/r/888703

Change 888703 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps/galera: standardize timeouts across haprox/mysql

https://gerrit.wikimedia.org/r/888703

I reverted that one because it made things worse.

Change 889235 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] OpenStack nova: increase nova-api workers per node from 2 to 6

https://gerrit.wikimedia.org/r/889235

Change 889235 merged by Andrew Bogott:

[operations/puppet@production] OpenStack nova: increase nova-api workers per node from 2 to 8

https://gerrit.wikimedia.org/r/889235
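For reference, the knob changed here is the number of nova-api worker processes on each cloudcontrol; with only 2 workers, a couple of slow requests (or requests stuck on a dead database/rabbit connection) are enough to leave nothing free to answer the haproxy health check, which then times out and flaps the backend. A sketch of the two places this is typically set -- the uwsgi: connection names in the rabbit logs suggest the uWSGI form is the one that applies here:

# nova.conf (sketch) -- when nova-api runs under its own eventlet server:
[DEFAULT]
osapi_compute_workers = 8

# nova-api uWSGI ini (sketch) -- when it runs under uWSGI:
[uwsgi]
processes = 8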

I think this is fixed by the increase in nova-api workers per node.