neutron agents losing RabbitMQ connectivity don't crash properly
Open, Needs Triage, Public

Description

If a Neutron agent gets disconnected from RabbitMQ, it logs errors like this:

Jun 22 13:56:21 cloudvirt1035 neutron-linuxbridge-agent[3716842]: 2022-06-22 13:56:21.600 3716842 ERROR oslo.messaging._drivers.impl_rabbit [-] [28ad61da-560c-4ddd-ba63-2f01d4b402a6] AMQP server on cloudcontrol1005.wikimedia.org:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 6 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]: 2022-06-22 13:56:22.570 3716842 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]: 2022-06-22 13:56:22.617 3716842 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]: Traceback (most recent call last):
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]:     timer()
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]:     cb(*args, **kw)
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]:     waiter.switch()
Jun 22 13:56:22 cloudvirt1035 neutron-linuxbridge-agent[3716842]: greenlet.error: cannot switch to a different thread

After that, the agent hangs instead of crashing, so it needs a manual restart. We have alerting that tells us when to do that, but recovery should still happen automatically.
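
As a possible stopgap, the hang could be detected from the log itself. The following is a minimal sketch, not an existing tool: it assumes the `greenlet.error: cannot switch to a different thread` line from the traceback above is a reliable sign that the agent has hung, which this task has not confirmed. The names `HANG_SIGNATURE`, `UNIT_PID`, and `find_hung_agents` are hypothetical, and how the journal lines are collected and how the restart is triggered are left out.

```python
import re

# Signature copied from the traceback above. Treating it as a reliable
# indicator of a hung agent is an assumption, not a confirmed fact.
HANG_SIGNATURE = "greenlet.error: cannot switch to a different thread"

# Matches the "<unit>[<pid>]:" prefix seen in the journal lines above.
UNIT_PID = re.compile(r"(?P<unit>[\w.-]+)\[(?P<pid>\d+)\]:")


def find_hung_agents(log_lines):
    """Return the set of (unit, pid) pairs whose lines contain the hang signature."""
    hung = set()
    for line in log_lines:
        if HANG_SIGNATURE not in line:
            continue
        match = UNIT_PID.search(line)
        if match:
            hung.add((match.group("unit"), int(match.group("pid"))))
    return hung
```

The returned pairs could then be handed to whatever restart mechanism is preferred, for example a periodic job that restarts the matching unit.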