I've detected that, from time to time, neutron agents failed to communicate with the neutron server.
This can be seen in the neutron server logs as follows:
2018-09-26 09:47:15.810 36689 WARNING neutron.db.agents_db [req-cd20f8f7-c278-4a2b-b9a8-f30dc77c275e - - - - -] Agent healthcheck: found 9 dead agents out of 12: Type Last heartbeat host DHCP agent 2018-09-26 09:35:22 cloudnet1004 L3 agent 2018-09-26 09:41:44 cloudnet1003 L3 agent 2018-09-26 09:40:27 cloudnet1004 Linux bridge agent 2018-09-26 09:37:43 cloudnet1003 Linux bridge agent 2018-09-26 09:37:20 cloudvirt1020 Linux bridge agent 2018-09-26 09:19:10 cloudvirt1021 Linux bridge agent 2018-09-26 09:39:14 cloudvirt1022 Linux bridge agent 2018-09-26 09:41:28 cloudvirt1019 DHCP agent 2018-09-26 09:29:29 cloudnet1003
Also:
root@cloudcontrol1003:~# neutron agent-list +--------------------------------------+--------------------+---------------+-------------------+-------+----------------+---------------------------+ | id | agent_type | host | availability_zone | alive | admin_state_up | binary | +--------------------------------------+--------------------+---------------+-------------------+-------+----------------+---------------------------+ | 468aef2a-8eb6-4382-abba-bc284efd9fa5 | DHCP agent | cloudnet1004 | nova | xxx | True | neutron-dhcp-agent | | 601bef99-b53c-4e6a-b384-65d1feebedff | Metadata agent | cloudnet1003 | | :-) | True | neutron-metadata-agent | | 8af5d8a1-2e29-40e6-baf0-3cd79a7ac77b | L3 agent | cloudnet1003 | nova | :-) | True | neutron-l3-agent | | 970df1d1-505d-47a4-8d35-1b13c0dfe098 | L3 agent | cloudnet1004 | nova | xxx | True | neutron-l3-agent | | 9f8833de-11a4-4395-8da5-f57fe8326659 | Linux bridge agent | cloudnet1003 | | xxx | True | neutron-linuxbridge-agent | | ad3461d7-b79e-4279-921d-5a476e296767 | Linux bridge agent | cloudnet1004 | | xxx | True | neutron-linuxbridge-agent | | b0f1cdf2-8d03-4f7b-978c-201ecea69b84 | Linux bridge agent | cloudvirt1020 | | xxx | True | neutron-linuxbridge-agent | | b2f9da63-2f16-4aa5-9400-ae708a733f91 | Linux bridge agent | cloudvirt1021 | | xxx | True | neutron-linuxbridge-agent | | d475e07d-52b3-476e-9a4f-e63b21e1075e | Metadata agent | cloudnet1004 | | :-) | True | neutron-metadata-agent | | e382a233-e6a0-422e-9d2e-5651082783fc | Linux bridge agent | cloudvirt1022 | | xxx | True | neutron-linuxbridge-agent | | fc45a34d-d8a4-45fe-982d-5b4b7a8fcde1 | Linux bridge agent | cloudvirt1019 | | xxx | True | neutron-linuxbridge-agent | | ff2a8228-3748-4588-927b-4b6563da9ca0 | DHCP agent | cloudnet1003 | nova | xxx | True | neutron-dhcp-agent | +--------------------------------------+--------------------+---------------+-------------------+-------+----------------+---------------------------+
An example of logs in a failed agent:
2018-09-26 09:29:03.512 1355 INFO neutron.agent.dhcp.agent [-] Synchronizing state 2018-09-26 09:29:03.779 1355 INFO neutron.agent.dhcp.agent [-] All active networks have been fetched through RPC. 2018-09-26 09:29:03.780 1355 INFO neutron.agent.dhcp.agent [-] Starting network 7425e328-560c-4f00-8e99-706f3fb90bb4 dhcp configuration 2018-09-26 09:29:04.840 1355 INFO neutron.agent.dhcp.agent [-] Finished network 7425e328-560c-4f00-8e99-706f3fb90bb4 dhcp configuration 2018-09-26 09:29:04.842 1355 INFO neutron.agent.dhcp.agent [-] Synchronizing state complete 2018-09-26 09:39:59.749 1355 ERROR neutron.common.rpc [req-066563a7-fea1-47aa-ba5e-2a076743090d - - - - -] Timeout in RPC method report_state. Waiting for 5 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough. 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent [req-066563a7-fea1-47aa-ba5e-2a076743090d - - - - -] Failed reporting state! 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent Traceback (most recent call last): 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 583, in _report_state 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent ctx, self.agent_state, True) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 86, in report_state 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent return method(context, 'report_state', **kwargs) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 155, in call 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent time.sleep(wait) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__ 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent self.force_reraise() 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent six.reraise(self.type_, self.value, self.tb) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 136, in call 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent return self._original_context.call(ctxt, method, **kwargs) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent retry=self.retry) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent timeout=timeout, retry=retry) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent retry=retry) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent result = self._waiter.wait(msg_id, timeout) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent message = self.waiters.get(msg_id, timeout=timeout) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent 'to message ID %s' % msg_id) 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID 214a6541df574198acecd0d25f7ac20f 2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent 2018-09-26 09:40:05.132 1355 WARNING oslo.service.loopingcall [req-066563a7-fea1-47aa-ba5e-2a076743090d - - - - -] Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 575.39 sec
Which may indicate issues with the queuing system.