Page MenuHomePhabricator

cloudvps: neutron: agents failed to communicate with server
Closed, ResolvedPublic

Description

I've detected that, from time to time, neutron agents failed to communicate with the neutron server.

This can be seen in the neutron server logs as follows:

2018-09-26 09:47:15.810 36689 WARNING neutron.db.agents_db [req-cd20f8f7-c278-4a2b-b9a8-f30dc77c275e - - - - -] Agent healthcheck: found 9 dead agents out of 12:
                Type       Last heartbeat host
          DHCP agent  2018-09-26 09:35:22 cloudnet1004
            L3 agent  2018-09-26 09:41:44 cloudnet1003
            L3 agent  2018-09-26 09:40:27 cloudnet1004
  Linux bridge agent  2018-09-26 09:37:43 cloudnet1003
  Linux bridge agent  2018-09-26 09:37:20 cloudvirt1020
  Linux bridge agent  2018-09-26 09:19:10 cloudvirt1021
  Linux bridge agent  2018-09-26 09:39:14 cloudvirt1022
  Linux bridge agent  2018-09-26 09:41:28 cloudvirt1019
          DHCP agent  2018-09-26 09:29:29 cloudnet1003

Also:

root@cloudcontrol1003:~# neutron agent-list
+--------------------------------------+--------------------+---------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host          | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+---------------+-------------------+-------+----------------+---------------------------+
| 468aef2a-8eb6-4382-abba-bc284efd9fa5 | DHCP agent         | cloudnet1004  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 601bef99-b53c-4e6a-b384-65d1feebedff | Metadata agent     | cloudnet1003  |                   | :-)   | True           | neutron-metadata-agent    |
| 8af5d8a1-2e29-40e6-baf0-3cd79a7ac77b | L3 agent           | cloudnet1003  | nova              | :-)   | True           | neutron-l3-agent          |
| 970df1d1-505d-47a4-8d35-1b13c0dfe098 | L3 agent           | cloudnet1004  | nova              | xxx   | True           | neutron-l3-agent          |
| 9f8833de-11a4-4395-8da5-f57fe8326659 | Linux bridge agent | cloudnet1003  |                   | xxx   | True           | neutron-linuxbridge-agent |
| ad3461d7-b79e-4279-921d-5a476e296767 | Linux bridge agent | cloudnet1004  |                   | xxx   | True           | neutron-linuxbridge-agent |
| b0f1cdf2-8d03-4f7b-978c-201ecea69b84 | Linux bridge agent | cloudvirt1020 |                   | xxx   | True           | neutron-linuxbridge-agent |
| b2f9da63-2f16-4aa5-9400-ae708a733f91 | Linux bridge agent | cloudvirt1021 |                   | xxx   | True           | neutron-linuxbridge-agent |
| d475e07d-52b3-476e-9a4f-e63b21e1075e | Metadata agent     | cloudnet1004  |                   | :-)   | True           | neutron-metadata-agent    |
| e382a233-e6a0-422e-9d2e-5651082783fc | Linux bridge agent | cloudvirt1022 |                   | xxx   | True           | neutron-linuxbridge-agent |
| fc45a34d-d8a4-45fe-982d-5b4b7a8fcde1 | Linux bridge agent | cloudvirt1019 |                   | xxx   | True           | neutron-linuxbridge-agent |
| ff2a8228-3748-4588-927b-4b6563da9ca0 | DHCP agent         | cloudnet1003  | nova              | xxx   | True           | neutron-dhcp-agent        |
+--------------------------------------+--------------------+---------------+-------------------+-------+----------------+---------------------------+

An example of logs in a failed agent:

2018-09-26 09:29:03.512 1355 INFO neutron.agent.dhcp.agent [-] Synchronizing state
2018-09-26 09:29:03.779 1355 INFO neutron.agent.dhcp.agent [-] All active networks have been fetched through RPC.
2018-09-26 09:29:03.780 1355 INFO neutron.agent.dhcp.agent [-] Starting network 7425e328-560c-4f00-8e99-706f3fb90bb4 dhcp configuration
2018-09-26 09:29:04.840 1355 INFO neutron.agent.dhcp.agent [-] Finished network 7425e328-560c-4f00-8e99-706f3fb90bb4 dhcp configuration
2018-09-26 09:29:04.842 1355 INFO neutron.agent.dhcp.agent [-] Synchronizing state complete
2018-09-26 09:39:59.749 1355 ERROR neutron.common.rpc [req-066563a7-fea1-47aa-ba5e-2a076743090d - - - - -] Timeout in RPC method report_state. Waiting for 5 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent [req-066563a7-fea1-47aa-ba5e-2a076743090d - - - - -] Failed reporting state!
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp/agent.py", line 583, in _report_state
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     ctx, self.agent_state, True)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 86, in report_state
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     return method(context, 'report_state', **kwargs)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 155, in call
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     time.sleep(wait)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     self.force_reraise()
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     six.reraise(self.type_, self.value, self.tb)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 136, in call
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     return self._original_context.call(ctxt, method, **kwargs)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     retry=self.retry)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     timeout=timeout, retry=retry)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     retry=retry)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     result = self._waiter.wait(msg_id, timeout)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent     'to message ID %s' % msg_id)
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent MessagingTimeout: Timed out waiting for a reply to message ID 214a6541df574198acecd0d25f7ac20f
2018-09-26 09:40:05.130 1355 ERROR neutron.agent.dhcp.agent 
2018-09-26 09:40:05.132 1355 WARNING oslo.service.loopingcall [req-066563a7-fea1-47aa-ba5e-2a076743090d - - - - -] Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 575.39 sec

Which may indicate issues with the queuing system.

Details

Related Gerrit Patches:
operations/dns : masterAdd IPv6 for cloudvirt102[0-2].eqiad.wmnet
operations/puppet : productionrabbitmq: fix ferm list syntax
operations/puppet : productionRabbitmq: allow labvirt/cloudvirt access via ipv6
operations/dns : masterAdded ipv6 for cloudvirt1019.eqiad.wmnet
operations/puppet : productioncloudvirt1019: add ipv6 address
operations/puppet : productioncloudvps: eqiad1: use lower timeout values

Event Timeline

aborrero triaged this task as Medium priority.Sep 26 2018, 9:50 AM
aborrero created this task.
Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptSep 26 2018, 9:50 AM

Mentioned in SAL (#wikimedia-cloud) [2018-09-26T10:10:38Z] <arturo> restart neutron-server in cloudcontrol1003, investigating T205524

Mentioned in SAL (#wikimedia-cloud) [2018-09-26T10:13:26Z] <arturo> T205524 restart all agents in cloudnet1004.eqiad.wmnet

Mentioned in SAL (#wikimedia-cloud) [2018-09-26T10:20:15Z] <arturo> T205524 stop/start all neutron agents in cloudnet1003.eqiad.wmnet

All the restarts do nothing.

I'm convinced there is some issue with the message queuing system.

Running neutron-l3-agent in debug mode in cloudnet1003.eqiad.wmnet:

2018-09-26 10:33:10.819 46786 DEBUG oslo_messaging._drivers.amqpdriver [-] CALL msg_id: d9e25bf8d1cf44769c821f90e2cbf7ac exchange 'neutron' topic 'q-l3-plugin' _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:454
2018-09-26 10:33:10.872 46786 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: d9e25bf8d1cf44769c821f90e2cbf7ac __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:302
2018-09-26 10:33:18.944 46786 DEBUG oslo_messaging._drivers.amqpdriver [-] CALL msg_id: a4d05d345a8b4bf39a88b58e9db3b2da exchange 'neutron' topic 'q-reports-plugin' _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:454
2018-09-26 10:33:18.964 46786 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: a4d05d345a8b4bf39a88b58e9db3b2da __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:302
[... looks good, but suddenly .-.]
2018-09-26 10:33:48.947 46786 DEBUG oslo_messaging._drivers.amqpdriver [-] CALL msg_id: 420ad9b0f8b7456ea0165cb4cb803bae exchange 'neutron' topic 'q-reports-plugin' _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:454
2018-09-26 10:34:12.983 46786 DEBUG oslo_service.periodic_task [req-f97d128d-3b92-4654-a94e-c27bc8939746 - - - - -] Running periodic task L3NATAgentWithStateReport.periodic_sync_routers_task run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
[... ^^^ no instant reply as before? ^^^ ...]
2018-09-26 10:34:48.735 46786 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process._check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:273
2018-09-26 10:34:48.736 46786 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.001s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:285
2018-09-26 10:34:48.886 46786 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process._check_child_processes" :: waited 0.000s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:273
2018-09-26 10:34:48.887 46786 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process._check_child_processes" :: held 0.001s inner /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:285
2018-09-26 10:34:48.953 46786 ERROR neutron.common.rpc [-] Timeout in RPC method report_state. Waiting for 36 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2018-09-26 10:34:48.953 46786 WARNING neutron.common.rpc [-] Increasing timeout for report_state calls to 120 seconds. Restart the agent to restore it to the default value.
2018-09-26 10:34:52.984 46786 DEBUG oslo_service.periodic_task [req-f97d128d-3b92-4654-a94e-c27bc8939746 - - - - -] Running periodic task L3NATAgentWithStateReport.periodic_sync_routers_task run_periodic_tasks /usr/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 682, in _report_state
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     True)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/neutron/agent/rpc.py", line 86, in report_state
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 155, in call
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     time.sleep(wait)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     self.force_reraise()
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     six.reraise(self.type_, self.value, self.tb)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 136, in call
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     return self._original_context.call(ctxt, method, **kwargs)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     retry=self.retry)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     timeout=timeout, retry=retry)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     retry=retry)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent     'to message ID %s' % msg_id)
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 420ad9b0f8b7456ea0165cb4cb803bae
2018-09-26 10:35:24.844 46786 ERROR neutron.agent.l3.agent 
2018-09-26 10:35:24.847 46786 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 65.90 sec
[...]

Mentioned in SAL (#wikimedia-cloud) [2018-09-26T10:40:32Z] <arturo> T205524 all sorts of restarts in all neutron daemons

For a working agent<->server communication I can see messages flowing in both directions:

l3-agent
2018-09-26 10:44:14.088 331 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 3c7868c2f62c4d43b6fb7b69c47940bb __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:302
2018-09-26 10:44:14.089 331 DEBUG oslo_messaging._drivers.amqpdriver [req-e476f756-086c-4eea-afb7-781182ff78f1 - - - - -] CALL msg_id: 894df06cd44641c29e1cd6810a962aa3 exchange 'neutron' topic 'q-l3-plugin' _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:454
server
root@cloudcontrol1003:~# grep 3c7868c2f62c4d43b6fb7b69c47940bb /var/log/neutron/neutron-server.log
2018-09-26 10:44:14.015 34188 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 3c7868c2f62c4d43b6fb7b69c47940bb reply to reply_aa0977683f144547b55808cc25debb65 __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:201
2018-09-26 10:44:14.085 34188 DEBUG oslo_messaging._drivers.amqpdriver [req-c3c5c54a-1478-41be-884c-44605521aa97 - - - - -] sending reply msg_id: 3c7868c2f62c4d43b6fb7b69c47940bb reply queue: reply_aa0977683f144547b55808cc25debb65 time elapsed: 0.0685047409497s _send_reply /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:74

When there are missed messages, I just can't find the messages in neutron server logs, as if the message never existed for neutron-server (however, the l3-agent sent it):

l3-agent
2018-09-26 10:44:06.281 331 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 7d88b9deb5c248168b3715fa929ff533
2018-09-26 10:44:06.281 331 ERROR neutron.agent.l3.agent 
2018-09-26 10:44:06.285 331 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 45.32 sec
server
root@cloudcontrol1003:~# grep 7d88b9deb5c248168b3715fa929ff533 /var/log/neutron/neutron-server.log
root@cloudcontrol1003:~# 
[..nothing...]

So, it seems in this case the message with id 7d88b9deb5c248168b3715fa929ff533 got lost in the middle.

Change 463062 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: eqiad1: use lower timeout values

https://gerrit.wikimedia.org/r/463062

Change 463062 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: eqiad1: use lower timeout values

https://gerrit.wikimedia.org/r/463062

15:17 <arturo> well, after the patch, even though we still have the problem of unreplied rabbitmq messages and agents go down arbitrarily, I see them coming alive again much faster
15:18 <arturo> which may be a temporal mitigation

Change 463079 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt1019: add ipv6 address

https://gerrit.wikimedia.org/r/463079

Change 463079 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt1019: add ipv6 address

https://gerrit.wikimedia.org/r/463079

Change 463083 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Added ipv6 for cloudvirt1019.eqiad.wmnet

https://gerrit.wikimedia.org/r/463083

Change 463083 merged by Andrew Bogott:
[operations/dns@master] Added ipv6 for cloudvirt1019.eqiad.wmnet

https://gerrit.wikimedia.org/r/463083

Change 463090 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Rabbitmq: allow labvirt/cloudvirt access via ipv6

https://gerrit.wikimedia.org/r/463090

Change 463090 merged by Andrew Bogott:
[operations/puppet@production] Rabbitmq: allow labvirt/cloudvirt access via ipv6

https://gerrit.wikimedia.org/r/463090

Change 463092 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: fix ferm list syntax

https://gerrit.wikimedia.org/r/463092

Change 463092 merged by Andrew Bogott:
[operations/puppet@production] rabbitmq: fix ferm list syntax

https://gerrit.wikimedia.org/r/463092

Current state of this:

  • the agent status list seems to be more relaxed now, we don't see flapping states
  • we don't see anything weird in the neutron-server log
  • neutron doesn't seem to log all agents heartbeats ¿¿ (specially linuxbridge agents) but we assume they are working, even though they show the timeout tracebacks in their logs

I think we can wait a couple of days (or weeks) and see how things evolve.

Next steps, in case we want to follow up investigating this:

  • dive further into google looking for already opened bugs, reports or something like that
  • investigate how to further investigate what's going on inside rabbitmq with these unreplied messages
  • dive into the neutron source code :-/

Change 463138 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/dns@master] Add IPv6 for cloudvirt102[0-2].eqiad.wmnet

https://gerrit.wikimedia.org/r/463138

Change 463138 merged by GTirloni:
[operations/dns@master] Add IPv6 for cloudvirt102[0-2].eqiad.wmnet

https://gerrit.wikimedia.org/r/463138

I can't find the error messages in the logs anymore but it's very likely I'm not looking right. @aborrero does the situation look better now?

I can't find the error messages in the logs anymore but it's very likely I'm not looking right. @aborrero does the situation look better now?

I've been checking for a couple of days and I don't detect the issue now. But I don't fully trust neutron behaving correctly or our fix being definitive :-P So I will keep checking from time to time some more time.

You can also check and ping me if you see something weird.

I usually check the output of neutron agent-list and also /var/log/neutron/neutron-server.log (in the cloudnet nodes) and /var/log/neutron/neutron-*-agent.log (in agent nodes).

After T205986: cloudnet1004: spontaneous reboot today, I see issues with agents again all over the place, including python tracebacks in the logs :-(

It seems these issues continue to happen but sporadically, for up to a minute:

2018-11-28 09:32:57.849 24211 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 0 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2018-11-28 09:32:58.055 24211 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 3 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2018-11-30 01:06:41.969 24198 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 3 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
...
2018-11-30 01:07:16.053 24202 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 2 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2018-12-01 03:01:31.737 38696 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 2 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
...
2018-12-01 03:01:35.003 38674 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 3 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.

In recent logs, the timeouts are happening in get_ports (as opposed to the report_state in this task's description). This particular error seems to be described in https://access.redhat.com/solutions/3292151 (unfortunately my free developer subscription isn't allowed to access it).

It's interesting that all the message IDs in the errors seem to arrive a minute later but, by that time, the threads that made the calls don't exist anymore. Errors start at 03:01:31 and all messages arrive at once at 03:01:39:

2018-12-01 03:01:39.051 38696 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : dcb440f397f940a29f314518a109f99e
2018-12-01 03:01:39.102 38680 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 7f31e0103e324b49ba5898da86e3e633
2018-12-01 03:01:39.178 38696 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : ea6b219cae65439a90b2e2314bb09b29
2018-12-01 03:01:39.251 38683 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : b01084ad974c48229384d6b5f4e0938b
2018-12-01 03:01:39.303 38694 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 6d549ff673ac43b6a515945938a2e4ba
...

@arturo Maybe we can adjust the timeouts like you did before? It feels like a <10sec delay should trigger such extreme cleanup of threads. Maybe we can wait 30-45sec in such cases?

On the topic of monitoring RabbitMQ, we have the rabbitmq-exporter running but I couldn't find a dashboard for it. Do you know if we have something to visualize those metrics?

Related: T170492

aborrero lowered the priority of this task from Medium to Low.Dec 3 2018, 10:27 AM
aborrero moved this task from Inbox to Important on the cloud-services-team (Kanban) board.

I just created a grafana dashboard: https://grafana.wikimedia.org/dashboard/db/cloudvps-rabbitmq?orgId=1&from=1514764800000&to=1543832014906

Unfortunately it seems we stopped storing data a couple of weeks ago (probably worth creating a separate phab task).

I read T170492: figure out if nodepool is overwhelming rabbitmq and/or nova and checked this:

aborrero@cloudcontrol1003:~$ sudo zgrep neutron /var/log/rabbitmq/notifications_error.log.*.gz | less

But the most recent entry is "timestamp": "2018-09-26 09:20:48.237875". So we don't have any additional info on what's failing.

Our current values for timeouts/refresh/reports etc have been working quite stable for the last couple of months. Would you really change them? I'm not really worried about this issue as long as is not striking us.
Before any additional fine-tuning I would suggest we get metrics properly running.

I don't think we are being impacted anymore. If things are breaking, they are coming back to normal pretty fast.

I might make it more difficult to monitor our logs with all that noise but we can worry about that when we actually need to.

GTirloni removed a subscriber: GTirloni.Dec 20 2018, 6:49 PM
Dzahn added a subscriber: Dzahn.Jan 4 2019, 12:02 AM

Icinga says there is no kvm running on cloudvirt1019. Code changes including that name linked to this ticket

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudvirt1019&service=ensure+kvm+processes+are+running

Icinga says there is no kvm running on cloudvirt1019. Code changes including that name linked to this ticket
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudvirt1019&service=ensure+kvm+processes+are+running

cloudvirt1019 has HW issues. I don't think it's related to this particular phab task.

cc @JHedden this may be related to the rammitmq issues you detected.

Dzahn removed a subscriber: Dzahn.Jul 29 2019, 7:01 PM
JHedden closed this task as Resolved.Sep 11 2019, 3:26 PM
JHedden claimed this task.

T224981 seems to have resolved this.