
[nova-api,cloudrabbit] Connectivity issues from all cloudcontrols to all cloudrabbit nodes
Open, High, Public

Description

We are seeing connectivity issues from all of the cloudcontrols to all of the cloudrabbit nodes.

Messaging still seems to work, but the services keep reconnecting.

Some logs:

root@cloudcontrol1006:~# journalctl -n 100 -u nova-api
...
Feb 05 10:00:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:00:02.420 902532 ERROR oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 is unreachable: Server unexpectedly closed connection. Trying again in 0 seconds.: OSError: Server unexpectedly closed connection
Feb 05 10:00:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:00:02.555 902532 INFO oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] Reconnected to AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 34844.
Feb 05 10:00:27 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:00:27.710 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
Feb 05 10:00:27 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:00:27.713 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
Feb 05 10:01:15 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:01:15.894 985534 ERROR oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 is unreachable: EOF occurred in violation of protocol (_ssl.c:2393). Trying again in 0 seconds.: ssl.SSLEOFError: EOF occurred in violation of protocol >
Feb 05 10:01:16 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:01:16.027 985534 INFO oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] Reconnected to AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 46626.
Feb 05 10:05:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:05:02.592 902532 ERROR oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 is unreachable: EOF occurred in violation of protocol (_ssl.c:2393). Trying again in 0 seconds.: ssl.SSLEOFError: EOF occurred in violation of protocol >
Feb 05 10:05:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:05:02.750 902532 INFO oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] Reconnected to AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 52616.
Feb 05 10:07:50 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:07:50.794 902532 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
Feb 05 10:10:01 cloudcontrol1006 nova-api-wsgi[1341495]: 2024-02-05 10:10:01.279 1341495 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.344 985534 ERROR oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 is unreachable: Too many heartbeats missed. Trying again in 0 seconds.: amqp.exceptions.ConnectionForced: Too many heartbeats missed
Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.345 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.348 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.470 985534 INFO oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] Reconnected to AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 54352.
Feb 05 10:23:01 cloudcontrol1006 nova-api-wsgi[1341495]: 2024-02-05 10:23:01.282 1341495 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
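
To see whether the errors cluster on one rabbit node or on one failure reason, the log lines above can be tallied. This is a minimal sketch, not part of the original investigation: the function name `tally` and the two regular expressions are assumptions, written only against the message shapes visible in the excerpt above.

```python
import re
from collections import Counter

# Patterns for the two oslo.messaging messages in the excerpt that name a server.
# They are derived from the log lines above, not from the oslo.messaging source.
UNREACHABLE = re.compile(
    r"AMQP server on (?P<host>\S+?):\d+ is unreachable: (?P<reason>.+?)\. Trying again"
)
RECONNECTED = re.compile(r"Reconnected to AMQP server on (?P<host>\S+?):\d+")


def tally(lines):
    """Count 'unreachable' errors per (host, reason) and reconnects per host."""
    errors, reconnects = Counter(), Counter()
    for line in lines:
        match = UNREACHABLE.search(line)
        if match:
            errors[(match["host"], match["reason"])] += 1
            continue
        match = RECONNECTED.search(line)
        if match:
            reconnects[match["host"]] += 1
    return errors, reconnects
```

`tally` accepts any iterable of lines, so a small wrapper could pass it `sys.stdin` and read the output of `journalctl -u nova-api`. Lines that do not name a server, such as the "recoverable connection/channel error" messages, are ignored by this sketch.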

After getting a pcap of the connection flow, I can see many retransmitted packets and duplicate ACKs:

Screenshot from 2024-02-05 11-32-02.png (431×1 px, 173 KB)

There are several bumps in the nova-api response times, though they might not be related:

image.png (410×1 px, 89 KB)

Event Timeline

dcaro triaged this task as High priority. Feb 5 2024, 10:36 AM
dcaro created this task.

The retransmitted packets are seen by both ends (cloudcontrol and cloudrabbit), and on each end the retransmitted packet comes from the other end.

So it seems that the switch is somehow duplicating packets.

I think that the duplicated packets come from capturing on different interfaces (vlan1152/1154 and ens3f0np0), as the packets have to go through both, looking

Change 997921 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] rabbitmq: increase heartbeat timeout and number of heartbeats

https://gerrit.wikimedia.org/r/997921

Change 997921 merged by Andrew Bogott:

[operations/puppet@production] rabbitmq: increase heartbeat timeout and number of heartbeats

https://gerrit.wikimedia.org/r/997921

Unfortunately this did not seem to help :/

Now the timeout messages show '180s', but the problem is still happening quite often.