We are seeing connectivity issues from all of the cloudcontrols to all of the cloudrabbit nodes.
This seems to still work, but the services keep reconnecting.
Some logs:
root@cloudcontrol1006:~# journalctl -n 100 -u nova-api ... Feb 05 10:00:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:00:02.420 902532 ERROR oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 is unreachable: Server unexpectedly closed connection. Trying again in 0 seconds.: OSError: Server unexpectedly closed connection Feb 05 10:00:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:00:02.555 902532 INFO oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] Reconnected to AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 34844. Feb 05 10:00:27 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:00:27.710 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393) Feb 05 10:00:27 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:00:27.713 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393) Feb 05 10:01:15 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:01:15.894 985534 ERROR oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 is unreachable: EOF occurred in violation of protocol (_ssl.c:2393). Trying again in 0 seconds.: ssl.SSLEOFError: EOF occurred in violation of protocol > Feb 05 10:01:16 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:01:16.027 985534 INFO oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] Reconnected to AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 46626. Feb 05 10:05:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:05:02.592 902532 ERROR oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 is unreachable: EOF occurred in violation of protocol (_ssl.c:2393). Trying again in 0 seconds.: ssl.SSLEOFError: EOF occurred in violation of protocol > Feb 05 10:05:02 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:05:02.750 902532 INFO oslo.messaging._drivers.impl_rabbit [-] [01586216-7dd4-4b9a-b924-d706dcf3a8a5] Reconnected to AMQP server on rabbitmq03.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 52616. Feb 05 10:07:50 cloudcontrol1006 nova-api-wsgi[902532]: 2024-02-05 10:07:50.794 902532 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393) Feb 05 10:10:01 cloudcontrol1006 nova-api-wsgi[1341495]: 2024-02-05 10:10:01.279 1341495 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393) Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.344 985534 ERROR oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 is unreachable: Too many heartbeats missed. Trying again in 0 seconds.: amqp.exceptions.ConnectionForced: Too many heartbeats missed Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.345 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393) Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.348 985534 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393) Feb 05 10:21:31 cloudcontrol1006 nova-api-wsgi[985534]: 2024-02-05 10:21:31.470 985534 INFO oslo.messaging._drivers.impl_rabbit [-] [8ea3e25f-35ca-42a1-83a3-31c95bf8e76d] Reconnected to AMQP server on rabbitmq01.eqiad1.wikimediacloud.org:5671 via [amqp] client with port 54352. Feb 05 10:23:01 cloudcontrol1006 nova-api-wsgi[1341495]: 2024-02-05 10:23:01.282 1341495 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: EOF occurred in violation of protocol (_ssl.c:2393)
After getting a pcap of the connection flow, I can see many retransmitted packages and duplicated ACKs:
There's several bumps on the nova-api response times, though might not be related: