On 2026-03-24 from 14:11 to 14:46 there was a rolling reboot of cloudrabbit1* as part of the regular kernel rollout. I checked the openstack (oslo, specifically) and rabbitmq logs to get a better idea of what's going on, why recovery doesn't happen automatically, and why we have to restart openstack services.
The timeline is as follows:
14:11 <andrew@cumin2002> START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.eqiad.wmnet [production]
14:17 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.eqiad.wmnet [production]
14:27 <andrew@cumin2002> START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.eqiad.wmnet [production]
14:33 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.eqiad.wmnet [production]
14:39 <andrew@cumin2002> START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.eqiad.wmnet [production]
14:46 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.eqiad.wmnet [production]
Looking at the cloudrabbit1002 logs from the reboot window, we can see bursts of incoming connections each time one of the other hosts is rebooted, as expected:
root@cloudrabbit1002:/var/log/rabbitmq# grep '23 14:[0-4]' rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log.1 | grep 'accepting AMQP' | uniq -c --check-chars 16
18 2026-03-23 14:00:12.060304+00:00 [info] <0.1342619.0> accepting AMQP connection [2a02:ec80:a000:204::23]:57286 -> [2a02:ec80:a000:203::23]:5671
4 2026-03-23 14:01:25.385271+00:00 [info] <0.1365724.0> accepting AMQP connection [2a02:ec80:a000:204::25]:55012 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:03:55.664889+00:00 [info] <0.1411756.0> accepting AMQP connection [2a02:ec80:a000:202::32]:54992 -> [2a02:ec80:a000:203::23]:5671
10 2026-03-23 14:04:00.356675+00:00 [info] <0.1413254.0> accepting AMQP connection [2a02:ec80:a000:203::18]:60614 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:08:02.410929+00:00 [info] <0.1487130.0> accepting AMQP connection [2a02:ec80:a000:202::32]:60394 -> [2a02:ec80:a000:203::23]:5671
463 2026-03-23 14:11:23.297636+00:00 [info] <0.1548717.0> accepting AMQP connection [2a02:ec80:a000:203::18]:54804 -> [2a02:ec80:a000:203::23]:5671
6 2026-03-23 14:12:00.923570+00:00 [info] <0.1573383.0> accepting AMQP connection [2a02:ec80:a000:201::25]:42728 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:13:25.573314+00:00 [info] <0.1596469.0> accepting AMQP connection [2a02:ec80:a000:202::30]:52516 -> [2a02:ec80:a000:203::23]:5671
6 2026-03-23 14:14:01.615085+00:00 [info] <0.1608258.0> accepting AMQP connection [2a02:ec80:a000:202::32]:45758 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:18:22.601746+00:00 [info] <0.1686397.0> accepting AMQP connection [2a02:ec80:a000:202::32]:57546 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:25:33.591217+00:00 [info] <0.1815094.0> accepting AMQP connection [2a02:ec80:a000:202::32]:34440 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:30:57.257880+00:00 [info] <0.20828.0> accepting AMQP connection [2a02:ec80:a000:203::18]:40896 -> [2a02:ec80:a000:203::23]:5671
2 2026-03-23 14:31:35.976055+00:00 [info] <0.34500.0> accepting AMQP connection [2a02:ec80:a000:202::32]:43968 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:33:56.126881+00:00 [info] <0.83226.0> accepting AMQP connection [2a02:ec80:a000:201::25]:39392 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:36:23.547014+00:00 [info] <0.133872.0> accepting AMQP connection [2a02:ec80:a000:202::32]:37620 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:37:58.523434+00:00 [info] <0.166768.0> accepting AMQP connection [2a02:ec80:a000:201::25]:54542 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:38:32.483300+00:00 [info] <0.178873.0> accepting AMQP connection [2a02:ec80:a000:203::18]:55420 -> [2a02:ec80:a000:203::23]:5671
817 2026-03-23 14:40:11.812360+00:00 [info] <0.213462.0> accepting AMQP connection [2a02:ec80:a000:203::18]:46200 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:41:06.711252+00:00 [info] <0.255151.0> accepting AMQP connection [2a02:ec80:a000:203::20]:34550 -> [2a02:ec80:a000:203::23]:5671
3 2026-03-23 14:42:04.337455+00:00 [info] <0.271831.0> accepting AMQP connection [2a02:ec80:a000:203::18]:38034 -> [2a02:ec80:a000:203::23]:5671
2 2026-03-23 14:43:04.031787+00:00 [info] <0.290787.0> accepting AMQP connection [2a02:ec80:a000:201::25]:36284 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:45:16.771831+00:00 [info] <0.331316.0> accepting AMQP connection [2a02:ec80:a000:203::11]:47824 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:47:57.889179+00:00 [info] <0.380931.0> accepting AMQP connection [2a02:ec80:a000:203::18]:39432 -> [2a02:ec80:a000:203::23]:5671
12 2026-03-23 14:48:31.997867+00:00 [info] <0.391333.0> accepting AMQP connection [2a02:ec80:a000:202::32]:43678 -> [2a02:ec80:a000:203::23]:5671
1 2026-03-23 14:49:56.410986+00:00 [info] <0.417712.0> accepting AMQP connection [2a02:ec80:a000:201::25]:52762 -> [2a02:ec80:a000:203::23]:5671
And exceptions around the same time, notably "queue not found" for reply queues:
root@cloudrabbit1002:/var/log/rabbitmq# grep '23 14:[0-4]' rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log.1 | grep 'exception' | uniq -c --check-chars 18
100 2026-03-23 14:11:23.468001+00:00 [error] <0.1549035.0> operation queue.declare caused a channel exception not_found: queue 'reply_16ce253548f04736bf30dd4a2fd08d8f' in vhost '/' process is stopped by supervisor
2 2026-03-23 14:11:34.274566+00:00 [error] <0.1565308.0> exception exit: {{badmatch,true},
1020 2026-03-23 14:27:43.872548+00:00 [error] <0.117.0> exception exit: {port_died,normal}
90 2026-03-23 14:40:12.051863+00:00 [error] <0.213950.0> operation queue.declare caused a channel exception not_found: queue 'reply_81392286d45042bbb9d7642542b8be18' in vhost '/' process is stopped by supervisor
23 2026-03-23 14:40:21.611200+00:00 [error] <0.240267.0> exception exit: {{badmatch,true},
On the openstack side, oslo.messaging reports not finding queues for replies, e.g.:
Mar 23 13:09:28 cloudcontrol1011 heat-engine[1517603]: 2026-03-23 13:09:28.068 1517603 WARNING oslo_messaging._drivers.amqpdriver [None req-740e2daa-8d4b-4c17-81a1-d8f34062742c qu-jijhhrm5us-1-lhtmprd7zm6e-kube-minion-wem4a6ihkhjs - - - - -] reply_93b80fa3145b47b4ba6ee0617f610e8a doesn't exist, drop reply to 447d7010eaf14e3db56aeb97c5000f36: oslo_messaging.exceptions.MessageUndeliverable
Mar 23 13:09:28 cloudcontrol1011 heat-engine[1517603]: 2026-03-23 13:09:28.069 1517603 ERROR oslo_messaging._drivers.amqpdriver [None req-740e2daa-8d4b-4c17-81a1-d8f34062742c qu-jijhhrm5us-1-lhtmprd7zm6e-kube-minion-wem4a6ihkhjs - - - - -] The reply 447d7010eaf14e3db56aeb97c5000f36 failed to send after 60 seconds due to a missing queue (reply_93b80fa3145b47b4ba6ee0617f610e8a). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
Mar 23 13:09:36 cloudcontrol1011 uwsgi_python3[1517175]: <frozen importlib._bootstrap>: 2026-03-23 13:09:36.447 1517175 ERROR heat.common.wsgi [None req-2f4c4250-c717-4617-96bf-10c44d6a4686 zu-uaw67kvw7p-0-puxcrkqo5rbp-kube-master-uhvhzwjx zxs4 - - - - -] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 511666be6e9f40f4a4146ecd576013f9: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 511666be6e9f40f4a4146ecd576013f9
Mar 23 13:09:54 cloudcontrol1011 uwsgi_python3[1517170]: <frozen importlib._bootstrap>: 2026-03-23 13:09:54.814 1517170 ERROR heat.common.wsgi [None req-8faa6c0b-7321-471b-ace8-db4b47eca142 zu-qups5sh427-5-bg2uuitv3jwt-kube-minion-unvtbsm5 xdao - - - - -] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 2164d5d6c41b4dd88d7012603af7011d: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 2164d5d6c41b4dd88d7012603af7011d
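To check whether the reply queues that rabbitmq reports as not_found are the same ones oslo.messaging gives up on, the queue names can be pulled out of both logs and intersected. This is only a rough sketch: the function names are made up for illustration, and it assumes reply queue names always look like reply_ followed by 32 hex characters, as they do in the excerpts above.

```python
import re
from collections import Counter

# Assumption: reply queue names are "reply_" plus 32 lowercase hex characters,
# which is what the log excerpts show.
REPLY_QUEUE = re.compile(r"reply_[0-9a-f]{32}")


def reply_queues(lines):
    """Count how often each reply queue name appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(REPLY_QUEUE.findall(line))
    return counts


def missing_on_both_sides(rabbit_lines, oslo_lines):
    """Return reply queues that rabbitmq reports as not_found and that
    also show up in the oslo.messaging log lines, sorted by name."""
    rabbit = reply_queues(line for line in rabbit_lines if "not_found" in line)
    oslo = reply_queues(oslo_lines)
    return sorted(set(rabbit) & set(oslo))
```

Feeding it the full rabbitmq log and the journal output of the openstack services (rather than the uniq-collapsed excerpts above) would show how many of the failed replies line up with queues lost during a reboot.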
The above makes me think that when a rabbitmq node goes down, its openstack transient queues (reply, fanout) go down with it, which makes sense because by default they are declared as classic queues, not quorum queues.
I checked what upstream does in kolla-ansible, and they now enable quorum queues for all types unconditionally. For reference, here upstream added transient quorum queues to kolla-ansible as an option.
I think we should be doing the same: switch all queues to quorum, in other words:
use_queue_manager = true
rabbit_transient_quorum_queue = true
And possibly rabbit_stream_fanout = true too.
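As a rough way to audit which service configs already carry these settings, something like the following could be run against each openstack config file. It is a sketch under assumptions: the helper name is hypothetical, the options are assumed to sit in the [oslo_messaging_rabbit] section, and Python's configparser is used as a stand-in for oslo.config, so it may not cope with every oslo-style file.

```python
import configparser

# The two options proposed above. rabbit_stream_fanout is left out because
# it is only a "possibly"; add it to the tuple to check for it as well.
WANTED = ("use_queue_manager", "rabbit_transient_quorum_queue")


def missing_quorum_options(conf_text, section="oslo_messaging_rabbit"):
    """Return the wanted options that are not set to true in conf_text, sorted."""
    # strict=False tolerates repeated options; interpolation=None keeps any
    # "%" characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    missing = []
    for option in WANTED:
        # fallback covers both a missing section and a missing option
        value = parser.get(section, option, fallback="false")
        if value.strip().lower() != "true":
            missing.append(option)
    return sorted(missing)
```

An empty list means the file already sets both options; anything else names what still needs to be added before the service is restarted.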


