Move all openstack rabbitmq queues to quorum
Closed, Resolved (Public)

Description

On 2026-03-23 from 14:11 to 14:46 there was a rolling reboot of cloudrabbit1* as part of the regular kernel rollout. I checked the openstack (specifically oslo) and rabbitmq logs to get a better idea of what is going on: why recovery doesn't happen automatically and why we have to restart openstack services.

The timeline is as follows:

14:11 <andrew@cumin2002>  START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.eqiad.wmnet [production]
14:17 <andrew@cumin2002>  END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.eqiad.wmnet  [production]
14:27 <andrew@cumin2002>  START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.eqiad.wmnet [production]
14:33 <andrew@cumin2002>  END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.eqiad.wmnet  [production]
14:39 <andrew@cumin2002>  START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.eqiad.wmnet [production]
14:46 <andrew@cumin2002>  END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.eqiad.wmnet  [production]

Looking at the cloudrabbit1002 logs from when cloudrabbit1001 was being rebooted, we can see bursts of incoming connections, as expected when a host is rebooted:

root@cloudrabbit1002:/var/log/rabbitmq# grep '23 14:[0-4]' rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log.1 | grep 'accepting AMQP' | uniq -c --check-chars 16
     18 2026-03-23 14:00:12.060304+00:00 [info] <0.1342619.0> accepting AMQP connection [2a02:ec80:a000:204::23]:57286 -> [2a02:ec80:a000:203::23]:5671
      4 2026-03-23 14:01:25.385271+00:00 [info] <0.1365724.0> accepting AMQP connection [2a02:ec80:a000:204::25]:55012 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:03:55.664889+00:00 [info] <0.1411756.0> accepting AMQP connection [2a02:ec80:a000:202::32]:54992 -> [2a02:ec80:a000:203::23]:5671
     10 2026-03-23 14:04:00.356675+00:00 [info] <0.1413254.0> accepting AMQP connection [2a02:ec80:a000:203::18]:60614 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:08:02.410929+00:00 [info] <0.1487130.0> accepting AMQP connection [2a02:ec80:a000:202::32]:60394 -> [2a02:ec80:a000:203::23]:5671
    463 2026-03-23 14:11:23.297636+00:00 [info] <0.1548717.0> accepting AMQP connection [2a02:ec80:a000:203::18]:54804 -> [2a02:ec80:a000:203::23]:5671
      6 2026-03-23 14:12:00.923570+00:00 [info] <0.1573383.0> accepting AMQP connection [2a02:ec80:a000:201::25]:42728 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:13:25.573314+00:00 [info] <0.1596469.0> accepting AMQP connection [2a02:ec80:a000:202::30]:52516 -> [2a02:ec80:a000:203::23]:5671
      6 2026-03-23 14:14:01.615085+00:00 [info] <0.1608258.0> accepting AMQP connection [2a02:ec80:a000:202::32]:45758 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:18:22.601746+00:00 [info] <0.1686397.0> accepting AMQP connection [2a02:ec80:a000:202::32]:57546 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:25:33.591217+00:00 [info] <0.1815094.0> accepting AMQP connection [2a02:ec80:a000:202::32]:34440 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:30:57.257880+00:00 [info] <0.20828.0> accepting AMQP connection [2a02:ec80:a000:203::18]:40896 -> [2a02:ec80:a000:203::23]:5671
      2 2026-03-23 14:31:35.976055+00:00 [info] <0.34500.0> accepting AMQP connection [2a02:ec80:a000:202::32]:43968 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:33:56.126881+00:00 [info] <0.83226.0> accepting AMQP connection [2a02:ec80:a000:201::25]:39392 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:36:23.547014+00:00 [info] <0.133872.0> accepting AMQP connection [2a02:ec80:a000:202::32]:37620 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:37:58.523434+00:00 [info] <0.166768.0> accepting AMQP connection [2a02:ec80:a000:201::25]:54542 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:38:32.483300+00:00 [info] <0.178873.0> accepting AMQP connection [2a02:ec80:a000:203::18]:55420 -> [2a02:ec80:a000:203::23]:5671
    817 2026-03-23 14:40:11.812360+00:00 [info] <0.213462.0> accepting AMQP connection [2a02:ec80:a000:203::18]:46200 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:41:06.711252+00:00 [info] <0.255151.0> accepting AMQP connection [2a02:ec80:a000:203::20]:34550 -> [2a02:ec80:a000:203::23]:5671
      3 2026-03-23 14:42:04.337455+00:00 [info] <0.271831.0> accepting AMQP connection [2a02:ec80:a000:203::18]:38034 -> [2a02:ec80:a000:203::23]:5671
      2 2026-03-23 14:43:04.031787+00:00 [info] <0.290787.0> accepting AMQP connection [2a02:ec80:a000:201::25]:36284 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:45:16.771831+00:00 [info] <0.331316.0> accepting AMQP connection [2a02:ec80:a000:203::11]:47824 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:47:57.889179+00:00 [info] <0.380931.0> accepting AMQP connection [2a02:ec80:a000:203::18]:39432 -> [2a02:ec80:a000:203::23]:5671
     12 2026-03-23 14:48:31.997867+00:00 [info] <0.391333.0> accepting AMQP connection [2a02:ec80:a000:202::32]:43678 -> [2a02:ec80:a000:203::23]:5671
      1 2026-03-23 14:49:56.410986+00:00 [info] <0.417712.0> accepting AMQP connection [2a02:ec80:a000:201::25]:52762 -> [2a02:ec80:a000:203::23]:5671

There are also exceptions around the same time, notably "queue not found" for reply queues:

root@cloudrabbit1002:/var/log/rabbitmq# grep '23 14:[0-4]' rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log.1 | grep 'exception' | uniq -c --check-chars 18
    100 2026-03-23 14:11:23.468001+00:00 [error] <0.1549035.0> operation queue.declare caused a channel exception not_found: queue 'reply_16ce253548f04736bf30dd4a2fd08d8f' in vhost '/' process is stopped by supervisor
      2 2026-03-23 14:11:34.274566+00:00 [error] <0.1565308.0>     exception exit: {{badmatch,true},
   1020 2026-03-23 14:27:43.872548+00:00 [error] <0.117.0>     exception exit: {port_died,normal}
     90 2026-03-23 14:40:12.051863+00:00 [error] <0.213950.0> operation queue.declare caused a channel exception not_found: queue 'reply_81392286d45042bbb9d7642542b8be18' in vhost '/' process is stopped by supervisor
     23 2026-03-23 14:40:21.611200+00:00 [error] <0.240267.0>     exception exit: {{badmatch,true},
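
Both greps above lean on `uniq -c --check-chars N`: the first 16 characters of a log line are the timestamp down to the minute, and 18 characters give ten-second buckets. A minimal sketch of the same bucketing in awk, reading log lines on stdin; `bucket_count` is a made-up helper name, and it assumes every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the rabbitmq log:

```shell
# Count log lines per timestamp prefix of the given width.
# bucket_count is a hypothetical helper, not part of any rabbitmq tooling.
bucket_count() {
    local width="$1"
    awk -v w="$width" '
        { key = substr($0, 1, w) }            # e.g. "2026-03-23 14:11" for w=16
        !(key in count) { order[++n] = key }  # remember first-seen order
        { count[key]++ }
        END { for (i = 1; i <= n; i++) printf "%7d %s\n", count[order[i]], order[i] }
    '
}
```

Usage would be along the lines of `grep 'accepting AMQP' <logfile> | bucket_count 16`. Unlike `uniq`, this also merges lines with the same prefix that are not adjacent, which makes no difference on a time-ordered log.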

On the openstack side oslo.messaging reports not finding queues for replies, e.g.:

Mar 23 13:09:28 cloudcontrol1011 heat-engine[1517603]: 2026-03-23 13:09:28.068 1517603 WARNING oslo_messaging._drivers.amqpdriver [None req-740e2daa-8d4b-4c17-81a1-d8f34062742c qu-jijhhrm5us-1-lhtmprd7zm6e-kube-minion-wem4a6ihkhjs - - - - -] reply_93b80fa3145b47b4ba6ee0617f610e8a doesn't exist, drop reply to 447d7010eaf14e3db56aeb97c5000f36: oslo_messaging.exceptions.MessageUndeliverable
Mar 23 13:09:28 cloudcontrol1011 heat-engine[1517603]: 2026-03-23 13:09:28.069 1517603 ERROR oslo_messaging._drivers.amqpdriver [None req-740e2daa-8d4b-4c17-81a1-d8f34062742c qu-jijhhrm5us-1-lhtmprd7zm6e-kube-minion-wem4a6ihkhjs - - - - -] The reply 447d7010eaf14e3db56aeb97c5000f36 failed to send after 60 seconds due to a missing queue (reply_93b80fa3145b47b4ba6ee0617f610e8a). Abandoning...: oslo_messaging.exceptions.MessageUndeliverable
Mar 23 13:09:36 cloudcontrol1011 uwsgi_python3[1517175]: <frozen importlib._bootstrap>: 2026-03-23 13:09:36.447 1517175 ERROR heat.common.wsgi [None req-2f4c4250-c717-4617-96bf-10c44d6a4686 zu-uaw67kvw7p-0-puxcrkqo5rbp-kube-master-uhvhzwjxzxs4 - - - - -] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 511666be6e9f40f4a4146ecd576013f9: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 511666be6e9f40f4a4146ecd576013f9
Mar 23 13:09:54 cloudcontrol1011 uwsgi_python3[1517170]: <frozen importlib._bootstrap>: 2026-03-23 13:09:54.814 1517170 ERROR heat.common.wsgi [None req-8faa6c0b-7321-471b-ace8-db4b47eca142 zu-qups5sh427-5-bg2uuitv3jwt-kube-minion-unvtbsm5xdao - - - - -] Unexpected error occurred serving API: Timed out waiting for a reply to message ID 2164d5d6c41b4dd88d7012603af7011d: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 2164d5d6c41b4dd88d7012603af7011d

The above makes me think that when a rabbitmq node goes down, its openstack transient queues (reply, fanout) go down with it, which makes sense because they are declared classic and not quorum by default.

I checked what upstream does in kolla-ansible and they now enable quorum queues for all types unconditionally. For reference, upstream added transient quorum queues to kolla-ansible as an option.

I think we should be doing the same: switch all queues to quorum, in other words:

use_queue_manager = true
rabbit_transient_quorum_queue = true

And possibly rabbit_stream_fanout = true too
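
To double-check that a rendered service config actually carries these options, one can look inside its `[oslo_messaging_rabbit]` section, which is where oslo.messaging reads its rabbit driver settings. A minimal sketch; `oslo_rabbit_opts` is a made-up helper name and the config path in the usage line is only an example:

```shell
# Print the key=value pairs found in the [oslo_messaging_rabbit] section
# of an INI-style config file, skipping comments and blank lines.
# oslo_rabbit_opts is a hypothetical helper, not an openstack tool.
oslo_rabbit_opts() {
    awk '
        # Any section header toggles whether we are inside the wanted section.
        /^[ \t]*\[/ { in_section = ($0 ~ /^[ \t]*\[oslo_messaging_rabbit\][ \t]*$/); next }
        in_section && /^[ \t]*[^#; \t]/ {
            line = $0
            sub(/^[ \t]+/, "", line)          # trim leading whitespace
            sub(/[ \t]+$/, "", line)          # trim trailing whitespace
            sub(/[ \t]*=[ \t]*/, "=", line)   # normalise the first "a = b" to "a=b"
            print line
        }
    ' "$1"
}
```

For example `oslo_rabbit_opts /etc/nova/nova.conf | grep -E 'use_queue_manager|rabbit_transient_quorum_queue|rabbit_stream_fanout'` would show which of the three are set for that service.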

Event Timeline

because they are declared classic and not quorum by default.

In theory, openstack is creating these queues as quorum:

[oslo_messaging_rabbit]
rabbit_quorum_queue=true
rabbit_retry_interval=1
rabbit_retry_backoff=2

So we may need to spend some time in the oslo code and understand why that flag isn't being honored. Or we could just change rabbit as you suggest and not get too curious about what oslo is doing.

The oslo setting I mentioned, rabbit_transient_quorum_queue, refers to the transient queues (reply, fanout) that openstack manages on rabbit, as opposed to the "service" queues, which are indeed already quorum.

root@cloudrabbit2001-dev:~# rabbitmqctl list_queues name type durable  --vhost /  | grep -c quorum
221
root@cloudrabbit2001-dev:~# rabbitmqctl list_queues name type durable  --vhost /  | grep -cv quorum
286
root@cloudrabbit2001-dev:~# rabbitmqctl list_queues name type durable  --vhost /  | wc -l
507
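
Note that `grep -cv quorum` and `wc -l` above also count the preamble lines rabbitmqctl prints before the queue list (the "Timeout", "Listing queues" and column header lines), so the non-quorum figure is slightly inflated. A minimal sketch of a per-type tally that skips those lines; `queue_types` is a made-up helper name and it assumes the tab-separated name/type/durable columns that `list_queues` prints:

```shell
# Tally queues per type from `rabbitmqctl list_queues name type durable`
# output on stdin. Lines without exactly three tab-separated fields
# (the preamble) and the column header are ignored.
# queue_types is a hypothetical helper.
queue_types() {
    awk -F '\t' '
        NF == 3 && $2 != "type" { count[$2]++ }
        END { for (t in count) print t, count[t] }
    ' | sort
}
```

Usage would be `rabbitmqctl list_queues name type durable --vhost / | queue_types`.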

Change #1261374 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] openstack: enable rabbit transient quorum queues

https://gerrit.wikimedia.org/r/1261374

Change #1264557 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] openstack: enable trove-guestagent rabbit transient quorum queues

https://gerrit.wikimedia.org/r/1264557

Change #1261374 merged by Filippo Giunchedi:

[operations/puppet@production] openstack: enable rabbit transient quorum queues

https://gerrit.wikimedia.org/r/1261374

Change #1264577 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] openstack: set oslo lock path where missing

https://gerrit.wikimedia.org/r/1264577

Change #1264577 merged by Filippo Giunchedi:

[operations/puppet@production] openstack: set oslo lock path where missing

https://gerrit.wikimedia.org/r/1264577

This took a bunch of tries today and despite my best attempts to mess with rabbit and oslo, openstack reacted reasonably well IMHO.

We're down to the trove-guestagent queues still being classic, which is expected (https://gerrit.wikimedia.org/r/1264557), and a few neutron queues that haven't moved to quorum, which I'm investigating:

l3_agent_fanout_826b02fe208241e88bd6f56043d7180a        classic false
l3_agent_fanout_f7c55972db0448a19c09964cf6059e4b        classic false
q-agent-notifier-network-update_fanout_76031c75f1b24dc79bd5861b759f9cb7 classic false
q-agent-notifier-network-update_fanout_ac2886128dee4b9597cc511928284670 classic false
reply_10b334cd69f04b93ab5261279ea872b9  classic false
reply_ee469a8030a74c0ba78819e5b7e43739  classic false

Mentioned in SAL (#wikimedia-operations) [2026-03-30T11:51:26Z] <godog> bounce neutron-l3-agent on cloudnet1005 - T421054

The final bits of flipping neutron-l3-agent to quorum queues will be done tomorrow at 7 UTC within a scheduled window. The actual work to be performed:

# stop neutron-l3-agent on both cloudnet, disable puppet
# on e.g. cloudrabbit1002
rabbitmqadmin delete exchange name='q-agent-notifier-network-update_fanout'

rabbitmqctl delete_queue q-agent-notifier-network-update_fanout_ac2886128dee4b9597cc511928284670
rabbitmqctl delete_queue reply_ee469a8030a74c0ba78819e5b7e43739
# enable puppet on one cloudnet, restart neutron-l3-agent and journalctl -u neutron-l3-agent to validate startup
# do the same on the other cloudnet
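
To avoid copy-pasting queue names by hand, the delete commands for whatever is still classic can be generated from the queue listing and reviewed before running anything. A minimal dry-run sketch; `classic_delete_cmds` is a made-up helper name, it only prints commands, and it assumes tab-separated name/type/durable columns and queue names without shell-special characters:

```shell
# Print (do not run) a delete_queue command for every classic queue found
# in `rabbitmqctl list_queues name type durable` output on stdin.
# classic_delete_cmds is a hypothetical helper.
classic_delete_cmds() {
    awk -F '\t' 'NF == 3 && $2 == "classic" { print "rabbitmqctl delete_queue " $1 }'
}
```

Usage would be `rabbitmqctl list_queues name type durable --vhost / | classic_delete_cmds`, with the consuming service stopped first as in the steps above. The fanout exchange still needs to be deleted separately.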

OK, all queues but trove-guestagent are now using quorum/durable; I'll be looking into deploying that next.

I'll also be looking into what is causing the sharp decline in free disk space on the rabbit nodes:

2026-03-31-093443_3092x828_scrot.png (3092×828 px, 68 KB)

Note that despite the graph we have ~300G free on the VG, plus the vg0/srv LV is not really used and we can reclaim its space in case it is needed:

root@cloudrabbit1002:~# vgs
  VG  #PV #LV #SN Attr   VSize  VFree  
  vg0   1   3   0 wz--n- <1.75t 357.54g
root@cloudrabbit1002:~# lvs
  LV   VG  Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root vg0 -wi-ao----  74.50g                                                    
  srv  vg0 -wi-ao----   1.32t                                                    
  swap vg0 -wi-ao---- 976.00m

Change #1264557 merged by Filippo Giunchedi:

[operations/puppet@production] openstack: enable trove-guestagent rabbit transient quorum queues

https://gerrit.wikimedia.org/r/1264557

Note that despite the graph we have ~300G free on the VG, plus the vg0/srv LV is not really used and we can reclaim its space in case it is needed

Is it a leftover from an old partman recipe or something?

Not as far as I'm aware. The standard partman recipes leave unallocated space on vg0 on purpose, though I don't know why /srv is essentially empty.

Mentioned in SAL (#wikimedia-cloud) [2026-04-01T08:15:41Z] <godog> extend cloudrabbit1* root with an additional 100G - T421054

I did some digging and the space is used by the raft quorum logs, which go from a shared WAL into per-queue segments. The segments are single files on the filesystem and are not actually deleted until the segment is full: https://www.rabbitmq.com/docs/quorum-queues#resource-use

Indeed the space now looks like a sawtooth pattern as segment files are deleted:

2026-04-01-100953_1264x651_scrot.png (1264×651 px, 39 KB)
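
To see how much of the space sits in segment files versus the WAL at a given moment, file sizes can be summed per extension. A minimal sketch; `ext_usage` is a made-up helper name, and both the `.segment`/`.wal` extensions and the data directory in the usage line are assumptions about the rabbit on-disk layout rather than something verified in this task:

```shell
# Sum the on-disk size of all files with a given extension below a directory.
# ext_usage is a hypothetical helper; it relies on GNU find's -printf.
ext_usage() {
    local dir="$1" ext="$2"
    find "$dir" -type f -name "*.${ext}" -printf '%s\n' |
        awk '{ total += $1; n++ } END { printf "%d files, %.0f bytes\n", n, total }'
}
```

Usage would be something like `ext_usage /var/lib/rabbitmq segment` and `ext_usage /var/lib/rabbitmq wal`, run a few times across a sawtooth cycle.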

Just to be on the safe side I've extended vg0/root with an additional 100G on cloudrabbit1*

# lvextend --resizefs --size +100G vg0/root
  File system ext4 found on vg0/root mounted at /.
  Size of logical volume vg0/root changed from 74.50 GiB (19073 extents) to 174.50 GiB (44673 extents).
  Extending file system ext4 to 174.50 GiB (187372142592 bytes) on vg0/root...
resize2fs /dev/vg0/root
resize2fs 1.47.2 (1-Jan-2025)
Filesystem at /dev/vg0/root is mounted on /; on-line resizing required
old_desc_blocks = 10, new_desc_blocks = 22
The filesystem on /dev/vg0/root is now 45745152 (4k) blocks long.

resize2fs done
  Extended file system ext4 on vg0/root.
  Logical volume vg0/root successfully resized.

I will keep an eye on the space over the next few days. The extended root is non-standard but will do for now; depending on disk usage we might consider moving the rabbit data to /srv, which is designed with more space in mind.

taavi triaged this task as High priority. Apr 1 2026, 1:57 PM

Mentioned in SAL (#wikimedia-cloud) [2026-04-03T10:18:55Z] <godog> move codfw neutron l3-agent queues to quorum - T421054

FS utilization has stabilized as segments are reclaimed:

2026-04-07-090801_1889x529_scrot.png (1889×529 px, 47 KB)

fgiunchedi claimed this task.

This is done in eqiad and codfw

root@cloudrabbit1002:~# rabbitmqctl list_queues name type durable  --vhost /  | grep -v quorum
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name	type	durable
root@cloudrabbit2001-dev:~# rabbitmqctl list_queues name type durable  --vhost /  | grep -v quorum
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name	type	durable

Disk usage grows by ~4GB between cleanup cycles, so I think we're fine even with the standard LV size for /.