
Increased openstack latency and rabbitmq rolling restarts on certificate update
Closed, ResolvedPublic

Authored By: fgiunchedi, Feb 26 2026, 7:52 AM
Referenced Files
F72444296: 2026-02-27-090616_3711x1346_scrot.png
Feb 27 2026, 8:17 AM
F72444327: 2026-02-27-091017_3456x1281_scrot.png
Feb 27 2026, 8:17 AM
F72436286: image.png
Feb 26 2026, 10:17 AM
F72436276: image.png
Feb 26 2026, 10:17 AM
F72435393: 2026-02-26-084953_1255x1545_scrot.png
Feb 26 2026, 7:52 AM

Description

Yesterday (2026-02-25) at ~20:30 UTC the openstack API started experiencing increased latency, e.g.:

2026-02-26-084953_1255x1545_scrot.png (1,255×1,545 px, 218 KB)

Recovery began once @Andrew rebuilt the rabbitmq cluster and restarted the openstack services (log entries below are newest first):

03:46	<andrew@cloudcumin1001>	END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services	[admin]
03:31	<andrew@cloudcumin1001>	START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services	[admin]
03:30	<andrew@cloudcumin1001>	END (PASS) - Cookbook wmcs.openstack.rabbitmq.rebuild_rabbit_cluster (exit_code=0) on deployment eqiad1	[admin]
03:27	<andrew@cloudcumin1001>	START - Cookbook wmcs.openstack.rabbitmq.rebuild_rabbit_cluster on deployment eqiad1	[admin]
03:26	<andrewbogott>	rebuilding the rabbitmq cluster in eqiad1; many failed messages

Opening a task to track the investigation into what happened.

A few details I could find so far:

  • Rabbitmq servers were roll-restarted by puppet due to cert refresh, in this order:
2026-02-25T20:30:15.072499+00:00 cloudrabbit1003 puppet-agent[2944687]: (/Stage[main]/Rabbitmq/Service[rabbitmq-server]) Triggered 'refresh' from 1 event
2026-02-25T20:34:56.393823+00:00 cloudrabbit1001 puppet-agent[3012200]: (/Stage[main]/Rabbitmq/Service[rabbitmq-server]) Triggered 'refresh' from 1 event
2026-02-25T21:02:58.052779+00:00 cloudrabbit1002 puppet-agent[2979511]: (/Stage[main]/Rabbitmq/Service[rabbitmq-server]) Triggered 'refresh' from 1 event

i.e. 1003 and 1001 restarted ~4m40s apart, and 1002 followed ~28m after 1001

  • My understanding is that newer rabbitmq versions can pick up new certs without a restart. That would work and we should do it, but ultimately it is a band-aid, because it seems we can't safely roll-restart rabbitmq (did we use to be able to?)
  • This issue (rabbitmq restarts -> openstack api high latency) started showing up on Dec 20th 2025, and since then has happened once or twice a month depending on the exact certificate expiration, using sum(rate(rabbitmq_channel_messages_unroutable_returned_total[5m])) > 0 as the signal: https://grafana.wikimedia.org/goto/EGxtl1ODg?orgId=1
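The gaps between the three puppet-triggered restarts in the first bullet can be double-checked with a few lines of Python. This is only a small sketch: the timestamps are copied from the puppet-agent log lines quoted above, and the helper name `restart_gaps` is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the puppet-agent 'refresh' log lines in this task.
restarts = [
    ("cloudrabbit1003", "2026-02-25T20:30:15.072499+00:00"),
    ("cloudrabbit1001", "2026-02-25T20:34:56.393823+00:00"),
    ("cloudrabbit1002", "2026-02-25T21:02:58.052779+00:00"),
]


def restart_gaps(events):
    """Return (previous_host, next_host, gap) for consecutive restarts."""
    # Sort by timestamp so the result does not depend on input order.
    parsed = sorted((datetime.fromisoformat(ts), host) for host, ts in events)
    return [
        (prev_host, next_host, next_ts - prev_ts)
        for (prev_ts, prev_host), (next_ts, next_host) in zip(parsed, parsed[1:])
    ]


for prev_host, next_host, gap in restart_gaps(restarts):
    print(f"{prev_host} -> {next_host}: {gap}")
```

The first gap comes out at a bit under 5 minutes and the second at about 28 minutes, so all three brokers were restarted within roughly half an hour with no coordination between them.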

2026-02-27-091017_3456x1281_scrot.png (3,456×1,281 px, 123 KB)

The OpenstackAPIResponse alert has fired consistently due to this issue. Note that the alert also fired on Dec 3rd, and there was indeed some elevated API latency at the time, though I'm not convinced it was due to this issue.

2026-02-27-090616_3711x1346_scrot.png (3,711×1,346 px, 151 KB)

@Andrew what do you think of the above? The timeline lines up with the latest rounds of Debian upgrades and/or Openstack upgrades in T406516

Event Timeline

On the rabbitmq nodes, there's a spike in the number of connections at 20:30 (measured here as bytes logged per minute):

root@cloudrabbit1001:~# for i in $(seq 10 59); do echo -n "20:$i:XX - "; grep "20:$i" /var/log/rabbitmq/rabbit@$HOSTNAME.private.eqiad.wikimedia.cloud.log.1 | wc -c; done; for i in $(seq 1 9); do echo -n "21:0$i:XX - "; grep "21:0$i" /var/log/rabbitmq/rabbit@$HOSTNAME.private.eqiad.wikimedia.cloud.log.1 | wc -c; done
20:10:XX - 690
20:11:XX - 0
20:12:XX - 0
20:13:XX - 1258
20:14:XX - 0
20:15:XX - 0
20:16:XX - 613
20:17:XX - 0
20:18:XX - 344
20:19:XX - 1198
20:20:XX - 0
20:21:XX - 261
20:22:XX - 0
20:23:XX - 1258
20:24:XX - 605
20:25:XX - 0
20:26:XX - 0
20:27:XX - 344
20:28:XX - 0
20:29:XX - 0
20:30:XX - 1356589   <- reboot of rabbit on cloudrabbit1003
20:31:XX - 40863
20:32:XX - 3675
20:33:XX - 3840
20:34:XX - 2877700   <- reboot of rabbit on cloudrabbit1001
20:35:XX - 23041
20:36:XX - 846
20:37:XX - 1003
20:38:XX - 0
20:39:XX - 465
20:40:XX - 2263
20:41:XX - 604
20:42:XX - 0
20:43:XX - 256
20:44:XX - 608
20:45:XX - 0
20:46:XX - 2056
20:47:XX - 0
20:48:XX - 0
20:49:XX - 0
20:50:XX - 0
20:51:XX - 0
20:52:XX - 0
20:53:XX - 0
20:54:XX - 867
20:55:XX - 599
20:56:XX - 0
20:57:XX - 0
20:58:XX - 0
20:59:XX - 619
21:01:XX - 1846
21:02:XX - 1622772   <- reboot of rabbit on cloudrabbit1002
21:03:XX - 19772
21:04:XX - 14480
21:05:XX - 590
21:06:XX - 251
21:07:XX - 1611
21:08:XX - 1640
21:09:XX - 686
root@cloudrabbit1002:~# for i in $(seq 10 59); do echo -n "20:$i:XX - "; grep "20:$i" /var/log/rabbitmq/rabbit@$HOSTNAME.private.eqiad.wikimedia.cloud.log.1 | wc -c; done; for i in $(seq 1 9); do echo -n "21:0$i:XX - "; grep "21:0$i" /var/log/rabbitmq/rabbit@$HOSTNAME.private.eqiad.wikimedia.cloud.log.1 | wc -c; done
20:10:XX - 345
20:11:XX - 0
20:12:XX - 0
20:13:XX - 0
20:14:XX - 0
20:15:XX - 0
20:16:XX - 0
20:17:XX - 0
20:18:XX - 0
20:19:XX - 599
20:20:XX - 0
20:21:XX - 0
20:22:XX - 0
20:23:XX - 2333
20:24:XX - 1570
20:25:XX - 0
20:26:XX - 0
20:27:XX - 955
20:28:XX - 0
20:29:XX - 0
20:30:XX - 1194032   <- reboot of rabbit on cloudrabbit1003
20:31:XX - 84164
20:32:XX - 9590
20:33:XX - 4225
20:34:XX - 1008097   <- reboot of rabbit on cloudrabbit1001
20:35:XX - 7686
20:36:XX - 1982
20:37:XX - 1719
20:38:XX - 4178
20:39:XX - 6157
20:40:XX - 4952
20:41:XX - 2414
20:42:XX - 6519
20:43:XX - 1731
20:44:XX - 628
20:45:XX - 617
20:46:XX - 955
20:47:XX - 1991
20:48:XX - 1772
20:49:XX - 3314
20:50:XX - 1909
20:51:XX - 628
20:52:XX - 1024
20:53:XX - 5136
20:54:XX - 893
20:55:XX - 602
20:56:XX - 0
20:57:XX - 0
20:58:XX - 617
20:59:XX - 0
21:01:XX - 2380
21:02:XX - 3037871   <- reboot of rabbit on cloudrabbit1002
21:03:XX - 47451
21:04:XX - 36003
21:05:XX - 0
21:06:XX - 0
21:07:XX - 0
21:08:XX - 0
21:09:XX - 0
root@cloudrabbit1003:~# for i in $(seq 10 59); do echo -n "20:$i:XX - "; grep "20:$i" /var/log/rabbitmq/rabbit@$HOSTNAME.private.eqiad.wikimedia.cloud.log.1 | wc -c; done; for i in $(seq 1 9); do echo -n "21:0$i:XX - "; grep "21:0$i" /var/log/rabbitmq/rabbit@$HOSTNAME.private.eqiad.wikimedia.cloud.log.1 | wc -c; done
20:10:XX - 0
20:11:XX - 0
20:12:XX - 0
20:13:XX - 0
20:14:XX - 0
20:15:XX - 0
20:16:XX - 0
20:17:XX - 0
20:18:XX - 0
20:19:XX - 599
20:20:XX - 0
20:21:XX - 344
20:22:XX - 0
20:23:XX - 290
20:24:XX - 0
20:25:XX - 0
20:26:XX - 0
20:27:XX - 0
20:28:XX - 0
20:29:XX - 0
20:30:XX - 1558060   <- reboot of rabbit on cloudrabbit1003
20:31:XX - 9257
20:32:XX - 2299
20:33:XX - 0
20:34:XX - 1147046   <- reboot of rabbit on cloudrabbit1001
20:35:XX - 17784
20:36:XX - 1922
20:37:XX - 608
20:38:XX - 1770
20:39:XX - 2174
20:40:XX - 3513
20:41:XX - 604
20:42:XX - 1331
20:43:XX - 656
20:44:XX - 0
20:45:XX - 0
20:46:XX - 1883
20:47:XX - 1818
20:48:XX - 1877
20:49:XX - 1862
20:50:XX - 500
20:51:XX - 503
20:52:XX - 0
20:53:XX - 0
20:54:XX - 1868
20:55:XX - 2157
20:56:XX - 0
20:57:XX - 619
20:58:XX - 0
20:59:XX - 0
21:01:XX - 947
21:02:XX - 1200736   <- reboot of rabbit on cloudrabbit1002
21:03:XX - 20766
21:04:XX - 15725
21:05:XX - 1819
21:06:XX - 619
21:07:XX - 684
21:08:XX - 868
21:09:XX - 3487
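A note on the one-liner used above: `seq 1 9` skips 21:00, and `grep "20:$i"` matches the pattern anywhere in a line (for example in the minutes:seconds part of another timestamp). A minimal Python sketch of the same per-minute byte count, anchored on the timestamp at the start of each line, could look like the following. It assumes every log line starts with `YYYY-MM-DD HH:MM:SS`, as in the rabbitmq error excerpts in this task; the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: log lines start with "YYYY-MM-DD HH:MM:SS...", as in the
# rabbitmq error excerpts quoted in this task.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}):\d{2}")


def bytes_per_minute(lines, day="2026-02-25", start="20:10", end="21:09"):
    """Count logged bytes per HH:MM bucket, like `grep | wc -c` per minute."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match or match.group(1) != day:
            continue
        minute = match.group(2)
        # Zero-padded HH:MM strings compare correctly within a single day.
        if start <= minute <= end:
            counts[minute] += len(line.encode("utf-8"))
    return counts


# Synthetic sample lines (not real log content), to show the difference
# from the unanchored grep.
sample = [
    "2026-02-25 21:00:03.000000+00:00 [info] <0.1.0> accepting AMQP connection\n",
    "2026-02-25 19:20:30.000000+00:00 [info] <0.2.0> closing AMQP connection\n",
]
for minute, size in sorted(bytes_per_minute(sample).items()):
    print(f"{minute}:XX - {size}")
```

With the sample above, the 21:00 line is counted and the 19:20:30 line is not attributed to 20:30. For a real run, the open log file object can be passed directly as `lines`.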

Interestingly enough, for designate-api (and keystone), the graphs show changes a few minutes before that already:

image.png (2,707×422 px, 115 KB)

Though others like heat seem to start changing behavior on the first reload (cloudrabbit1003):

image.png (2,709×406 px, 131 KB)

After the reboots, I see some of these errors in the rabbitmq logs:

2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>     supervisor: {<0.426320.0>,rabbit_amqqueue_sup}
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>     errorContext: child_terminated
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>     reason: {{badmatch,true},
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>              [{rabbit_classic_queue_index_v2,init,3,
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>                   [{file,"rabbit_classic_queue_index_v2.erl"},{line,172}]},
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>               {rabbit_variable_queue,init,5,
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>                   [{file,"rabbit_variable_queue.erl"},{line,424}]},
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>               {rabbit_priority_queue,init,3,
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>                   [{file,"rabbit_priority_queue.erl"},{line,150}]},
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>               {rabbit_amqqueue_process,init_it2,3,
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>                   [{file,"rabbit_amqqueue_process.erl"},{line,216}]},
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>               {gen_server2,handle_msg,2,
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>                   [{file,"gen_server2.erl"},{line,1035}]},
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>               {proc_lib,init_p_do_apply,3,
2026-02-25 21:02:45.263581+00:00 [error] <0.426320.0>                   [{file,"proc_lib.erl"},{line,329}]}]}

It seems rabbit is failing to initialize the queues. This might be related to the old (and in theory fixed) https://github.com/rabbitmq/rabbitmq-server/issues/802, which points at problems when the queue directories are not cleaned up properly while changing primaries and such :/

fgiunchedi renamed this task from Increased openstack latency and rabbitmq cluster rebuild to Increased openstack latency and rabbitmq rolling restarts on certificate update.Feb 27 2026, 8:17 AM
fgiunchedi triaged this task as High priority.
fgiunchedi updated the task description.

Change #1247588 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] openstack: do not restart rabbitmq-server on cert renewal

https://gerrit.wikimedia.org/r/1247588

Change #1247588 merged by Filippo Giunchedi:

[operations/puppet@production] openstack: do not restart rabbitmq-server on cert renewal

https://gerrit.wikimedia.org/r/1247588

I have disabled the automated roll-restart of rabbit on cert renewal, and made a note to verify that certs are indeed reloaded automatically in about a week (i.e. when the next cert renewal is expected to occur).

Confirmed that rabbitmq reloads certs without a restart:

cloudrabbit1001:~$ sudo systemctl status rabbitmq-server | grep -i active
     Active: active (running) since Wed 2026-02-25 20:34:56 UTC; 2 weeks 4 days ago
cloudrabbit1001:~$ openssl s_client -connect localhost:5671 < /dev/null
Connecting to ::1
CONNECTED(00000003)
Can't use SSL_get_servername
depth=2 C=US, ST=California, L=San Francisco, O=Wikimedia Foundation, Inc, OU=Cloud Services, CN=Wikimedia_Internal_Root_CA
verify return:1
depth=1 C=US, L=San Francisco, O=Wikimedia Foundation, Inc, OU=SRE Foundations, CN=cloud_wmnet_ca
verify return:1
depth=0 CN=cloudrabbit1001.eqiad.wmnet
verify return:1
---
Certificate chain
 0 s:CN=cloudrabbit1001.eqiad.wmnet
   i:C=US, L=San Francisco, O=Wikimedia Foundation, Inc, OU=SRE Foundations, CN=cloud_wmnet_ca
   a:PKEY: EC, (prime256v1); sigalg: ecdsa-with-SHA512
   v:NotBefore: Mar 14 20:00:00 2026 GMT; NotAfter: Apr 11 20:00:00 2026 GMT
 1 s:C=US, L=San Francisco, O=Wikimedia Foundation, Inc, OU=SRE Foundations, CN=cloud_wmnet_ca
   i:C=US, ST=California, L=San Francisco, O=Wikimedia Foundation, Inc, OU=Cloud Services, CN=Wikimedia_Internal_Root_CA
   a:PKEY: EC, (secp521r1); sigalg: ecdsa-with-SHA512
   v:NotBefore: Dec 13 18:55:00 2021 GMT; NotAfter: Dec 12 18:55:00 2026 GMT
cloudrabbit1001:~$ sudo find /etc/rabbitmq/ssl -ls
   132491      4 drwxr-x---   2 rabbitmq rabbitmq     4096 Dec  2 22:51 /etc/rabbitmq/ssl
   132494      4 -r--r-----   1 rabbitmq rabbitmq      227 Dec  2 22:51 /etc/rabbitmq/ssl/cloud_wmnet_ca__cloudrabbit1001_eqiad_wmnet-key.pem
   132496      4 -r--r-----   1 rabbitmq rabbitmq     1338 Dec  2 22:51 /etc/rabbitmq/ssl/cloud_wmnet_ca__cloudrabbit1001_eqiad_wmnet.chain.pem
   132493      4 -r--r-----   1 rabbitmq rabbitmq      578 Mar 14 20:04 /etc/rabbitmq/ssl/cloud_wmnet_ca__cloudrabbit1001_eqiad_wmnet.csr
   132503      4 -rw-r--r--   1 rabbitmq rabbitmq     2583 Mar 14 20:04 /etc/rabbitmq/ssl/cloud_wmnet_ca__cloudrabbit1001_eqiad_wmnet.chained.pem
   132492      4 -r--r-----   1 rabbitmq rabbitmq     1245 Mar 14 20:04 /etc/rabbitmq/ssl/cloud_wmnet_ca__cloudrabbit1001_eqiad_wmnet.pem
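The reasoning behind the check above is that the certificate served on port 5671 has a NotBefore date later than the start time of the rabbitmq-server process, so the process must have picked it up without restarting. A minimal sketch of that comparison follows. It assumes the two lines are copied from the `systemctl status` and `openssl s_client` output above and that both timestamps are in UTC/GMT; the function names are made up for illustration.

```python
from datetime import datetime, timezone


def parse_systemd_since(active_line):
    """Parse the start time from a systemctl 'Active: ... since ...' line."""
    stamp = active_line.split("since ", 1)[1].split(";", 1)[0].strip()
    # Assumption: systemd reports the timestamp in UTC, as in the output above.
    parsed = datetime.strptime(stamp, "%a %Y-%m-%d %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


def parse_openssl_not_before(validity_line):
    """Parse NotBefore from an openssl 'v:NotBefore: ...; NotAfter: ...' line."""
    stamp = validity_line.split("NotBefore: ", 1)[1].split(";", 1)[0].strip()
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def cert_loaded_without_restart(active_line, validity_line):
    """True if the served cert was issued after the process last started."""
    started = parse_systemd_since(active_line)
    not_before = parse_openssl_not_before(validity_line)
    return not_before > started


# Lines copied from the terminal output in this task.
active = "Active: active (running) since Wed 2026-02-25 20:34:56 UTC; 2 weeks 4 days ago"
validity = "v:NotBefore: Mar 14 20:00:00 2026 GMT; NotAfter: Apr 11 20:00:00 2026 GMT"
print(cert_loaded_without_restart(active, validity))
```

A result of True only shows that the served certificate is newer than the process; it says nothing about how quickly after renewal the new certificate was picked up.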

The question remains whether a rabbitmq roll-restart is reliable; I will test this after the openstack upgrade.

Today, during T417393: Carry out controlled network switch down tests in cloud, the same failure happened: cloudrabbit1001 was disconnected from the network and a partition was formed. When the host came back, stopping and starting rabbit on the host eventually made things recover.

@dcaro pointed out we might want to try cluster_partition_handling = pause_minority for rabbit, to make it pause the minority upon a split brain, and I agree.
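To illustrate what pause_minority would have done in the partition described above: as I understand the RabbitMQ partition handling documentation, a node pauses itself when it can see half or fewer of the cluster's nodes (itself included). The sketch below is a toy model of that rule only, not RabbitMQ code, and the function name is made up for illustration.

```python
def pauses_under_pause_minority(cluster_size, reachable_nodes):
    """Toy model of the pause_minority rule.

    reachable_nodes counts the node itself plus the peers it can still reach.
    A node keeps running only if it is part of a strict majority.
    """
    return 2 * reachable_nodes <= cluster_size


cluster = ["cloudrabbit1001", "cloudrabbit1002", "cloudrabbit1003"]
# The partition from the switch-down test: cloudrabbit1001 cut off from the rest.
partition = [{"cloudrabbit1001"}, {"cloudrabbit1002", "cloudrabbit1003"}]

for side in partition:
    for node in sorted(side):
        paused = pauses_under_pause_minority(len(cluster), len(side))
        print(f"{node}: {'paused' if paused else 'running'}")
```

In this model the isolated cloudrabbit1001 pauses while the other two nodes keep serving, which avoids two sides of the cluster accepting traffic independently.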

Something else to investigate later is quorum queues for openstack. Quorum queues are already enabled in openstack.

Change #1254877 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rabbitmq: set pause_minority for cluster_partition_handling

https://gerrit.wikimedia.org/r/1254877

Change #1254877 merged by Filippo Giunchedi:

[operations/puppet@production] rabbitmq: set pause_minority for cluster_partition_handling

https://gerrit.wikimedia.org/r/1254877

Change #1258990 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rabbit: apply cluster_partition_handling to rabbitmq4

https://gerrit.wikimedia.org/r/1258990

Change #1258990 merged by Filippo Giunchedi:

[operations/puppet@production] rabbit: apply cluster_partition_handling to rabbitmq4

https://gerrit.wikimedia.org/r/1258990

Change deployed and rabbit roll-restarted:

# rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'
{ok,pause_minority}

Note that a re-init of the rabbit cluster was neither necessary nor performed. I will verify the rabbit behavior on the next installment of T417393: Carry out controlled network switch down tests in cloud.

fgiunchedi claimed this task.

This is done: rabbit/openstack are now able to survive a rabbit host or server process going down (i.e. all durable queues), and rabbit automatically reloads certs without a restart.