Yesterday at ~20:30 UTC the OpenStack API started experiencing increased latency.
Recovery began when @Andrew rebuilt the rabbitmq cluster, after which things started to get better:
03:46 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services [admin]
03:31 <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services [admin]
03:30 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.rabbitmq.rebuild_rabbit_cluster (exit_code=0) on deployment eqiad1 [admin]
03:27 <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.rabbitmq.rebuild_rabbit_cluster on deployment eqiad1 [admin]
03:26 <andrewbogott> rebuilding the rabbitmq cluster in eqiad1; many failed messages
Opening a task to track the investigation into what happened.
A few details I could find so far:
- Rabbitmq servers were roll-restarted by puppet due to cert refresh, in this order:
2026-02-25T20:30:15.072499+00:00 cloudrabbit1003 puppet-agent[2944687]: (/Stage[main]/Rabbitmq/Service[rabbitmq-server]) Triggered 'refresh' from 1 event
2026-02-25T20:34:56.393823+00:00 cloudrabbit1001 puppet-agent[3012200]: (/Stage[main]/Rabbitmq/Service[rabbitmq-server]) Triggered 'refresh' from 1 event
2026-02-25T21:02:58.052779+00:00 cloudrabbit1002 puppet-agent[2979511]: (/Stage[main]/Rabbitmq/Service[rabbitmq-server]) Triggered 'refresh' from 1 event
i.e. 1003 and 1001 restarted ~4m apart, and 1002 followed about 30m later
- My understanding is that newer rabbitmq is able to pick up new certs without a restart. That would work and we should do it, but ultimately it is a band-aid, because it seems we can't safely roll-restart rabbitmq (did we use to be able to?)
- This issue (rabbitmq restarts -> openstack api high latency) first showed up on Dec 20th 2025 and has since happened once or twice a month, depending on exact certificate expiration. I'm using sum(rate(rabbitmq_channel_messages_unroutable_returned_total[5m])) > 0 as the signal: https://grafana.wikimedia.org/goto/EGxtl1ODg?orgId=1
The OpenstackAPIResponse alert has fired consistently due to this issue. Note that the alert fired on Dec 3rd too, and there was indeed some elevated api latency at the time, though I'm not convinced it was due to this issue.
- Judging from https://sal.toolforge.org/admin, the response so far has been to restart openstack and/or rebuild the rabbit cluster
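
For reference, the restart spacing quoted in the first bullet can be recomputed from the puppet-agent timestamps. This is only a minimal sketch: the timestamps are copied from the log lines above, and the helper name `restart_gaps` is made up for illustration, not part of any existing tooling.

```python
from datetime import datetime

# Timestamps copied from the puppet-agent log lines in this task.
RESTARTS = [
    ("cloudrabbit1003", "2026-02-25T20:30:15.072499+00:00"),
    ("cloudrabbit1001", "2026-02-25T20:34:56.393823+00:00"),
    ("cloudrabbit1002", "2026-02-25T21:02:58.052779+00:00"),
]


def restart_gaps(restarts):
    """Return (previous_host, host, gap_in_seconds) for consecutive restarts.

    Hypothetical helper: sorts by timestamp, then pairs each restart
    with the one that preceded it.
    """
    ordered = sorted((datetime.fromisoformat(ts), host) for host, ts in restarts)
    return [
        (prev_host, host, (ts - prev_ts).total_seconds())
        for (prev_ts, prev_host), (ts, host) in zip(ordered, ordered[1:])
    ]


for prev_host, host, gap in restart_gaps(RESTARTS):
    print(f"{prev_host} -> {host}: {gap / 60:.1f} min")
```

The gaps come out at a little under 5 minutes between 1003 and 1001 and about 28 minutes between 1001 and 1002, consistent with the rough figures above.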
@Andrew what do you think of the above? The timeline lines up with the latest rounds of Debian upgrades and/or Openstack upgrades in T406516




