Carry out controlled network switch down tests in cloud
Open, Needs Triage, Public

Description

We want to understand and observe the impact that a network switch going down has on cloud, under controlled conditions. The results will give us a better idea of how to proceed with {T414835} and show how far we've come in addressing T375204: [cloudceph] Improve downtime when a switch goes down.

The main driving force behind these tests is ceph failure scenarios and resiliency, though considering cloud as a whole is worthwhile. There is of course a spectrum of possibilities for the tests: from simply rebooting the switch and observing the effects, to shutting down progressively more ports, to maybe something else I'm forgetting now (?)

I have reviewed the rack allocation (P88809) and I think a good candidate to start with is C8: there are no cloudvirts, relatively few ceph TB compared to the rest (150) so in theory the impact should be zero/minimal.

Questions I have in mind:

  1. To what extent does shutting the individual ports differ from the switch rebooting, in terms of what other hosts on the network experience? What I'm getting at here is whether we can realistically and progressively simulate a switch reboot without doing it all at once.
  2. For non-ceph hosts in C8 (namely control, gw, lb, net, rabbit, services) is automatic failover and/or minimal impact expected on switch reboot?

For 1. I'm cc'ing @ayounsi and @cmooney to help answer, whereas for 2. maybe @taavi @Andrew you have ideas/insights ?

Specifically for C8 these are the hosts in service, broken down by "failover status"

Manual failover / maint mode

Needs maintenance mode and/or manual failover (e.g. ceph noout)

cloudcephmon1004.eqiad.wmnet
cloudcephosd1016.eqiad.wmnet
cloudcephosd1017.eqiad.wmnet
cloudcephosd1018.eqiad.wmnet
cloudcephosd1021.eqiad.wmnet
cloudcephosd1022.eqiad.wmnet
cloudcephosd1035.eqiad.wmnet
cloudcephosd1042.eqiad.wmnet
cloudcephosd1043.eqiad.wmnet

Automatic failover

Will failover automatically, with some/no user impact

cloudgw1003.eqiad.wmnet
cloudlb1001.eqiad.wmnet
cloudnet1005.eqiad.wmnet
cloudservices1006.eqiad.wmnet
cloudcontrol1011.eqiad.wmnet
cloudrabbit1001.eqiad.wmnet

N/A - no failover / no user impact

No failover required/needed though no immediate user impact either

cloudbackup1003.eqiad.wmnet

Testing plan

We'll be testing a "switch reboot" scenario by progressively shutting interfaces on the C8 switch side and assessing the impact on services.

ceph

Ahead of the work we'll set noout on the ceph cluster (ceph osd set noout) to prevent data rebalancing, then start by shutting one OSD host and assess the impact. We'll continue with more OSD hosts if there is no impact, then shut the mon too and assess again. This is the most important part of the test, as ceph rebalancing has historically been the reason cloud switch reboots are "scary".

gw/lb/net

These hosts are meant to be stateless by design. We'll be shutting them one after the other and assessing the impact.

services/control/rabbit

These hosts are stateful, and at least the rabbit/openstack interaction is known to be less than failure resistant (T418444). Here too we'll be shutting one interface after the other and assessing the impact.

Event Timeline

For 1. the impact depends on the application. From an L1/L2 perspective, a port shutdown and a switch reboot are the same thing.

gw/lb/net/services will all failover automatically, although if we want to be a bit more graceful all of them can be failed over or depooled manually as well.

In case someone finds this task in the future and wonders how to find out how many TB are on each rack, the weights are a map 1:1 to TB, so this works:

root@cloudcephmon1005:~# ceph osd tree | grep rack
-83         153.69781      rack C8                                             
-81         167.67026      rack D5                                             
-77         368.52048      rack E4                                             
-79         293.42352      rack F4
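
For comparing racks programmatically, here is a minimal Python sketch that parses the rack rows shown above. It assumes the same column layout as that output (ID, weight, type, name) and relies on the stated 1:1 mapping between weight and TB. The function name `rack_weights` is made up for illustration.

```python
def rack_weights(osd_tree_text):
    """Return {rack_name: crush_weight} from `ceph osd tree` text output."""
    weights = {}
    for line in osd_tree_text.splitlines():
        fields = line.split()
        # Rack bucket rows look like: "-83  153.69781  rack C8"
        if len(fields) >= 4 and fields[2] == "rack":
            weights[fields[3]] = float(fields[1])
    return weights


# Rows copied from the output above.
SAMPLE = """\
-83         153.69781      rack C8
-81         167.67026      rack D5
-77         368.52048      rack E4
-79         293.42352      rack F4
"""

weights = rack_weights(SAMPLE)
total = sum(weights.values())
# Print racks from smallest to largest, with their share of the listed total.
for rack, tb in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{rack}: {tb:.1f} TB ({tb / total:.0%} of listed racks)")
```

Sorting by weight puts the rack with the least ceph capacity first, which is the criterion used above to pick C8 as the first candidate.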

I spoke with @cmooney today and we settled on Tuesday March 10th, European morning, as the day to carry out the tests in C8

Icinga downtime and Alertmanager silence (ID=5025cdeb-6797-439c-a30c-98b645a86cc9) set by filippo@cumin1003 for 4:00:00 on 19 host(s) and their services with reason: switch down tests

cloudbackup1003.eqiad.wmnet,cloudcephmon1004.eqiad.wmnet,cloudcephosd[1016-1018,1021-1022,1035,1042-1043].eqiad.wmnet,cloudcontrol1011.eqiad.wmnet,cloudgw1003.eqiad.wmnet,cloudlb1001.eqiad.wmnet,cloudnet1005.eqiad.wmnet,cloudrabbit1001.eqiad.wmnet,cloudservices1006.eqiad.wmnet,tools-k8s-ctrl1001.eqiad.wmnet,tools-k8s-worker[1001-1002].eqiad.wmnet
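
To cross-check the host count against the bracketed range notation above, a small Python sketch can expand the ranges. This is a simplified expander written for this one list (a single bracket group per name), not the tool that produced the message. The helper names are made up for illustration.

```python
import re


def split_top_level(spec):
    """Split on commas that are not inside [...] brackets."""
    parts, depth, current = [], 0, []
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand(spec):
    """Expand names such as 'host[1016-1018,1021].example' into a list."""
    hosts = []
    for part in split_top_level(spec):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not match:
            hosts.append(part)
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            end = end or start  # a single number is a range of one
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{len(start)}d}{suffix}")
    return hosts


# Host list copied from the downtime message above.
SPEC = (
    "cloudbackup1003.eqiad.wmnet,cloudcephmon1004.eqiad.wmnet,"
    "cloudcephosd[1016-1018,1021-1022,1035,1042-1043].eqiad.wmnet,"
    "cloudcontrol1011.eqiad.wmnet,cloudgw1003.eqiad.wmnet,"
    "cloudlb1001.eqiad.wmnet,cloudnet1005.eqiad.wmnet,"
    "cloudrabbit1001.eqiad.wmnet,cloudservices1006.eqiad.wmnet,"
    "tools-k8s-ctrl1001.eqiad.wmnet,tools-k8s-worker[1001-1002].eqiad.wmnet"
)

hosts = expand(SPEC)
print(len(hosts))
```

The expanded list has 19 entries, matching the "19 host(s)" in the downtime message.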

For reference these are the host facing interfaces we'll be operating on:

xe-0/0/3        up    up   cloudnet1005 {#20220119}
xe-0/0/5        up    up   cloudlb1001 {#11059}
xe-0/0/6        up    up   cloudcephmon1004 {#230304500102313}
xe-0/0/12       up    up   cloudcephosd1016 {#5348}
xe-0/0/14       up    up   cloudcephosd1017 {#5347}
xe-0/0/15       up    up   cloudcephosd1016 {#5349}
xe-0/0/16       up    up   cloudcephosd1018 {#5396}
xe-0/0/17       up    up   cloudcephosd1017 {#5346}
xe-0/0/19       up    up   cloudservices1006 {#5321}
xe-0/0/20       up    up   cloudcephosd1035 {#5335}
xe-0/0/21       up    up   cloudrabbit1001 {#5336}
xe-0/0/22       up    up   cloudcephosd1021 {#11034}
xe-0/0/23       up    up   cloudcephosd1021 {#11032}
xe-0/0/29       up    up   cloudcephosd1042 {#5204}
xe-0/0/30       up    up   cloudcephosd1043 {#5205}
xe-0/0/31       up    up   cloudcontrol1011 {#5200}
xe-0/0/32       up    up   cloudgw1003 {#5201}
xe-0/0/33       up    up   cloudcephosd1018 {#5397}
xe-0/0/34       up    up   cloudcephosd1022 {#11031}
xe-0/0/35       up    up   cloudcephosd1022 {#11033}
xe-0/0/41       up    up   cloudbackup1003 {#1208202106}
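
Several OSD hosts appear twice in this list, so shutting a host means disabling all of its ports. The following Python sketch groups ports by host. It assumes the column layout shown above (interface, admin state, link state, description), and `ports_by_host` is a name made up for illustration.

```python
from collections import defaultdict


def ports_by_host(listing):
    """Map each host name to the list of switch ports facing it."""
    hosts = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Expected columns: interface, admin state, link state, host, asset tag
        if len(fields) >= 4 and fields[0].startswith("xe-"):
            hosts[fields[3]].append(fields[0])
    return dict(hosts)


# A few rows copied from the interface list above.
SAMPLE = """\
xe-0/0/6        up    up   cloudcephmon1004 {#230304500102313}
xe-0/0/12       up    up   cloudcephosd1016 {#5348}
xe-0/0/14       up    up   cloudcephosd1017 {#5347}
xe-0/0/15       up    up   cloudcephosd1016 {#5349}
xe-0/0/17       up    up   cloudcephosd1017 {#5346}
xe-0/0/20       up    up   cloudcephosd1035 {#5335}
"""

grouped = ports_by_host(SAMPLE)
for host, ports in sorted(grouped.items()):
    print(host, ", ".join(ports))
```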

Mentioned in SAL (#wikimedia-operations) [2026-03-10T07:49:30Z] <godog> prep cloudsw reboot tests 'ceph osd set noout' - T417393

Mentioned in SAL (#wikimedia-operations) [2026-03-10T08:05:18Z] <godog> start disabling cloudcephosd interfaces - T417393

Mentioned in SAL (#wikimedia-operations) [2026-03-10T08:18:28Z] <godog> disabled interfaces for cloudcephosd1016 cloudcephosd1017 cloudcephosd1016 cloudcephosd1018 cloudcephosd1017 cloudcephosd1035 - T417393

Mentioned in SAL (#wikimedia-operations) [2026-03-10T08:22:56Z] <godog> disabled interfaces for cloudcephosd1021 cloudcephosd1042 cloudcephosd1043 cloudcephosd1018 cloudcephosd1022 - T417393

Mentioned in SAL (#wikimedia-operations) [2026-03-10T08:30:08Z] <godog> disabled interface for cloudcephmon1004 - T417393

Mentioned in SAL (#wikimedia-operations) [2026-03-10T09:00:00Z] <godog> restore all host interfaces - T417393

Tests have been completed for today, with good news and bad news. The good news is that ceph with ceph osd set noout behaved as expected, i.e. bringing down osd and mon hosts did not result in an outage or in network saturation due to rebalancing activity. The bad news is that bringing down the net/lb/gw cloud hosts resulted in cloud networking being down. The services/rabbit/control hosts have not been tested yet.

Plan is to grab another announced maint window on Tues March 17th to resume the testing.

I have also opened subtasks for the remaining racks, one notable difference is that those do contain cloudvirt hosts. @Andrew what's the recommended procedure to temporarily drain a rack of VMs and then put them back? So far I found wmcs.openstack.cloudvirt.drain cookbook mentioned on wikitech

wmcs.openstack.cloudvirt.drain should be what you need -- it will migrate VMs off the host and also mark the host as in maintenance. Then to repool you'll use wmcs.openstack.cloudvirt.unset_maintenance

The only exception to this is the 'cloudvirtlocal' hosts which cannot be drained easily due to using local storage.

Oh, to check the maintenance state of a host you want to look at the host aggregates. Docs for that here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Host_aggregates

Thank you, I looked at cloudvirt.drain though I couldn't find an option specifically to make sure the destination host is not in the rack we are draining. Maybe not a huge issue though? The scenario I'm thinking about is we're draining a cloudvirt and all/most VMs migrate to another cloudvirt in the same rack, of course things would converge eventually at the risk of moving VMs a bunch of times.

re: cloudvirt.unset_maintenance would VMs also migrate back to their original host and/or would the rack be balanced with VMs like it was previously?

The only exception to this is the 'cloudvirtlocal' hosts which cannot be drained easily due to using local storage.

Indeed, I checked VMs on cloudvirtlocal hosts and they are all etcd for tools/toolsbeta which will survive (in theory!) brief downtime as designed

Thank you, I looked at cloudvirt.drain though I couldn't find an option specifically to make sure the destination host is not in the rack we are draining. Maybe not a huge issue though? The scenario I'm thinking about is we're draining a cloudvirt and all/most VMs migrate to another cloudvirt in the same rack, of course things would converge eventually at the risk of moving VMs a bunch of times.

Correct, those cookbooks are not rack aware at all. The most efficient process would be to run the set_maintenance script on all cloudvirts in a given rack and then drain the individual cloudvirts; that would avoid moving any VMs more than once.

re: cloudvirt.unset_maintenance would VMs also migrate back to their original host and/or would the rack be balanced with VMs like it was previously?

unset_maintenance doesn't move any VMs, it just allows the cloudvirts to receive VMs later. So after you drain a cloudvirt and then unset_maintenance it will be empty until VMs are moved or scheduled for unrelated reasons. That's generally fine, although after such a big operation I would check grafana to make sure there aren't any cloudvirts which are obviously overloaded. https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad-summary?orgId=1&from=now-7d&to=now&timezone=utc&var-hypervisor=$__all&refresh=1m

The quick summary for all these questions is: we haven't really designed workflows for bulk cloudvirt draining, only for one-offs. If it turns out that we want to be able to drain and refill whole big sets of cloudvirts in the future, we will want some new smarter cookbooks to manage that.
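
The ordering described above (put every cloudvirt in the rack into maintenance first, then drain them one by one) can be sketched in Python as a plan generator. It only produces an ordered list of steps and does not call any cookbook. The host names and rack layout below are hypothetical, and the step labels are plain strings rather than real cookbook names or flags.

```python
def rack_drain_plan(cloudvirts_by_rack, rack):
    """Return ordered (step, host) pairs for emptying one rack."""
    hosts = sorted(cloudvirts_by_rack[rack])
    # Mark every host in the rack first, so no drained VM can land on a
    # host that is about to be drained as well.
    plan = [("set_maintenance", host) for host in hosts]
    plan += [("drain", host) for host in hosts]
    return plan


def rack_repool_plan(cloudvirts_by_rack, rack):
    """Return the steps that make the rack schedulable again (moves no VMs)."""
    return [("unset_maintenance", host) for host in sorted(cloudvirts_by_rack[rack])]


# Hypothetical inventory, for illustration only.
INVENTORY = {
    "rack-a": ["cloudvirt-example2", "cloudvirt-example1"],
    "rack-b": ["cloudvirt-example3"],
}

plan = rack_drain_plan(INVENTORY, "rack-a")
for step, host in plan + rack_repool_plan(INVENTORY, "rack-a"):
    print(step, host)
```

Since unset_maintenance does not move VMs back, the repool half is only a list of hosts to re-enable. Any rebalancing afterwards would be a separate step.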

I like that idea and will be doing the set_maintenance + drain procedure when the time comes

Thank you for the explanation, makes sense to me. Agreed if rack drain becomes a more regular occurrence then it makes sense to invest in cookbooks to operate at the rack level

Mentioned in SAL (#wikimedia-cloud) [2026-03-18T08:00:18Z] <godog> start network switch failover tests - T417393

Mentioned in SAL (#wikimedia-cloud) [2026-03-18T08:58:41Z] <godog> bounce rabbit on cloudrabbit1001 - T417393

Mentioned in SAL (#wikimedia-cloud) [2026-03-18T09:25:46Z] <godog> end network switch failover tests - T417393

Tests today went significantly better: cloud vps networking stayed intact. I started by failing over cloudgw, which meant the hosts using anycast addresses (lb, services) had already failed over. Rabbit suffered from a network partition (i.e. T418444), though stopping and starting rabbit on cloudrabbit1001 eventually made things recover. cloudcontrol1011 was the last host and has not been tested yet

Mentioned in SAL (#wikimedia-cloud) [2026-04-08T07:00:37Z] <godog> test shutting cloudrabbit1001 network interface - T417393

Mentioned in SAL (#wikimedia-cloud) [2026-04-08T07:26:52Z] <godog> unshut cloudrabbit1001 network interface - T417393

Mentioned in SAL (#wikimedia-cloud) [2026-04-08T07:36:47Z] <godog> shut cloudcontrol1011 network interface - T417393

Mentioned in SAL (#wikimedia-cloud) [2026-04-08T07:57:42Z] <godog> unshut cloudcontrol1011 network interface - T417393

Mentioned in SAL (#wikimedia-cloud) [2026-04-08T08:14:47Z] <godog> perform more network tests on cloudrabbit1001 - T417393

Today cloudrabbit1001 and cloudcontrol1011 were tested:

  • rabbitmq itself performed as expected, i.e. no quorum queues were lost and the remaining cloudrabbit hosts took over. The queues are not balanced ATM, though rabbitmq-queues rebalance quorum is enough to rebalance them if needed

(screenshot: 2026-04-08-110459_3765x1392_scrot.png)

  • cloudcontrol nodes not in C8 (i.e. 1006/1007), however, didn't seem to give up trying to connect to rabbitmq01.eqiad1.wikimediacloud.org:5671, whereas cloudcontrol1011 stopped trying to talk to rabbitmq01 as expected.

The following logs (journalctl --since -4h --grep ERROR.*oslo | grep ^Apr | uniq -c -w12 | less) show that oslo stopped trying cleanly on cloudcontrol1011 but not cloudcontrol1006 for example. cloudrabbit1001 was unplugged from the network from 7:00 to 7:26 UTC

full log here P90326

 22 Apr 08 07:03:28 cloudcontrol1006 uwsgi_python3[264036]: <frozen importlib._bootstrap>: 2026-04-08 07:03:28.705 264036 ERROR oslo.messaging._driver>
 63 Apr 08 07:04:03 cloudcontrol1006 uwsgi_python3[264037]: <frozen importlib._bootstrap>: 2026-04-08 07:04:03.417 264037 ERROR oslo.messaging._driver>
 56 Apr 08 07:05:03 cloudcontrol1006 uwsgi_python3[599587]: <frozen importlib._bootstrap>: 2026-04-08 07:05:03.168 599587 ERROR oslo.messaging._driver>
188 Apr 08 07:06:00 cloudcontrol1006 uwsgi_python3[264032]: <frozen importlib._bootstrap>: 2026-04-08 07:06:00.046 264032 ERROR oslo.messaging._driver>
306 Apr 08 07:07:00 cloudcontrol1006 uwsgi_python3[600940]: <frozen importlib._bootstrap>: {"message": "Connection failed: timed out (retrying in 0 se>
264 Apr 08 07:08:00 cloudcontrol1006 heat-engine[600230]: 2026-04-08 07:08:00.284 600230 ERROR oslo.messaging._drivers.impl_rabbit [-] [690ad05d-8719->
261 Apr 08 07:09:00 cloudcontrol1006 neutron-rpc-server[601202]: 2026-04-08 07:09:00.063 601202 ERROR oslo.messaging._drivers.impl_rabbit [-] [f98046b>
272 Apr 08 07:10:00 cloudcontrol1006 designate-worker[599602]: 2026-04-08 07:10:00.525 599602 ERROR oslo.messaging._drivers.impl_rabbit [None req-da0e>
265 Apr 08 07:11:01 cloudcontrol1006 designate-worker[599602]: 2026-04-08 07:11:01.169 599602 ERROR oslo.messaging._drivers.impl_rabbit [None req-e6f9>
257 Apr 08 07:12:00 cloudcontrol1006 neutron-rpc-server[601199]: 2026-04-08 07:12:00.012 601199 ERROR oslo.messaging._drivers.impl_rabbit [-] [e0ecd0e>
264 Apr 08 07:13:00 cloudcontrol1006 uwsgi_python3[599587]: <frozen importlib._bootstrap>: 2026-04-08 07:13:00.438 599587 ERROR oslo.messaging._driver>
264 Apr 08 07:14:00 cloudcontrol1006 uwsgi_python3[264034]: <frozen importlib._bootstrap>: 2026-04-08 07:14:00.028 264034 ERROR oslo.messaging._driver>
263 Apr 08 07:15:00 cloudcontrol1006 cinder-wsgi[263354]: 2026-04-08 07:15:00.086 263354 ERROR oslo.messaging._drivers.impl_rabbit [-] [95559846-4dee->
266 Apr 08 07:16:00 cloudcontrol1006 trove-api[3900539]: {"message": "[6c385001-64f4-4a09-8e41-ce3da8190e32] AMQP server on rabbitmq01.eqiad1.wikimedi>
266 Apr 08 07:17:00 cloudcontrol1006 uwsgi_python3[599587]: <frozen importlib._bootstrap>: 2026-04-08 07:17:00.370 599587 ERROR heat.common.wsgi [None>
264 Apr 08 07:18:01 cloudcontrol1006 trove-api[3900539]: {"message": "[6c385001-64f4-4a09-8e41-ce3da8190e32] AMQP server on rabbitmq01.eqiad1.wikimedi>
265 Apr 08 07:19:00 cloudcontrol1006 heat-engine[600233]: 2026-04-08 07:19:00.389 600233 ERROR oslo.messaging._drivers.impl_rabbit [-] [e4583d8a-267e->
265 Apr 08 07:20:01 cloudcontrol1006 heat-engine[600233]: 2026-04-08 07:20:01.047 600233 ERROR oslo.messaging._drivers.impl_rabbit [-] [e4583d8a-267e->
258 Apr 08 07:21:00 cloudcontrol1006 neutron-rpc-server[601199]: 2026-04-08 07:21:00.052 601199 ERROR oslo.messaging._drivers.impl_rabbit [-] [8fa1df9>
266 Apr 08 07:22:00 cloudcontrol1006 heat-engine[600233]: 2026-04-08 07:22:00.346 600233 ERROR oslo.messaging._drivers.impl_rabbit [-] [dbcdd6d0-1f6d->
265 Apr 08 07:23:00 cloudcontrol1006 heat-engine[600230]: 2026-04-08 07:23:00.116 600230 ERROR oslo.messaging._drivers.impl_rabbit [-] [690ad05d-8719->
258 Apr 08 07:24:00 cloudcontrol1006 uwsgi_python3[264033]: <frozen importlib._bootstrap>: 2026-04-08 07:24:00.054 264033 ERROR oslo.messaging._driver>
267 Apr 08 07:25:00 cloudcontrol1006 designate-worker[599602]: 2026-04-08 07:25:00.004 599602 ERROR oslo.messaging._drivers.impl_rabbit [None req-a4fc>
264 Apr 08 07:26:00 cloudcontrol1006 designate-worker[599602]: 2026-04-08 07:26:00.646 599602 ERROR oslo.messaging._drivers.impl_rabbit [None req-28db>
110 Apr 08 07:27:00 cloudcontrol1006 uwsgi_python3[264039]: <frozen importlib._bootstrap>: 2026-04-08 07:27:00.012 264039 ERROR oslo.messaging._driver>
  6 Apr 08 07:37:06 cloudcontrol1006 designate-producer[600459]: 2026-04-08 07:37:06.297 600459 ERROR oslo.service.backend._eventlet.loopingcall [None>
 14 Apr 08 07:38:05 cloudcontrol1006 designate-producer[600459]: 2026-04-08 07:38:05.306 600459 ERROR oslo.service.backend._eventlet.loopingcall [None>
 13 Apr 08 07:39:03 cloudcontrol1006 uwsgi_python3[599593]: <frozen importlib._bootstrap>: 2026-04-08 07:39:03.748 599593 ERROR heat.common.wsgi [None>

vs brief spike of errors (full log P90325)

  1 Apr 08 05:52:40 cloudcontrol1011 designate-sink[605489]: 2026-04-08 05:52:40.251 605489 ERROR oslo_messaging.notify.dispatcher [None req-d59be779->
 30 Apr 08 07:03:25 cloudcontrol1011 uwsgi_python3[321395]: <frozen importlib._bootstrap>: 2026-04-08 07:03:25.412 321395 ERROR oslo.messaging._driver>
 60 Apr 08 07:04:01 cloudcontrol1011 uwsgi_python3[321390]: <frozen importlib._bootstrap>: 2026-04-08 07:04:01.668 321390 ERROR oslo.messaging._driver>
  7 Apr 08 07:05:11 cloudcontrol1011 uwsgi_python3[604770]: <frozen importlib._bootstrap>: 2026-04-08 07:05:11.652 604770 ERROR oslo.messaging._driver>
209 Apr 08 07:06:17 cloudcontrol1011 neutron-rpc-server[606155]: 2026-04-08 07:06:17.188 606155 ERROR oslo.messaging._drivers.impl_rabbit [-] [904ba50>
202 Apr 08 07:07:01 cloudcontrol1011 heat-engine[605379]: 2026-04-08 07:07:01.604 605379 ERROR oslo.messaging._drivers.impl_rabbit [-] [64b4cdb3-c44b->
  2 Apr 08 07:09:11 cloudcontrol1011 nova-api-wsgi[605707]: 2026-04-08 07:09:11.548 605707 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection fa>
  7 Apr 08 07:10:05 cloudcontrol1011 nova-api-wsgi[605707]: 2026-04-08 07:10:05.541 605707 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection fa>
 14 Apr 08 07:12:02 cloudcontrol1011 nova-api-wsgi[605696]: 2026-04-08 07:12:02.908 605696 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection fa>

In other words connecting to cloudrabbit1001 from cloudcontrol1011 (same rack/vlan) resulted in "no route to host", whereas from hosts in other racks the connection timed out with no feedback from the OS or the network.
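
The per-minute counts in the two listings come from `uniq -c -w12`, which compares only the first 12 characters of each line (e.g. `Apr 08 07:03`). A rough Python equivalent is sketched below. Unlike `uniq`, it counts across the whole input rather than adjacent runs, which gives the same result for chronologically ordered logs. The sample lines are abbreviated stand-ins, not real log entries.

```python
from collections import Counter


def errors_per_minute(lines, width=12):
    """Count lines that share the same first `width` characters."""
    return Counter(line[:width] for line in lines if line.startswith("Apr"))


# Abbreviated, illustrative lines in journalctl's short timestamp format.
SAMPLE = [
    "Apr 08 07:03:28 cloudcontrol1006 example[1]: ERROR oslo.messaging ...",
    "Apr 08 07:03:41 cloudcontrol1006 example[1]: ERROR oslo.messaging ...",
    "Apr 08 07:04:03 cloudcontrol1006 example[2]: ERROR oslo.messaging ...",
    "-- Boot marker or other non-log line --",
    "Apr 08 07:04:59 cloudcontrol1006 example[2]: ERROR oslo.messaging ...",
    "Apr 08 07:04:59 cloudcontrol1006 example[3]: ERROR oslo.messaging ...",
]

counts = errors_per_minute(SAMPLE)
for minute, count in sorted(counts.items()):
    print(f"{count:4d} {minute}")
```

Comparing these counts over the 07:00 to 07:26 window for each host is what distinguishes the sustained error rate on cloudcontrol1006 from the brief spike on cloudcontrol1011.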

As for cloudcontrol1011 being offline (07:36 to 07:57), there were latency spikes for a few services as reported by https://grafana.wikimedia.org/d/UUmLqqX4k/wmcs-openstack-api-latency though all but designate seem to have recovered by themselves

cloudcontrol nodes not in C8 (i.e. 1006/1007) though didn't seem to give up trying to connect to rabbitmq01.eqiad1.wikimediacloud.org:5671 whereas cloudcontrol1011 stopped trying to talk to rabbitmq01 as expected.

I tuned oslo settings for rabbitmq timeout &c but it was a long time ago, probably before our current rabbitmq setup. So we should do some new testing and reviewing of those settings. This blog post is very old but somewhat relevant to the topic: https://medium.com/@george.shuklin/rabbit-heartbeat-timeouts-in-openstack-fa5875e0309a

Yes, in terms of timeouts I was thinking of testing a shorter rpc timeout as a whole, i.e. oslo callers will time out faster (e.g. within our alerting thresholds), though I'm not sure that is going to fix this particular issue. It might help with not having everything jam up and returning errors faster instead. I followed up on this specific issue in T422820: oslo.messaging does not failover to the next rabbit host on traffic blackhole situations, with more details on the observed behaviour