
Test, understand, improve rabbitmq HA
Closed, Resolved · Public

Description

We have three dedicated rabbitmq nodes, and they're configured as a shared cluster with all the recommended settings for HA.

Nevertheless, HA only barely works. Often when a single rabbitmq node goes down, a bunch of nova services get stuck forever pining for the missing node rather than failing over. If we restart all three nodes gradually in series, a bunch of messages get lost even though the whole idea of a rabbitmq cluster is that the messages are mirrored and persistent.

This needs a deep dive with test cases and research to figure out what's going on and how to stabilize things.

Event Timeline

I am able to reproduce this issue in codfw1dev and have not found an obvious cause. The next steps should/could be:

  • Switch to using quorum queues, which generally seem better designed than the current mirroring behavior
  • Upgrade to OpenStack version Y. There are quite a few changes in the Oslo code between X and Y, and this issue /might/ be handled there.

With Yoga packages I'm unable to reproduce this issue.

I may try moving us to quorum queues anyway since that's more future-proof.
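For reference, switching the OpenStack services to quorum queues is a small config change. A sketch, assuming the Yoga-era oslo.messaging option name (rabbit_quorum_queue) and a nova.conf-style layout; the same stanza would go into each service's config:

```ini
# nova.conf (and equivalents for the other services) -- sketch only;
# assumes the rabbit_quorum_queue option from Yoga-era oslo.messaging.
[oslo_messaging_rabbit]
# Declare new queues as quorum queues (Raft-replicated) instead of
# classic mirrored queues. Note: existing classic queues are not
# converted in place; they have to be deleted and redeclared.
rabbit_quorum_queue = true
```

Since queue type is fixed at declaration time, rolling this out means deleting the old classic queues so the services recreate them, hence the "Will be noisy!" warning below.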

Change 861890 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack config: move oslo_messaging_rabbit into a shared template

https://gerrit.wikimedia.org/r/861890

Change 861890 merged by Andrew Bogott:

[operations/puppet@production] Openstack config: move oslo_messaging_rabbit into a shared template

https://gerrit.wikimedia.org/r/861890

Change 862323 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] OpenStack: use rabbitmq Quorum queues

https://gerrit.wikimedia.org/r/862323

Change 862323 merged by Andrew Bogott:

[operations/puppet@production] OpenStack: use rabbitmq Quorum queues

https://gerrit.wikimedia.org/r/862323

Mentioned in SAL (#wikimedia-cloud) [2022-11-30T20:03:11Z] <andrewbogott> changing all rabbitmq queues to quorum queues. Will be noisy! T318816

Change 862389 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] oslo_messaging_rabbit: increase retry and backoff by a lot

https://gerrit.wikimedia.org/r/862389

Change 862389 merged by Andrew Bogott:

[operations/puppet@production] oslo_messaging_rabbit: increase retry and backoff by a lot

https://gerrit.wikimedia.org/r/862389

There's an upstream bug in the interaction between oslo-messaging and kombu -- Oslo throws an additional exception that almost always prevents kombu from failing over to a different backend.

I'm pretty sure this issue is:

https://bugs.launchpad.net/oslo.messaging/+bug/1993149

...which seems to have a pending fix (although at the moment it's a blind revert of a previous patch).

For my future reference: the code that /should/ cause the failover is in kombu.connection:

self.maybe_switch_next()  # select next host

That line is never executed because the errback invoked above calls _recoverable_error_callback in oslo_messaging/_drivers/impl_rabbit.py, which raises an exception at

timer.check_return(_raise_timeout)
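To illustrate the interaction, here is a deliberately simplified sketch (not the real kombu or oslo code; all names besides maybe_switch_next/_recoverable_error_callback are made up) of why an error callback that raises prevents the retry loop from ever reaching the host-switching step:

```python
# Simplified model of kombu's retry loop: try a host, call the
# caller-supplied errback on failure, then switch to the next host.
# If the errback raises (as oslo's _recoverable_error_callback does
# via timer.check_return once the deadline expires), the switch step
# is never reached and the client stays pinned to the dead node.

class RetryTimeout(Exception):
    """Stand-in for the timeout oslo raises from timer.check_return()."""

def ensure_connection(hosts, connect, errback, max_retries=3):
    host_index = 0
    for attempt in range(max_retries):
        host = hosts[host_index % len(hosts)]
        try:
            return connect(host)
        except ConnectionError as exc:
            errback(exc, attempt)   # oslo hooks in here...
            host_index += 1         # ...so this maybe_switch_next()
                                    # equivalent is skipped if it raises
    raise ConnectionError("all retries exhausted")

def oslo_style_errback(exc, attempt):
    # Emulates _recoverable_error_callback: once the retry deadline has
    # passed it raises instead of returning control to the loop.
    raise RetryTimeout("retry deadline exceeded")

def connect(host):
    if host == "rabbit01":          # the dead node
        raise ConnectionError(host)
    return f"connected to {host}"

hosts = ["rabbit01", "rabbit02", "rabbit03"]

# A benign errback lets the loop fail over to a healthy node:
print(ensure_connection(hosts, connect, errback=lambda e, a: None))

# The oslo-style errback escapes the loop before any failover:
try:
    ensure_connection(hosts, connect, errback=oslo_style_errback)
except RetryTimeout as exc:
    print("stuck on dead node:", exc)
```

The first call prints a successful connection to the second host; the second never gets past the first (dead) host, which matches the observed symptom of services pining forever for the missing node.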

Change 863090 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "oslo_messaging_rabbit: increase retry and backoff by a lot"

https://gerrit.wikimedia.org/r/863090

Change 864321 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] oslo_messaging_rabbit: kombu_reconnect_delay=0.1

https://gerrit.wikimedia.org/r/864321

Change 864321 merged by Andrew Bogott:

[operations/puppet@production] oslo_messaging_rabbit: kombu_reconnect_delay=0.1

https://gerrit.wikimedia.org/r/864321
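The resulting setting is a one-line change. A sketch of the stanza, assuming the standard kombu_reconnect_delay option in [oslo_messaging_rabbit] (default 1.0 seconds):

```ini
[oslo_messaging_rabbit]
# How long to wait before reconnecting after a dropped connection.
# A small value lets clients give up on a dead node quickly and move
# on to a surviving cluster member instead of blocking.
kombu_reconnect_delay = 0.1
```

This pairs with the revert of the earlier "increase retry and backoff by a lot" change: rather than retrying the same node harder, the clients now fail over sooner.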

Change 863090 merged by Andrew Bogott:

[operations/puppet@production] Revert "oslo_messaging_rabbit: increase retry and backoff by a lot"

https://gerrit.wikimedia.org/r/863090

I'm still chasing down the upstream fix for this but the above patches should resolve the issue for us.