
Dedicated cloudrabbit nodes in eqiad1
Closed, ResolvedPublic

Description

Currently rabbitmq services are running on the cloudcontrol nodes along with many openstack services.

Rabbitmq is the messaging service that all of our openstack services use. Currently rabbitmq runs on the cloudcontrol nodes, which also host many openstack services. We've been having ongoing issues with those services sharing servers: performance problems with rabbit, and openstack maintenance causing rabbit issues. Rabbitmq is generally the weak point of our infra; it's somewhat touchy and unstable.

This task tracks the setup and implementation of dedicated rabbitmq nodes in eqiad1: cloudrabbit100[1-3]. These servers will use public IPs so that services that run on-cloud (orchestration things like Trove and Magnum) can talk to rabbit in order to coordinate with the openstack services. Rabbitmq requires persistent low-latency connections, so it isn't a great candidate for proxying.

Event Timeline

Before this is closed out I want to review the firewall rules and see if we need to limit port access to VM + prod networks

Before this is closed out I want to review the firewall rules and see if we need to limit port access to VM + prod networks

This turns out to be already done; these ports showed up on the portscan because the scanner runs within cloud-vps.

Change 820465 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Remove rabbitmq profile from cloudcontrol nodes

https://gerrit.wikimedia.org/r/820465

Change 820465 merged by Andrew Bogott:

[operations/puppet@production] Remove rabbitmq profile from cloudcontrol nodes

https://gerrit.wikimedia.org/r/820465

Change 820593 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Switch back to the new cloudrabbit1xxx nodes and switch nova to TLS

https://gerrit.wikimedia.org/r/820593

Change 820593 merged by Andrew Bogott:

[operations/puppet@production] Switch back to the new cloudrabbit1xxx nodes and switch nova to TLS

https://gerrit.wikimedia.org/r/820593

Change 820598 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "Switch back to the new cloudrabbit1xxx nodes and switch nova to TLS"

https://gerrit.wikimedia.org/r/820598

Change 820598 merged by Andrew Bogott:

[operations/puppet@production] Revert "Switch back to the new cloudrabbit1xxx nodes and switch nova to TLS"

https://gerrit.wikimedia.org/r/820598

ayounsi added subscribers: Cmjohnson, ayounsi.

Thanks for opening this task!

I have a few questions and possibly ideas for improvements we could implement.

Let’s start with some background.

We give public v4 IPs more scrutiny for 2 reasons:

  • They are directly exposed to the Internet, and thus have fewer safeguards if a misconfiguration or bug is introduced to their firewall rules or host services
  • They are scarce, and thus should be used only if there are no other options (for example if a service shouldn't depend on LVS); see also this diagram

Given the above, I would like to ask you for a bit more context on how this service works, for example:

  • How will it scale? (The current setup seems to show that the public IP need will scale linearly with the number of hosts.)
  • How will redundancy work?
  • Are clients pointed to a single host? Single IP, RR DNS, manual configuration change, manual DNS change, etc?
  • What are the high-level network flows like?

You’ve mentioned that it isn't a great candidate for proxying.
By proxying, do you include LVS? If so, that shouldn’t add significant delay. Have you performed any tests by chance to verify that?

From the little understanding I have, it seems like a service that could benefit from leveraging the existing LVS infrastructure, with all the bells and whistles that come with it. It would be great to hear more about the project, though; maybe there are some factors that I haven't taken into account, or perhaps there are some good ways to leverage our existing frameworks.

Change 820728 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack: use cloudcontrol1004/1004 as rabbitmq hosts

https://gerrit.wikimedia.org/r/820728

Change 820728 merged by Andrew Bogott:

[operations/puppet@production] Openstack: use cloudcontrol1004/1004 as rabbitmq hosts

https://gerrit.wikimedia.org/r/820728

This is going very poorly.

When I switch to the new cluster, most things seem to work, but VMs never get scheduled. I can see messages getting transmitted from nova-api to nova-conductor properly but either nova-conductor can't talk to nova-scheduler OR (more likely) nova-compute isn't properly communicating with nova-scheduler.
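
(For what it's worth, a generic way to check whether messages are piling up unconsumed is to look at queue depth and consumer counts on a rabbit node; this is just a sketch of the sort of check one might run, not commands from the actual debugging session:)

# On one of the rabbit hosts: list queues with their message backlog and consumer count.
# A scheduler/conductor/compute queue with messages > 0 but consumers = 0 would point at
# the service that isn't actually listening.
rabbitmqctl list_queues -p / name messages consumers | grep -E 'scheduler|conductor|compute'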

I tried switching back to a decom-proof cluster of cloudcontrol100[567] but that exhibited the same issue.

Now I've switched to a cluster of cloudcontrol100[34] and that seems to be working. So either there's some local secret config on those hosts that fixes the communication with cloudvirts, or I'm just getting lucky and there's some random factor that sometimes works when setting up a new rabbit cluster.

Notably, all of the affected services claim to be talking to rabbit just fine, including the compute nodes.

Since it's Friday I'm hoping that the current setup (cloudcontrol100[34]) is stable enough to last the weekend. Trove guest agents will be unhappy as they're still expecting to talk to cloudrabbit100[123] but that isn't likely to be an issue in the meantime as databases should still work fine.

taavi suggests that the difference might be two nodes vs. three nodes -- that's something to experiment with.

Change 820833 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::openstack::eqiad1::rabbitmq_nodes: switch to dedicated nodes

https://gerrit.wikimedia.org/r/820833

Change 820833 merged by Andrew Bogott:

[operations/puppet@production] profile::openstack::eqiad1::rabbitmq_nodes: switch to dedicated nodes

https://gerrit.wikimedia.org/r/820833

Change 820848 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova.conf: use ipv4 address for rabbit hosts rather than fqdn

https://gerrit.wikimedia.org/r/820848

Change 820848 merged by Andrew Bogott:

[operations/puppet@production] nova.conf: use ipv4 address for rabbit hosts rather than fqdn

https://gerrit.wikimedia.org/r/820848

Change 820849 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] neutron.conf: use ipv4 address for rabbit hosts rather than fqdn

https://gerrit.wikimedia.org/r/820849

Change 820849 merged by Andrew Bogott:

[operations/puppet@production] neutron.conf: use ipv4 address for rabbit hosts rather than fqdn

https://gerrit.wikimedia.org/r/820849

Change 820850 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] neutron.conf: fix copy/paste error in port number

https://gerrit.wikimedia.org/r/820850

Change 820850 merged by Andrew Bogott:

[operations/puppet@production] neutron.conf: fix copy/paste error in port number

https://gerrit.wikimedia.org/r/820850

Here is a definite hint:

2022-08-06 21:35:40.231 2451963 ERROR oslo_messaging.rpc.server [req-c9f5ee93-a1c6-41a4-82f1-732852e836fa andrew admin-monitoring - default default] Exception during message handling: oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on cloudcontrol1003.wikimedia.org:5672 after inf tries: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'durable' for exchange 'nova' in vhost '/': received 'true' but current is 'false'

The durable exchange thing isn't interesting, but the fact that conductor is trying to talk to cloudcontrol1003 for RPC is. There's some persistent config someplace that we've missed.
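
(As an aside, the durability mismatch itself can be inspected directly on the broker; a minimal sketch, assuming the default '/' vhost:)

# Show whether the 'nova' exchange was declared durable on this cluster.
rabbitmqctl list_exchanges -p / name durable | grep '^nova'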

Phantom config lurking in database!

mysql:galera_backup@localhost [nova_api_eqiad1]> select * from cell_mappings;
+---------------------+------------+----+--------------------------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+----------+
| created_at          | updated_at | id | uuid                                 | name  | transport_url                                                                                                                                                                    | database_connection                                                                       | disabled |
+---------------------+------------+----+--------------------------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+----------+
| 2019-12-02 06:28:14 | NULL       |  1 | 00000000-0000-0000-0000-000000000000 | cell0 | none:///                                                                                                                                                                         | mysql+pymysql://nova:z7sM97huea7sgH@openstack.eqiad1.wikimediacloud.org/nova_cell0_eqiad1 |        0 |
| 2019-12-12 06:42:31 | NULL       |  4 | 1ee5b233-6b94-40f5-b3d2-fc1a89c13274 | NULL  | rabbit://nova:NZFUbsy2EPAQXh@cloudcontrol1003.wikimedia.org:5672,nova:NZFUbsy2EPAQXh@cloudcontrol1004.wikimedia.org:5672,nova:NZFUbsy2EPAQXh@cloudcontrol1005.wikimedia.org:5672 | mysql+pymysql://nova:z7sM97huea7sgH@openstack.eqiad1.wikimediacloud.org/nova_eqiad1       |        0 |
+---------------------+------------+----+--------------------------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------+----------+

(above passwords have been replaced, but the spirit of the issue remains)
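
(For the record, the stale transport_url in that cell mapping can be corrected either with nova-manage or with a direct SQL update. A sketch using the cell uuid from the output above; the cloudrabbit hostname shown is an assumption based on the cloudrabbit100[1-3] naming, and the real URL is whatever the new cluster uses:)

# Preferred: let nova rewrite the cell mapping (run on a cloudcontrol node).
nova-manage cell_v2 update_cell --cell_uuid 1ee5b233-6b94-40f5-b3d2-fc1a89c13274 \
    --transport-url 'rabbit://nova:<password>@cloudrabbit1001.wikimedia.org:5672,...'
# Equivalent direct edit of the nova_api database:
# UPDATE cell_mappings SET transport_url = 'rabbit://...' WHERE id = 4;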

After updating the nova api database, things seem to be working well. Remaining tasks are:

  1. Switch back to using service names (assuming that doesn't cause IPv6/IPv4/DNS issues)
  2. Switch nova to SSL/TLS (roughly as sketched below)
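
(For the TLS switch, the relevant nova.conf pieces would look roughly like the following. This is only a sketch: the boolean option is named ssl in recent oslo.messaging releases (older ones used rabbit_use_ssl), and the CA file path is an assumption:)

[DEFAULT]
# Point at the TLS port (5671) instead of 5672.
transport_url = rabbit://nova:<password>@rabbitmq01.eqiad1.wikimediacloud.org:5671,nova:<password>@rabbitmq02.eqiad1.wikimediacloud.org:5671,nova:<password>@rabbitmq03.eqiad1.wikimediacloud.org:5671

[oslo_messaging_rabbit]
# 'ssl' on recent oslo.messaging; older releases call this 'rabbit_use_ssl'.
ssl = true
# CA bundle path is illustrative.
ssl_ca_file = /etc/ssl/certs/ca-certificates.crt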

Change 821261 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] trove-guestagent.conf: standardize rabbitmq config

https://gerrit.wikimedia.org/r/821261

Change 821261 merged by Andrew Bogott:

[operations/puppet@production] trove-guestagent.conf: standardize rabbitmq config

https://gerrit.wikimedia.org/r/821261

Thanks for chiming in @ayounsi. Now that things are not urgently broken I have time to engage with your questions :)

Given the above, I would like to ask you for a bit more context on how this service works, for example:

  • How will it scale? (The current setup seems to show that the public IP need will scale linearly with the number of hosts.)

This service is unlikely to grow above three nodes. Rabbitmq does quorum checks that work best with an odd number of servers, so two isn't enough.

  • How will redundancy work?

Each rabbitmq server mirrors the others -- all messages should be processed identically on each instance, such that a client can connect to any server at any time.
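
(To make that concrete: with classic mirrored queues, which is presumably what this cluster uses, the mirroring is driven by a policy along these lines. A sketch only, assuming the default '/' vhost; quorum queues configure replication differently:)

# Mirror every queue across all cluster members and auto-sync new mirrors.
rabbitmqctl set_policy -p / --apply-to queues ha-all '.*' \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}'
# Check cluster membership:
rabbitmqctl cluster_status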

  • Are clients pointed to a single host? Single IP, RR DNS, manual configuration change, manual DNS change, etc?

We're currently following the HA setup as recommended throughout the OpenStack documentation: each client has a list of all three servers, and failovers are handled on the client side.

At this exact moment, the servers are identified by IP address, like so:

transport_url = rabbit://nova:<password>@208.80.154.147:5672,nova:<password>@208.80.154.73:5672,nova:<password>@208.80.155.102:5672

My preference would be to identify them via service names, with CNAME records pointing to the actual servers:

transport_url = rabbit://nova:<password>@rabbitmq01.eqiad1.wikimediacloud.org:5672,nova:<password>@rabbitmq02.eqiad1.wikimediacloud.org:5672,nova:<password>@rabbitmq03.eqiad1.wikimediacloud.org:5672

However, I was seeing inconsistent behavior between IPv6 and IPv4 connections, so I'm currently forcing things to IPv4 only by using IPs. I'll open a subtask about that.

Currently a config change is ugly (a puppet patch + a manual mysql command), but once things are on service names it would require only a manual DNS change.
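
(The DNS side of that would be a handful of CNAME records along these lines; the cloudrabbit target hostnames are an assumption based on the cloudrabbit100[1-3] naming:)

rabbitmq01.eqiad1.wikimediacloud.org.  IN  CNAME  cloudrabbit1001.wikimedia.org.
rabbitmq02.eqiad1.wikimediacloud.org.  IN  CNAME  cloudrabbit1002.wikimedia.org.
rabbitmq03.eqiad1.wikimediacloud.org.  IN  CNAME  cloudrabbit1003.wikimedia.org.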

  • What are the high-level network flows like?

Each rabbitmq node has a persistent connection to each other node (via port 25672) to maintain consistent state.

Each client maintains several connections with an arbitrarily-selected server via port 5672 (cleartext) or 5671 (TLS). Because there are a great many worker clients in our cluster, this adds up to hundreds of connections in total.

Most clients run on cloudcontrol nodes (e.g. cloudcontrol1005.wikimedia.org) and some run on VMs running in cloud-vps.
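
(For a rough sense of scale, the connection counts can be checked on a rabbit node with something like the following; generic commands, not output from our cluster:)

# Client connections as seen by the broker (roughly one line per connection):
rabbitmqctl list_connections -q peer_host peer_port state | wc -l
# Or at the TCP level: established connections on the AMQP ports.
ss -tn state established '( sport = :5672 or sport = :5671 )'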

You’ve mentioned that it isn't a great candidate for proxying.
By proxying, do you include LVS? If so, that shouldn’t add significant delay. Have you performed any tests by chance to verify that?

I'm not much worried about performance; rather, I'm assuming that the many-long-lived-connections pattern is outside the ideal use case of our setup. If that's wrong I'm happy to consider alternatives! I've recently been warned off of using the existing LVS setup for cloud-vps applications because we typically present edge cases that the production SREs are reluctant to support long-term; I've also been told not to use production LVS because 'everything is about to be redesigned', but perhaps that was only in reference to web services.

I'm happy to make use of existing LVS if that's appropriate and straightforward; I would however prefer to wait and have this conversation with Arturo, as he's trying to narrow our use cases and I'd hate to throw in a new one that he already has a plan for. This conversation strikes me as somewhat similar to the one for cloudswift (T296411), which did not converge on a simple 'just use LVS' conclusion but rather spun out into many different fantastical suggestions. Perhaps these are different cases and I'm missing the distinction.

Change 821311 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack::nova and ::neutron: use service names for rabbit nodes

https://gerrit.wikimedia.org/r/821311

Change 821311 merged by Andrew Bogott:

[operations/puppet@production] Openstack::nova and ::neutron: use service names for rabbit nodes

https://gerrit.wikimedia.org/r/821311

Change 822133 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] role::wmcs::openstack::eqiad1::control: remove rabbitmq

https://gerrit.wikimedia.org/r/822133

Change 822133 merged by Andrew Bogott:

[operations/puppet@production] role::wmcs::openstack::eqiad1::control: remove rabbitmq

https://gerrit.wikimedia.org/r/822133

Thanks for the explanations! I think it would be nice to have them on Wikitech to find them more easily in the future.

Based on what you said (for example, that the clients are only Cloud VPS and cloudcontrol hosts), LVS indeed isn't ideal.
Both public-vlan hosts (the current setup) and LVS hosts are meant to expose prod services/data to the outside world (including WMCS); see case 2.

As it's a service for WMCS only, case 4 is the ideal path, with the added benefit of fewer hops from client to servers (for better resilience and latency).

Once case 4 is production-ready (see T314847 and T297587), relevant services should be migrated over.