Active/active rabbitMQ servers on wmcs controller nodes
Closed, Resolved · Public

Description

The OpenStack HA guide claims that we can run two active rabbitmq servers and client services will failover sensibly (if imperfectly) between them. Let's try it!

https://docs.openstack.org/ha-guide/control-plane-stateful.html#messaging-service-for-high-availability
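
Context: the failover behavior the guide relies on comes from oslo.messaging, which accepts several brokers in a single transport URL and moves on to the next one when the current connection drops. A minimal sketch of what that looks like in a service config (the credentials and port below are placeholders, not our real values):

[DEFAULT]
# oslo.messaging tries these brokers in order and fails over when one becomes unreachable
transport_url = rabbit://guest:SECRET@cloudcontrol1003:5672,guest:SECRET@cloudcontrol1004:5672/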

Event Timeline

Andrew created this task. May 20 2019, 2:13 PM

Change 511599 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: set erlang_cookie for cloud deploys

https://gerrit.wikimedia.org/r/511599
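
Context: RabbitMQ nodes will only form a cluster if they share an identical Erlang cookie, which is the invariant this patch manages through Puppet. A hand-run sketch of checking it (the path is the Debian default for the rabbitmq user; the actual patch may handle it differently):

# the cookie must be byte-identical on every cluster member and readable only by its owner
md5sum /var/lib/rabbitmq/.erlang.cookie   # run on both cloudcontrol hosts and compare
chmod 400 /var/lib/rabbitmq/.erlang.cookie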

Change 511601 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[labs/private@master] rabbit: add dummy erlang cookies + others

https://gerrit.wikimedia.org/r/511601

Change 511601 merged by Andrew Bogott:
[labs/private@master] rabbit: add dummy erlang cookies + others

https://gerrit.wikimedia.org/r/511601

Change 511599 merged by Andrew Bogott:
[operations/puppet@production] rabbitmq: set erlang_cookie for cloud deploys

https://gerrit.wikimedia.org/r/511599

Change 511619 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: allow epmd access from secondary rabbitmq server

https://gerrit.wikimedia.org/r/511619

Change 511619 merged by Andrew Bogott:
[operations/puppet@production] rabbitmq: allow epmd access from secondary rabbitmq server

https://gerrit.wikimedia.org/r/511619

Change 511620 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: open port 25672 for clustering

https://gerrit.wikimedia.org/r/511620

Change 511620 merged by Andrew Bogott:
[operations/puppet@production] rabbitmq: open port 25672 for clustering

https://gerrit.wikimedia.org/r/511620
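
Context: clustering needs two ports open between the peers, in both directions: TCP 4369 for epmd (the earlier patch) and TCP 25672, RabbitMQ's default inter-node/CLI port (this patch). The production rules live in Puppet; a hand-rolled iptables equivalent, with the peer name purely illustrative, would look like:

# on cloudcontrol1003, accept clustering traffic from its peer (mirror this on 1004)
iptables -A INPUT -p tcp -s cloudcontrol1004.wikimedia.org --dport 4369 -j ACCEPT    # epmd
iptables -A INPUT -p tcp -s cloudcontrol1004.wikimedia.org --dport 25672 -j ACCEPT   # inter-node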

Change 511727 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: support primary/secondary rabbitmq hosts

https://gerrit.wikimedia.org/r/511727

Change 511744 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] neutron: support primary/secondary rabbitmq hosts

https://gerrit.wikimedia.org/r/511744

Change 511745 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] designate: support primary/secondary rabbitmq hosts

https://gerrit.wikimedia.org/r/511745

Change 511727 merged by Andrew Bogott:
[operations/puppet@production] nova: support primary/secondary rabbitmq hosts

https://gerrit.wikimedia.org/r/511727

Change 511745 merged by Andrew Bogott:
[operations/puppet@production] designate: support primary/secondary rabbitmq hosts

https://gerrit.wikimedia.org/r/511745

Change 511744 merged by Andrew Bogott:
[operations/puppet@production] neutron: support primary/secondary rabbitmq hosts

https://gerrit.wikimedia.org/r/511744
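
Context: with all three patches merged, nova, neutron, and designate each list both brokers. How quickly a client gives up on one broker and moves to the other is tunable in oslo.messaging; a sketch of the relevant knobs (the values shown are the upstream defaults, not necessarily what we deploy):

[oslo_messaging_rabbit]
# which of the listed brokers to try next after a connection failure
kombu_failover_strategy = round-robin
# seconds before the first reconnect attempt, plus backoff added on each retry
rabbit_retry_interval = 1
rabbit_retry_backoff = 2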

Andrew closed this task as Resolved. May 22 2019, 3:06 AM

I just did a fail-over test in production, and I was still able to create new VMs with rabbit stopped on either cloudvirt1003 or cloudvirt1004.
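
For the record, that test amounts to something like the following, assuming the stock systemd unit name and an authenticated OpenStack CLI session (the flavor and image names are placeholders):

# stop rabbit on one controller...
systemctl stop rabbitmq-server
# ...then confirm VM creation still works through the surviving broker
openstack server create --flavor m1.small --image debian-10 failover-canary
openstack server show failover-canary   # should eventually reach ACTIVE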

Change 513150 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] rabbitmq: open firewalls for rabbit communication both ways

https://gerrit.wikimedia.org/r/513150

Change 513150 merged by Andrew Bogott:
[operations/puppet@production] rabbitmq: open firewalls for rabbit communication both ways

https://gerrit.wikimedia.org/r/513150

Both rabbitMQ servers should be disk nodes. Currently cloudcontrol1004 is a RAM node.
{nodes,[{disc,[rabbit@cloudcontrol1003]},{ram,[rabbit@cloudcontrol1004]}]}

This puts us in a vulnerable position where cloudcontrol1004 is dependent on cloudcontrol1003 to function correctly.

Per the rabbitmq clustering guide[0]:

Since RAM nodes store internal database tables in RAM only, they must sync them from a peer node on startup. This means that a cluster must contain at least one disk node. It is therefore not possible to manually remove the last remaining disk node in a cluster.

I was really surprised to see that the openstack HA guide recommends the second host be a RAM node. There was a bug report[1] about this a while back; the fix was merged, but it still hasn't made it into the published guide.

[0] https://www.rabbitmq.com/clustering.html
[1] https://bugs.launchpad.net/openstack-manuals/+bug/1744647

yep, I definitely just did what the HA guide said to do :/ If we can do active/active with two disk nodes, that seems fine!

btw, if you update the live config, please adjust the docs accordingly: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Rabbitmq

This has been completed:

cloudcontrol1004:~# sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1004 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003]},{ram,[rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]},
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[]},
 {alarms,[{rabbit@cloudcontrol1003,[]},{rabbit@cloudcontrol1004,[]}]}]

cloudcontrol1004:~# rabbitmqctl stop_app
Stopping node rabbit@cloudcontrol1004 ...

cloudcontrol1004:~# rabbitmqctl change_cluster_node_type disc
Turning rabbit@cloudcontrol1004 into a disc node ...

cloudcontrol1004:~# rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1004 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {alarms,[{rabbit@cloudcontrol1003,[]}]}]

cloudcontrol1004:~# rabbitmqctl start_app
Starting node rabbit@cloudcontrol1004 ...

cloudcontrol1004:~# rabbitmqctl cluster_status
Cluster status of node rabbit@cloudcontrol1004 ...
[{nodes,[{disc,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]}]},
 {running_nodes,[rabbit@cloudcontrol1003,rabbit@cloudcontrol1004]},
 {cluster_name,<<"rabbit@cloudcontrol1003.wikimedia.org">>},
 {partitions,[]},
 {alarms,[{rabbit@cloudcontrol1003,[]},{rabbit@cloudcontrol1004,[]}]}]