Galera nodes in codfw1dev seem unable to resync once one of them falls behind and tries to catch up. The issue is a combination of firewall rules and confusion between the internal (10.x) and private (172.x) IPs.
For example, on cloudcontrol2005-dev, the clustering config in 50-server.cnf looks like this:
wsrep_on=ON
wsrep_cluster_address="gcomm://172.20.5.5,172.20.5.6,172.20.5.7"
wsrep_cluster_name="openstack"
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_sst_method=rsync
wsrep_slave_threads=24 # 4x CPU is recommended, but CPU is heavily used by rabbit
# Because we are using this in primary/backup mode, reduce restrictions on replication
wsrep_provider_options="gcs.fc_limit = 256; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes"
wsrep_node_address="10.192.20.24"
wsrep_node_name="cloudcontrol2005-dev.codfw.wmnet"
When this node comes up it tries to sync with cloudcontrol2001-dev, but the traffic reaches 2001-dev from the node's 10.x IP, which the firewall does not permit. If I open the firewall to the 10.x cloudcontrol addresses, things work properly, but I suspect we'd rather have that traffic flow between the 172.x addresses.
This would likely also work if we changed wsrep_node_address to 172.20.5.7 and wsrep_node_name to cloudcontrol2005-dev.private.codfw.wikimedia.cloud. Does this seem like the right change? And if so, how do we find that name and address in puppet?
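For concreteness, here is a sketch of what the two changed lines would look like on cloudcontrol2005-dev. This is untested, the values are simply the ones proposed above, and all other settings would stay as they are. In puppet these would presumably be filled in from per-host data rather than hard-coded.

```ini
# Proposed (untested): advertise the private 172.x address and name
# instead of the 10.x ones, so the node's address matches its entry
# in wsrep_cluster_address.
wsrep_node_address="172.20.5.7"
wsrep_node_name="cloudcontrol2005-dev.private.codfw.wikimedia.cloud"
```

The point of the change is that the address this node advertises would then be one of the three addresses already listed in wsrep_cluster_address, which the firewall already allows.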