
galera sync failures on cloudcontrol2005-dev
Closed, ResolvedPublic

Description

Galera nodes seem unable to sync properly in codfw1dev once one of them falls behind and tries to catch up. The issue is a combination of firewall rules and confusion between the internal (10.x) addresses and the cloud-private (172.x) addresses.

For example, on 2005-dev, the clustering config in 50-server.cnf looks like this:

wsrep_on=ON
wsrep_cluster_address="gcomm://172.20.5.5,172.20.5.6,172.20.5.7"
wsrep_cluster_name="openstack"
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_sst_method=rsync
wsrep_slave_threads=24  # 4x CPU is recommended, but CPU is heavily used by rabbit
# Because we are using this in primary/backup mode, reduce restrictions on replication
wsrep_provider_options="gcs.fc_limit = 256; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes"

wsrep_node_address="10.192.20.24"
wsrep_node_name="cloudcontrol2005-dev.codfw.wmnet"

When this node comes up it tries to sync with cloudcontrol2001-dev, but the traffic arrives at 2001-dev from the 10. IP, which the firewall does not permit. If I open up the firewall to the 10. cloudcontrol addresses things work properly, but I suspect we'd rather have that traffic happen between the 172 addresses.

Likely this would also work if we changed wsrep_node_address to 172.20.5.7 and wsrep_node_name to cloudcontrol2005-dev.private.codfw.wikimedia.cloud (sketched below). Does this seem like the right change? And if so, how do we find that name and address in puppet?
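For illustration, a minimal sketch of what the changed stanza in 50-server.cnf would look like on 2005-dev, assuming 172.20.5.7 really is this host's cloud-private address and the rest of the file stays as above:

wsrep_node_address="172.20.5.7"
wsrep_node_name="cloudcontrol2005-dev.private.codfw.wikimedia.cloud"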

Event Timeline

Change 934493 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] galera: allow to set a different local node name / address

https://gerrit.wikimedia.org/r/934493

Change 934493 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] galera: allow to set a different local node name / address

https://gerrit.wikimedia.org/r/934493

After merging the change, I restarted mariadb on all three cloudcontrols via cumin, and they are now refusing to start:

Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: ./gcs/src/gcs_core.cpp:gcs_core_open():221: Failed to open backend connection: -110 (Connection timed out)
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: ./gcs/src/gcs.cpp:gcs_open():1669: Failed to open channel 'openstack' at 'gcomm://172.20.5.5,172.20.5.6,172.20.5.7': -1>
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: gcs connect failed: Connection timed out
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: wsrep::connect(gcomm://172.20.5.5,172.20.5.6,172.20.5.7) failed: 7
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] Aborting

@Andrew, could you please check if things are better now?

The pool has failed at least once since you opened this ticket, but right now all three nodes show as ready and as part of a 3-node cluster.
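For reference, the cluster state can be confirmed from any node with the standard Galera status variables (generic MariaDB/Galera, nothing WMF-specific); a healthy node reports wsrep_cluster_size = 3, wsrep_ready = ON and wsrep_local_state_comment = Synced:

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_size', 'wsrep_ready', 'wsrep_local_state_comment');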