
galera sync failures on cloudcontrol2005-dev
Closed, ResolvedPublic

Description

Galera nodes seem unable to sync properly in codfw1dev once one of them falls behind and tries to catch up. The issue is a combination of firewall rules and confusion between the internal (10.x) addresses and the cloud-private (172.x) addresses.

For example, on 2005-dev, the clustering config in 50-server.cnf looks like this:

wsrep_on=ON
wsrep_cluster_address="gcomm://172.20.5.5,172.20.5.6,172.20.5.7"
wsrep_cluster_name="openstack"
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_sst_method=rsync
wsrep_slave_threads=24  # 4x CPU is recommended, but CPU is heavily used by rabbit
# Because we are using this in primary/backup mode, reduce restrictions on replication
wsrep_provider_options="gcs.fc_limit = 256; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes"

wsrep_node_address="10.192.20.24"
wsrep_node_name="cloudcontrol2005-dev.codfw.wmnet"

When this node comes up it tries to sync with cloudcontrol2001-dev, but the traffic arrives at 2001-dev from the 10. IP, which the firewall does not permit. If I open up the firewall to the 10. cloudcontrol addresses things work properly, but I suspect we'd rather have that traffic happen between the 172 addresses.

Likely this would also work if we changed wsrep_node_address to 172.20.5.7 and wsrep_node_name to cloudcontrol2005-dev.private.codfw.wikimedia.cloud (sketched below). Does this seem like the right change? And if so, how do we find that name and address in puppet?
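For illustration, a minimal sketch of what the changed stanza in 50-server.cnf would look like on 2005-dev, assuming 172.20.5.7 really is this host's cloud-private address and the rest of the file stays as above:

wsrep_node_address="172.20.5.7"
wsrep_node_name="cloudcontrol2005-dev.private.codfw.wikimedia.cloud"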

Event Timeline

Change 934493 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] galera: allow to set a different local node name / address

https://gerrit.wikimedia.org/r/934493

Change 934493 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] galera: allow to set a different local node name / address

https://gerrit.wikimedia.org/r/934493

After merging the change, I restarted mariadb on all three cloudcontrols via cumin, and they are now refusing to start:

Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: ./gcs/src/gcs_core.cpp:gcs_core_open():221: Failed to open backend connection: -110 (Connection timed out)
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: ./gcs/src/gcs.cpp:gcs_open():1669: Failed to open channel 'openstack' at 'gcomm://172.20.5.5,172.20.5.6,172.20.5.7': -1>
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: gcs connect failed: Connection timed out
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] WSREP: wsrep::connect(gcomm://172.20.5.5,172.20.5.6,172.20.5.7) failed: 7
Jun 30 11:21:35 cloudcontrol2001-dev mariadbd[3100805]: 2023-06-30 11:21:35 0 [ERROR] Aborting

@Andrew, could you please check if things are better now?

The pool has failed at least once since you opened this ticket, but right now all three nodes show as ready and as part of a 3-node cluster.
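For reference, the cluster state can be confirmed from any node with the standard Galera status variables (generic MariaDB/Galera, nothing WMF-specific); a healthy node reports wsrep_cluster_size = 3, wsrep_ready = ON and wsrep_local_state_comment = Synced:

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_size', 'wsrep_ready', 'wsrep_local_state_comment');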