
cloudcontrol1004 galera crash
Closed, Resolved · Public

Description

Alert came in for haproxy "failover" on the Openstack Galera cluster

Notification Type: PROBLEM

Service: WMCS Galera Cluster
Host: cloudcontrol1004
Address: 208.80.154.132
State: CRITICAL

Date/Time: Mon Sept 20 22:40:17 UTC 2021

Notes URLs: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting

Acknowledged by : 

Additional Info:

Error during connection: Can't connect to MySQL server on 208.80.154.132 (115)

The mariadb server crashed.

Sep 20 22:34:07 cloudcontrol1004 mysqld[31114]: 2021-09-20 22:34:07 0 [ERROR] WSREP: Trx 343746060 tries to abort slave trx 343746066. This could be caused by:
Sep 20 22:34:07 cloudcontrol1004 mysqld[31114]:         1) unsupported configuration options combination, please check documentation.
Sep 20 22:34:07 cloudcontrol1004 mysqld[31114]:         2) a bug in the code.
Sep 20 22:34:07 cloudcontrol1004 mysqld[31114]:         3) a database corruption.

Event Timeline

At this point we've stopped the node and ACK'd alerts.

dmesg has lovely bits of hex code if anyone wants to interpret them, but overall it doesn't look good for the storage unless this was a bug in the DB.

echo check > /sys/block/md0/md/sync_action is running, and we can check progress on the RAID check with:

root@cloudcontrol1004:~# cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md0 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      7813185536 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [>....................]  check =  0.3% (25240064/7813185536) finish=6022.1min speed=21552K/sec
      bitmap: 1/59 pages [4KB], 65536KB chunk

When that finishes, we should probably sync the node back up with the cluster. We are using rsync as the SST method, so the node doesn't need a functional database to resync. If simply starting it after a long wait doesn't trigger the state transfer, we can declare a donor: https://www.globo.tech/learning-center/how-to-add-a-new-node-to-a-galera-replication-cluster-on-ubuntu-14-04-lts/

Probably best if the donor is not the primary node so that it isn't the one everything is talking to? We'll find out tomorrow when that RAID check finishes.
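For reference, the SST method and donor choice live in the Galera section of the MariaDB config; a minimal sketch (the hostname on the commented donor line is a placeholder, not necessarily one of our nodes):

```ini
[mysqld]
# rsync SST copies the donor's datadir wholesale, which is why the
# joiner does not need a consistent database to resync
wsrep_sst_method = rsync
# optionally pin a non-primary donor so the node everything is talking
# to keeps serving traffic; placeholder hostname:
# wsrep_sst_donor = some-other-cloudcontrol
```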

When we get this back running, this should probably end up as a runbook.

The RAID check progress is very slow:

aborrero@cloudcontrol1004:~ $ cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md0 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      7813185536 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [=>...................]  check =  9.6% (757275520/7813185536) finish=5593.4min speed=21024K/sec
      bitmap: 1/59 pages [4KB], 65536KB chunk

unused devices: <none>

finish=5593.4min

That's a bit less than 4 days; I guess the disk is under heavy use.
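The day estimate is just mdadm's finish figure divided out; a quick sketch using the value from the mdstat output above (on a live host you would grep the current value out of /proc/mdstat instead):

```shell
# Convert mdadm's "finish=5593.4min" estimate to days
finish_min=5593.4
finish_days=$(awk -v m="$finish_min" 'BEGIN { printf "%.1f", m / 1440 }')
echo "estimated finish: ${finish_days} days"
# prints: estimated finish: 3.9 days
```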

Progress today:

aborrero@cloudcontrol1004:~ $ cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md0 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      7813185536 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      [=======>.............]  check = 38.4% (3001865088/7813185536) finish=3712.8min speed=21597K/sec
      bitmap: 1/59 pages [4KB], 65536KB chunk

unused devices: <none>

It seems it finished:

aborrero@cloudcontrol1004:~ $ cat /proc/mdstat
Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] 
md0 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[3]
      7813185536 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 1/59 pages [4KB], 65536KB chunk

unused devices: <none>
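Once the check line disappears from mdstat, md records its verdict in sysfs: mismatch_cnt holds the number of inconsistencies found, and 0 means all mirror copies agreed. A sketch, simulated with a temp file here since it needs the live sysfs tree; on cloudcontrol1004 you would read /sys/block/md0/md/mismatch_cnt directly:

```shell
# Stand-in for /sys/block/md0/md/mismatch_cnt
mismatch_file=$(mktemp)
echo 0 > "$mismatch_file"
count=$(cat "$mismatch_file")
if [ "$count" -eq 0 ]; then
    echo "RAID check clean: no mismatches"
else
    echo "RAID check found $count mismatches"
fi
rm -f "$mismatch_file"
# prints: RAID check clean: no mismatches
```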

Host rebooted by aborrero@cumin1001 with reason: RAID check

Mentioned in SAL (#wikimedia-cloud) [2021-09-27T09:24:58Z] <arturo> rebooting cloudcontrol1004 for T291446

mysql:root@localhost [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+
1 row in set (0.001 sec)

mysql:root@localhost [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.001 sec)
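The two checks above are what a node-health one-liner would look for: wsrep_ready=ON and wsrep_local_state_comment=Synced. A sketch of that check, simulated with the captured values since it needs a live server; on a cloudcontrol you would pipe `mysql -e "SHOW STATUS LIKE 'wsrep%'"` in instead:

```shell
# Sample of the two status rows shown above (name value per line)
wsrep_status='wsrep_ready ON
wsrep_local_state_comment Synced'

ready=$(echo "$wsrep_status" | awk '$1 == "wsrep_ready" { print $2 }')
state=$(echo "$wsrep_status" | awk '$1 == "wsrep_local_state_comment" { print $2 }')

if [ "$ready" = "ON" ] && [ "$state" = "Synced" ]; then
    echo "galera node healthy"
else
    echo "galera node NOT healthy: ready=$ready state=$state"
fi
# prints: galera node healthy
```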


Apparently we didn't need to declare a donor: the DB synced itself somehow.

Change 724008 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: trove: enable service by default

https://gerrit.wikimedia.org/r/724008

Change 724008 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: trove: enable service by default

https://gerrit.wikimedia.org/r/724008

aborrero claimed this task.

Mentioned in SAL (#wikimedia-cloud) [2021-09-27T10:07:43Z] <arturo> cloudcontrol1004 apparently healthy T291446