Page MenuHomePhabricator

Galera on cloudcontrol1004 going out of sync
Closed, ResolvedPublic

Description

Every few hours cloudcontrol1004 galera is getting out of sync and depooling itself.

I don't know exactly what's happening but it started happening as soon as we split the cloud control cluster between Buster and Bullseye (which is also different mariadb versions). I suspect this will stop happening when we get everything moved to Bullseye.

In the meantime, it seems to help to stop/start the corrupted service, but it might well recover on its own anyway.

Event Timeline

This happened again today:

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------------+
| Variable_name             | Value        |
+---------------------------+--------------+
| wsrep_local_state_comment | Inconsistent |
+---------------------------+--------------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | OFF   |
+---------------+-------+
1 row in set (0.001 sec)

Let me restart the mariadb service.

Mentioned in SAL (#wikimedia-cloud) [2022-02-22T09:13:22Z] <arturo> restarting mariadb @ cloudcontrol1004 (T302146)

aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

After the restart:

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+
1 row in set (0.001 sec)

Will keep an eye on this.

Mentioned in SAL (#wikimedia-cloud) [2022-02-22T09:24:29Z] <arturo> restarting mariadb @ cloudcontrol1003 (T302146)

shortly after, cloudcontrol1003 showed the same:

aborrero@cloudcontrol1003:~ 46s $ sudo mysql -u root
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 186405
Server version: 10.5.12-MariaDB-0+deb11u1-log Debian 11

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------------+
| Variable_name             | Value        |
+---------------------------+--------------+
| wsrep_local_state_comment | Inconsistent |
+---------------------------+--------------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | OFF   |
+---------------+-------+
1 row in set (0.001 sec)

and I did the same, restart the mariadb service:

aborrero@cloudcontrol1003:~ 8s $ sudo systemctl restart mariadb

and then:

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_ready";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready   | ON    |
+---------------+-------+
1 row in set (0.001 sec)

MariaDB [(none)]> SHOW STATUS LIKE "wsrep_local_state_comment";
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.001 sec)

Could there be some data bug circulating between the galera nodes?

Note that this last round of alerts happened after all three nodes were running the same mariadb version: 1:10.5.12-0+deb11u1. The only straw left for me to grasp is that 1004 doesn't have a proper, fresh reimage with Bullseye yet so /maybe/ there's something hanging around there that's messing with us.

Current status:
cloudcontrol1004 -> mysql is stopped
cloudcontrol1003 -> sees the cluster size at 2, itself as primary, and everything synced and running

mysql:root@localhost [(none)]> show global status like 'wsrep_%';
...
| wsrep_local_state_uuid        | 2c772340-9361-11ec-9f39-77170428a726                                                                                                           |
...
| wsrep_last_committed          | 747314                                                                                                                                         |
| wsrep_replicated              | 335                                                                                                                                            |
...
| wsrep_local_state             | 4                                                                                                                                              |
| wsrep_local_state_comment     | Synced                                                                                                                                         |
...
| wsrep_incoming_addresses      | AUTO,AUTO                                                                                                                                      |
...
| wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
| wsrep_cluster_conf_id         | 17                                                                                                                                             |
...
| wsrep_cluster_size            | 2                                                                                                                                              |
| wsrep_cluster_state_uuid      | 2c772340-9361-11ec-9f39-77170428a726                                                                                                           |
| wsrep_cluster_status          | Primary                                                                                                                                        |
| wsrep_connected               | ON                                                                                                                                             |
...
| wsrep_ready                   | ON                                                                                                                                             |
...

cloudcontrol1005 -> sees the cluster size as 3, itself as primary and everything running ok:

mysql:root@localhost [information_schema]> SHOW GLOBAL STATUS LIKE 'wsrep_%';
...
| wsrep_local_state_uuid        | 2c772340-9361-11ec-9f39-77170428a726                                                                                                           |
...
| wsrep_last_committed          | 746881                                                                                                                                         |
| wsrep_replicated              | 549500                                                                                                                                         |
...
| wsrep_local_state_comment     | Synced                                                                                                                                         |
...
| wsrep_incoming_addresses      | AUTO,AUTO,AUTO                                                                                                                                 |
...
| wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
| wsrep_cluster_conf_id         | 16                                                                                                                                             |
...
| wsrep_cluster_size            | 3                                                                                                                                              |
| wsrep_cluster_state_uuid      | 2c772340-9361-11ec-9f39-77170428a726                                                                                                           |
| wsrep_cluster_status          | Primary                                                                                                                                        |
| wsrep_connected               | ON                                                                                                                                             |
...
| wsrep_ready                   | ON                                                                                                                                             |
...

That seems to point out that the cluster is partitioned :S

Hmm, I see this in the logs:
cloudcontrol1005> journalct -u mariadb

...
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: declaring 3b306ca0-b063 at tcp://208.80.154.23:4567 stable
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: forgetting a1eaf3a5-9bfd (tcp://208.80.154.132:4567)
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: Node 3b306ca0-b063 state prim
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: view(view_id(PRIM,3b306ca0-b063,624) memb {
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         3b306ca0-b063,0
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         42baed89-b710,0
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: } joined {
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: } left {
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: } partitioned {
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         a1eaf3a5-9bfd,0
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: })
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: save pc into disk
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: forgetting a1eaf3a5-9bfd (tcp://208.80.154.132:4567)
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: STATE EXCHANGE: sent state msg: ae3c22fb-940e-11ec-b633-47de893d11d2
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: STATE EXCHANGE: got state msg: ae3c22fb-940e-11ec-b633-47de893d11d2 from 0 (cloudcontrol1003.wikimedia.org)
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: STATE EXCHANGE: got state msg: ae3c22fb-940e-11ec-b633-47de893d11d2 from 1 (cloudcontrol1005.wikimedia.org)
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: Quorum results:
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         version    = 6,
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         component  = PRIMARY,
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         conf_id    = 16,
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         members    = 2/2 (joined/total),
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         act_id     = 747300,
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         last_appl. = 746804,
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         protocols  = 2/10/4 (gcs/repl/appl),
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         vote policy= 0,
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]:         group UUID = 2c772340-9361-11ec-9f39-77170428a726
Feb 22 18:38:51 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:51 0 [Note] WSREP: Flow-control interval: [253, 256]
Feb 22 18:38:56 cloudcontrol1005 mariadbd[2823]: 2022-02-22 18:38:56 0 [Note] WSREP:  cleaning up a1eaf3a5-9bfd (tcp://208.80.154.132:4567)

Pointing that it sees only 2 members, but the status says 3, so seems that it's failing to update it's own status?

Another thing I saw, is that the docs say:
note that innodb_buffer_pool_instances is ignored from MariaDB 10.5, and removed in MariaDB 10.6, as the buffer pool is no longer split into multiple instances)

And Majavah pointed out that there's a single log in the logs right around the time the issues started with:

Feb 22 17:06:11 cloudcontrol1005 mariadbd[2823]: 2022-02-22 17:06:11 310880 [Note] InnoDB: Number of pools: 2

all the previous logs said "Number of pools: 1" instead

so might look into that too.

According to https://galeracluster.com/library/training/tutorials/galera-monitoring.html this value means that the server is struggling to keep up with the incoming write-sets:

| wsrep_local_recv_queue_avg    | 3.98482                                                                                                                                        |

Pointing also to some issue/bottleneck writing to disk

Hmmm... and flush status seems to hang on cloudcontrol1005 and cloudcontrtol1003 both, though as far as I know it should be pretty fast.

fio tests on /srv:

cloudcontrol1003:

Run status group 0 (all jobs):
   READ: bw=1026KiB/s (1051kB/s), 1026KiB/s-1026KiB/s (1051kB/s-1051kB/s), io=60.1MiB (63.1MB), run=60015-60015msec
  WRITE: bw=1057KiB/s (1083kB/s), 1057KiB/s-1057KiB/s (1083kB/s-1083kB/s), io=61.0MiB (64.0MB), run=60015-60015msec

Disk stats (read/write):
    dm-2: ios=15394/15907, merge=0/0, ticks=685320/1254072, in_queue=1939392, util=89.95%, aggrios=15394/18776, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md0: ios=15394/18776, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4171/8949, aggrmerge=307/573, aggrticks=157779/413761, aggrin_queue=571541, aggrutil=89.22%
  sdd: ios=6847/8883, merge=307/554, ticks=278732/603889, in_queue=882622, util=89.22%
  sdc: ios=1575/8875, merge=306/556, ticks=49205/289822, in_queue=339028, util=61.59%
  sdb: ios=1892/9019, merge=307/592, ticks=66802/320207, in_queue=387008, util=62.73%
  sda: ios=6370/9019, merge=309/592, ticks=236380/441127, in_queue=677506, util=77.51%

cloudcontrol1004:

Run status group 0 (all jobs):
   READ: bw=1144KiB/s (1172kB/s), 1144KiB/s-1144KiB/s (1172kB/s-1172kB/s), io=67.2MiB (70.5MB), run=60173-60173msec
  WRITE: bw=1174KiB/s (1202kB/s), 1174KiB/s-1174KiB/s (1202kB/s-1202kB/s), io=68.0MiB (72.3MB), run=60173-60173msec

Disk stats (read/write):
    dm-2: ios=17215/17706, merge=0/0, ticks=507248/1477752, in_queue=1985000, util=98.91%, aggrios=17215/20167, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md0: ios=17215/20167, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4304/9996, aggrmerge=0/163, aggrticks=126588/512548, aggrin_queue=639137, aggrutil=96.39%
  sdd: ios=4074/9964, merge=0/187, ticks=119639/493258, in_queue=612897, util=93.53%
  sdc: ios=4626/9968, merge=0/183, ticks=133861/525053, in_queue=658914, util=95.60%
  sdb: ios=3792/10024, merge=0/146, ticks=114140/478394, in_queue=592533, util=93.39%
  sda: ios=4727/10031, merge=0/139, ticks=138714/553490, in_queue=692204, util=96.39%

cloudcontrol1005:

Run status group 0 (all jobs):
   READ: bw=189MiB/s (198MB/s), 189MiB/s-189MiB/s (198MB/s-198MB/s), io=11.1GiB (11.9GB), run=60002-60002msec
  WRITE: bw=189MiB/s (198MB/s), 189MiB/s-189MiB/s (198MB/s-198MB/s), io=11.1GiB (11.9GB), run=60002-60002msec

Disk stats (read/write):
    dm-2: ios=2898286/2897785, merge=0/0, ticks=2739356/2116568, in_queue=4855924, util=99.96%, aggrios=2907364/2910607, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
    md0: ios=2907364/2910607, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=729971/1454983, aggrmerge=1145/405, aggrticks=540141/745663, aggrin_queue=1285804, aggrutil=99.94%
  sdd: ios=747996/1455205, merge=1144/376, ticks=556469/750048, in_queue=1306517, util=99.94%
  sdc: ios=712061/1455216, merge=1142/365, ticks=526602/746259, in_queue=1272862, util=99.93%
  sdb: ios=710683/1454759, merge=1143/438, ticks=524357/741483, in_queue=1265840, util=99.94%
  sda: ios=749147/1454755, merge=1153/442, ticks=553137/744862, in_queue=1297998, util=99.93%

No shock there, 1005 has ssds so is much speedier.

I've rebuilt and re-synced cloudcontrol1004; the average sync time is creeping downwards. We'll see where it lands.

This doesn't fit the behavior I'm seeing perfectly but seems nevertheless important:

https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1839108.html

I have disabled puppet on 1004 and also stopped mariadb. Going to leave that for a few days, for science. If it turns out that everything works like that, then we have to figure out if the problem is really 1004 or if it's inherent with three nodes.

In theory it's ok to only have two nodes since haproxy is pointing all traffic at 1005 anyway.

Mariadb10.5.15 can be found here: https://mariadb.org/download/?t=repo-config&d=Debian+11+%22Bullseye%22&v=10.5&r_m=xtom_fre

Probably worth upgrading to that although I'm not sure I want to dive in right before going on vacation.

Mentioned in SAL (#wikimedia-cloud) [2022-02-24T15:42:56Z] <dcaro> stopping and starting mariadb on cloudcontrol1003 (T302146)

The new package upgrades are behaving well, so far. I remain hopeful that this has fixed everything but am still holding my breath.

So far so good, I was tunning a bit the dashboard (still a lot to go), but the lags are going down bit by bit, and no issues so far, replication still works (created a DB on each node, to see it replicated to the others).
Will let it run through the weekend.

aborrero lowered the priority of this task from High to Medium.Mar 1 2022, 9:53 AM

The cluster seems stable now, I'll close this task and open a new one in case it starts misbehaving again.

dcaro claimed this task.