Page MenuHomePhabricator

[toolsdb] Enable gtid to help replication recovery
Closed, ResolvedPublic

Description

Enabling Global Transaction ID (https://dev.mysql.com/doc/refman/5.6/en/replication-gtids-concepts.html) should help avoiding having corrupted binary logs and being able to restart the replication process from the exact moment that the secondary stopped, avoiding the drift in the database data caused by skipping binary log entries.

This task is to:

  • Investigate if this was tried before (seems to be the case) and why it was stopped
  • Investigate how to set it up
  • Set it up

Event Timeline

dcaro triaged this task as High priority.Feb 17 2022, 3:20 PM
dcaro created this task.

Change 789588 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] toolsdb: add gtid_domain_id

https://gerrit.wikimedia.org/r/789588

Note MariaDB and MySQL GTID implementations are different, and the correct documentation is https://mariadb.com/kb/en/gtid/

In theory this would be as simple as

STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=slave_pos; START SLAVE;

We don't strictly need gtid_domain_id set before, but I don't see any reason to not set it either, so uploaded a patch to set it like any other production database host. I think we can set that with SET GLOBAL gtid_domain_id without restarting the primary as Puppet changing the config file won't restart the service.

Change 789588 merged by David Caro:

[operations/puppet@production] toolsdb: add gtid_domain_id

https://gerrit.wikimedia.org/r/789588

This is clouddb1002 (secondary) after setting gtid_domain_id on both servers:

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+------------------------------------------------------+
| @@GLOBAL.gtid_slave_pos                              |
+------------------------------------------------------+
| 0-2886731673-33519859088,2886731673-2886731673-18688 |
+------------------------------------------------------+
1 row in set (0.00 sec)

Is it intentional that there are two entries and the first one starts with 0-?

taavi claimed this task.

Looks good:

MariaDB [(none)]> show slave status\G
[...]
                  Using_Gtid: No
                  Gtid_IO_Pos:
[...]

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> CHANGE MASTER TO MASTER_USE_GTID=slave_pos;
Query OK, 0 rows affected (0.02 sec)

MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show slave status\G
[...]
                  Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-2886731673-33520905397,2886731673-2886731673-13720920
[...]

And verified manually that updated rows are getting replicated.