Page MenuHomePhabricator

ToolsDB: discard obsolete GTID domains
Closed, ResolvedPublic

Description

The GTID in the ToolsDB servers (tools-db-1 and tools-db-2) contains two obsolete domain ids that can be safely discarded as explained at https://mariadb.com/kb/en/gtid/#deleting-unused-domains

Domain id 0 is from an old server that was deleted years ago and can be discarded.

Domain id 2886731673 was the id used by clouddb1001 before it was shut down in T333471, and can also be discarded.

2886731301 is the id of tools-db-1 and the one we want to keep because it's the one currently being used for replicating from tools-db-1 to tools-db-2.

//tools-db-1

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+-------------------------------------------------------------------------------------------+
| Variable_name           | Value                                                                                     |
+-------------------------+-------------------------------------------------------------------------------------------+
| gtid_binlog_pos         | 2886731301-2886731301-143124522,2886731673-2886731673-4887243158                          |
| gtid_binlog_state       | 2886731301-2886731301-143124522,2886731673-2886731673-4887243158                          |
| gtid_cleanup_batch_size | 64                                                                                        |
| gtid_current_pos        | 0-2886731673-33522724637,2886731301-2886731301-143124522,2886731673-2886731673-4887243158 |
| gtid_domain_id          | 2886731301                                                                                |
| gtid_ignore_duplicates  | OFF                                                                                       |
| gtid_pos_auto_engines   |                                                                                           |
| gtid_slave_pos          | 0-2886731673-33522724637,2886731673-2886731673-4887243158                                 |
| gtid_strict_mode        | OFF                                                                                       |
+-------------------------+-------------------------------------------------------------------------------------------+

//tools-db-2

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+-------------------------------------------------------------------------------------------+
| Variable_name           | Value                                                                                     |
+-------------------------+-------------------------------------------------------------------------------------------+
| gtid_binlog_pos         | 2886731301-2886731301-143124674,2886731673-2886731673-4887243158                          |
| gtid_binlog_state       | 2886731301-2886731301-143124674,2886731673-2886731673-4887243158                          |
| gtid_cleanup_batch_size | 64                                                                                        |
| gtid_current_pos        | 0-2886731673-33522724637,2886731301-2886731301-143124674,2886731673-2886731673-4887243158 |
| gtid_domain_id          | 2886730873                                                                                |
| gtid_ignore_duplicates  | OFF                                                                                       |
| gtid_pos_auto_engines   |                                                                                           |
| gtid_slave_pos          | 0-2886731673-33522724637,2886731301-2886731301-143124674,2886731673-2886731673-4887243158 |
| gtid_strict_mode        | OFF                                                                                       |
+-------------------------+-------------------------------------------------------------------------------------------+

Event Timeline

fnegri updated the task description. (Show Details)

My plan is to run on both tools-db-1 (primary) and tools-db-2 (secondary):

FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0,2886731673);

From the documentation, it looks like I will also have to manually update gtid_slave_pos in tools-db-2 (secondary):

SET gtid_slave_pos="<position with domains removed>"`

And probably also clear gtid_slave_pos in tools-db-1 (primary) as it's no longer replicating from other servers.

SET GLOBAL gtid_slave_pos = '';
fnegri changed the task status from Open to Stalled.Apr 18 2023, 4:02 PM
fnegri added subscribers: Marostegui, jcrespo.

After speaking with @Marostegui and @jcrespo, I'm postponing this for now as they experienced a few issues while trying to do the same thing in other servers. I'm linking this to T324965 that is tracking similar clean-up operations.

After some testing in a local dev environment, I am quite confident this can be done safely in ToolsDB.

I started by clearing up gtid_slave_pos in the current primary (tools-db-4), as it was only containing old domains. That also simplified gtid_current_pos which is computed from the values of gtid_binlog_pos and gtid_slave_pos.

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+-------------------------+-----------------------------------------------------------------------------------------------------------------------------+
| Variable_name           | Value                                                                                                                       |
+-------------------------+-----------------------------------------------------------------------------------------------------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-3085230359,2886731301-2886731301-9248274392                                                           |
| gtid_binlog_state       | 2886729896-2886729896-3085230359,2886731301-2886731301-9248274392                                                           |
| gtid_cleanup_batch_size | 64                                                                                                                          |
| gtid_current_pos        | 0-2886731673-33522724637,2886729896-2886729896-3085230359,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158 |
| gtid_domain_id          | 2886729896                                                                                                                  |
| gtid_ignore_duplicates  | OFF                                                                                                                         |
| gtid_pos_auto_engines   |                                                                                                                             |
| gtid_slave_pos          | 0-2886731673-33522724637,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158                                  |
| gtid_strict_mode        | OFF                                                                                                                         |
| wsrep_gtid_domain_id    | 0                                                                                                                           |
| wsrep_gtid_mode         | OFF                                                                                                                         |
+-------------------------+-----------------------------------------------------------------------------------------------------------------------------+
11 rows in set (0.002 sec)

MariaDB [(none)]> SET GLOBAL gtid_slave_pos = '';
Query OK, 0 rows affected, 1 warning (0.046 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+-------------------------+-------------------------------------------------------------------+
| Variable_name           | Value                                                             |
+-------------------------+-------------------------------------------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-3085231778,2886731301-2886731301-9248274392 |
| gtid_binlog_state       | 2886729896-2886729896-3085231778,2886731301-2886731301-9248274392 |
| gtid_cleanup_batch_size | 64                                                                |
| gtid_current_pos        | 2886729896-2886729896-3085231778                                  |
| gtid_domain_id          | 2886729896                                                        |
| gtid_ignore_duplicates  | OFF                                                               |
| gtid_pos_auto_engines   |                                                                   |
| gtid_slave_pos          |                                                                   |
| gtid_strict_mode        | OFF                                                               |
| wsrep_gtid_domain_id    | 0                                                                 |
| wsrep_gtid_mode         | OFF                                                               |
+-------------------------+-------------------------------------------------------------------+
11 rows in set (0.002 sec)
fnegri changed the task status from Stalled to In Progress.May 14 2025, 3:05 PM
fnegri edited projects, added Toolforge (Toolforge iteration 20); removed Toolforge.

FLUSH BINARY LOGS DELETE_DOMAIN_ID worked and we're left with only one domain_id:

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+-------------------------+-------------------------------------------------------------------+
| Variable_name           | Value                                                             |
+-------------------------+-------------------------------------------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-3085904864,2886731301-2886731301-9248274392 |
| gtid_binlog_state       | 2886729896-2886729896-3085904864,2886731301-2886731301-9248274392 |
| gtid_cleanup_batch_size | 64                                                                |
| gtid_current_pos        | 2886729896-2886729896-3085904864                                  |
| gtid_domain_id          | 2886729896                                                        |
| gtid_ignore_duplicates  | OFF                                                               |
| gtid_pos_auto_engines   |                                                                   |
| gtid_slave_pos          |                                                                   |
| gtid_strict_mode        | OFF                                                               |
| wsrep_gtid_domain_id    | 0                                                                 |
| wsrep_gtid_mode         | OFF                                                               |
+-------------------------+-------------------------------------------------------------------+
11 rows in set (0.002 sec)

MariaDB [(none)]> FLUSH BINARY LOGS DELETE_DOMAIN_ID=(2886731301);
Query OK, 0 rows affected (0.101 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
+-------------------------+----------------------------------+
| Variable_name           | Value                            |
+-------------------------+----------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-3085912177 |
| gtid_binlog_state       | 2886729896-2886729896-3085912177 |
| gtid_cleanup_batch_size | 64                               |
| gtid_current_pos        | 2886729896-2886729896-3085912177 |
| gtid_domain_id          | 2886729896                       |
| gtid_ignore_duplicates  | OFF                              |
| gtid_pos_auto_engines   |                                  |
| gtid_slave_pos          |                                  |
| gtid_strict_mode        | OFF                              |
| wsrep_gtid_domain_id    | 0                                |
| wsrep_gtid_mode         | OFF                              |
+-------------------------+----------------------------------+
11 rows in set (0.001 sec)

In the replica I could not remove all the obsolete domain IDs yet, because the replica is currently lagging (T393766: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-05-09) and it's still referencing binlogs in the primary where two domain IDs were present (they were created before I ran FLUSH BINARY LOGS DELETE_DOMAIN_ID):

root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose log.040364 |grep -A1 "Gtid list"
#250514 14:11:43 server id 2886729896  end_log_pos 315 CRC32 0x15fbeef4         Gtid list [2886729896-2886729896-3085906458,
# 2886731301-2886731301-9248274392]

root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose log.040365 |grep -A1 "Gtid list"
#250514 14:12:06 server id 2886729896  end_log_pos 299 CRC32 0x8614092f         Gtid list [2886729896-2886729896-3085911452]
# at 299

In the replica, I tried SET GLOBAL gtid_slave_pos = '2886729896-2886729896-2995730451'; but that resulted in:

May 14 14:43:45 tools-db-5 mysqld[2130730]: 2025-05-14 14:43:45 701 [Note] Slave I/O thread: connected to master 'repl@tools-db-4.tools.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '2886729896-2886729896-2995762075'
May 14 14:43:45 tools-db-5 mysqld[2130730]: 2025-05-14 14:43:45 701 [ERROR] Error reading packet from server: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged. (server_errno=1236)
May 14 14:43:45 tools-db-5 mysqld[2130730]: 2025-05-14 14:43:45 701 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.', Internal MariaDB error code: 1236

This worked: SET GLOBAL gtid_slave_pos = '2886729896-2886729896-2995730451,2886731301-2886731301-9248274392';.

As soon as the replica is back in sync with the primary, I should be able to remove the obsolete 2886731301 from gtid_slave_pos.

fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q3-Q4) board.

The replica is not back in sync yet, but it's past the moment where I discarded the domain on the primary (log.040365 on the primary). I stopped the slave and successfully removed the obsolete domain:

STOP SLAVE;
SET GLOBAL gtid_slave_pos = '2886729896-2886729896-3090015973';
START SLAVE;

We now have a single domain id both in the primary and the replica:

# tools-db-4

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+----------------------------------+
| Variable_name           | Value                            |
+-------------------------+----------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-3099698294 |
| gtid_binlog_state       | 2886729896-2886729896-3099698294 |
| gtid_cleanup_batch_size | 64                               |
| gtid_current_pos        | 2886729896-2886729896-3099698294 |
| gtid_domain_id          | 2886729896                       |
| gtid_ignore_duplicates  | OFF                              |
| gtid_pos_auto_engines   |                                  |
| gtid_slave_pos          |                                  |
| gtid_strict_mode        | OFF                              |
+-------------------------+----------------------------------+
# tools-db-5

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+----------------------------------+
| Variable_name           | Value                            |
+-------------------------+----------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-3090512918 |
| gtid_binlog_state       | 2886729896-2886729896-3090512918 |
| gtid_cleanup_batch_size | 64                               |
| gtid_current_pos        | 2886729896-2886729896-3090512917 |
| gtid_domain_id          | 2886730325                       |
| gtid_ignore_duplicates  | OFF                              |
| gtid_pos_auto_engines   |                                  |
| gtid_slave_pos          | 2886729896-2886729896-3090512918 |
| gtid_strict_mode        | OFF                              |
+-------------------------+----------------------------------+