
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-05-09
Closed, Resolved, Public

Description

The replica is lagging because of a big DELETE transaction that translated into many row events in row-based replication:

root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position 34717375 log.038614 |head -n30
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 34717375
#250508 21:20:10 server id 2886729896  end_log_pos 34717417 CRC32 0x28242e8f    GTID 2886729896-2886729896-2995730451 trans
/*M!100101 SET @@session.skip_parallel_replication=0*//*!*/;
/*M!100001 SET @@session.gtid_domain_id=2886729896*//*!*/;
/*M!100001 SET @@session.server_id=2886729896*//*!*/;
/*M!100001 SET @@session.gtid_seq_no=2995730451*//*!*/;
START TRANSACTION
/*!*/;
# at 34717417
# at 34717460
#250508 21:20:10 server id 2886729896  end_log_pos 34717460 CRC32 0x205c13bb    Annotate_rows:
#Q> DELETE FROM qid_rank
#250508 21:20:10 server id 2886729896  end_log_pos 34717636 CRC32 0x6ef18ac7    Table_map: `s56470__misssearch`.`qid_rank` mapped to number 6896083
# at 34717636
# at 34725823
# at 34734010
# at 34742197
# at 34750384
# at 34758571
# at 34766758
# at 34774945
# at 34783132
# at 34791319
# at 34799506
# at 34807693
# at 34815880

[...]

# at 213325228
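
For scale, the first and last row-event offsets above give a rough size for this one transaction in the binlog, assuming position 213325228 is close to its end:

-- back-of-the-envelope size of the DELETE transaction in log.038614
SELECT ROUND((213325228 - 34717636) / 1024 / 1024) AS approx_size_mb;  -- ≈ 170 MB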

Event Timeline

fnegri changed the task status from Open to In Progress.
fnegri triaged this task as High priority.

I expect the replica will eventually catch up, but based on previous experience it might take multiple days. If it's still lagging on Monday, I will consider recreating the replica from scratch.

The replica is still processing the same transaction. Judging from the output of SHOW ENGINE INNODB STATUS it will take another day or two:

13046 lock struct(s), heap size 1482872, 1395741 row lock(s), undo log entries 1382697
MariaDB thread id 704, OS thread handle 140111951222464, query id 441571361 Delete_rows_log_event::find_row(-1) on table `qid_rank`

There are 1395741 row locks (growing by about one per second) and the table has 1505358 rows.

MariaDB [(none)]> SELECT COUNT(*) FROM s56470__misssearch.qid_rank;
+----------+
| COUNT(*) |
+----------+
|  1505358 |
+----------+
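
At that rate, a rough estimate from the numbers above matches the "another day or two" guess:

-- remaining rows divided by the observed rate of ~1 row lock per second, in hours
SELECT ROUND((1505358 - 1395741) / 3600.0, 1) AS approx_hours_left;  -- ≈ 30 hours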

To speed things up, I think it would be safe to run the stuck transaction manually on the replica (DELETE FROM qid_rank). That should complete quickly because it's a single statement, while with row-based replication it is translated into roughly 1.5 million individual "row delete" actions.

This is similar to what I did in the past, see T368250: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-06-21.

I will look into this tomorrow.

The relevant value is not row locks but undo log entries, which is currently at 1501638, so very close to the row count of 1505358.
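
As an aside, the same per-transaction counters are also exposed in information_schema.INNODB_TRX, which is easier to query than parsing SHOW ENGINE INNODB STATUS (not what I used here, but it should be equivalent):

-- trx_rows_locked roughly matches "row lock(s)",
-- trx_rows_modified roughly matches "undo log entries"
SELECT trx_started, trx_state, trx_rows_locked, trx_rows_modified
FROM information_schema.INNODB_TRX;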

I will proceed anyway with the manual procedure, because it's a good opportunity to test it and document it for future situations.

This is what I will run on the replica:

-- tools-db-5
STOP SLAVE;
SET sql_log_bin = 0;
USE s56470__misssearch;
DELETE FROM qid_rank;
SET sql_log_bin = 1;
SET GLOBAL gtid_slave_pos = '2886729896-2886729896-2995730451';
START SLAVE;

The value for gtid_slave_pos matches the GTID of the currently stuck transaction that we want to skip. We cannot use sql_slave_skip_counter because we have multiple replication domains; I plan to fix that in T334947: ToolsDB: discard obsolete GTID domains, but setting gtid_slave_pos is an easy workaround, as suggested by the official docs.

The current value of gtid_slave_pos is:

MariaDB [(none)]> SELECT @@gtid_slave_pos;
+-----------------------------------------------------------------------------------------------------------------------------+
| @@gtid_slave_pos                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------+
| 0-2886731673-33522724637,2886729896-2886729896-2995730450,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158 |
+-----------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)

The important part is 2886729896-2886729896-2995730450; the additional GTID positions for the other domains can be ignored, as they reference old domains that are no longer in use.

STOP SLAVE; did not work: it just hangs, waiting for the transaction to complete. Killing the query doesn't work either, because it only kills one "row event" and not the full transaction. SHUTDOWN did work, and I can no longer connect to the MySQL port, but the mysqld process is still using 100% CPU. systemctl stop mariadb is also hanging, and systemctl status mariadb shows deactivating (stop-sigterm).

I'll give it a few minutes to see if it completes the shutdown.

I think STOP SLAVE hung because it was trying to roll back the partially-applied transaction. I waited 1 hour, then killed the mysqld process with kill -9.

After restarting the service, MariaDB attempted to resume replication from the same transaction, but this time STOP SLAVE worked and I could proceed with the plan above:

STOP SLAVE;
SET sql_log_bin = 0;
USE s56470__misssearch;
DELETE FROM qid_rank;
SET sql_log_bin = 1;
SET GLOBAL gtid_slave_pos = '2886729896-2886729896-2995730451';
START SLAVE;

This worked, except that MariaDB doesn't like the short version of gtid_slave_pos and complains with:

May 14 14:43:45 tools-db-5 mysqld[2130730]: 2025-05-14 14:43:45 701 [Note] Slave I/O thread: connected to master 'repl@tools-db-4.tools.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '2886729896-2886729896-2995762075'
May 14 14:43:45 tools-db-5 mysqld[2130730]: 2025-05-14 14:43:45 701 [ERROR] Error reading packet from server: Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged. (server_errno=1236)
May 14 14:43:45 tools-db-5 mysqld[2130730]: 2025-05-14 14:43:45 701 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.', Internal MariaDB error code: 1236

The binlogs on the primary contain two domain IDs, so both are required:

root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose log.038615 |grep -A1 "Gtid list"
#250508 21:20:10 server id 2886729896  end_log_pos 315 CRC32 0x3264e3a5         Gtid list [2886729896-2886729896-2995730451,
# 2886731301-2886731301-9248274392]
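
The same set of domains should also be visible directly on the primary, without reading the binlog files:

-- tools-db-4: the binlog GTID state lists every domain still present in the retained binlogs
SELECT @@gtid_binlog_state;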

This worked:

SET GLOBAL gtid_slave_pos = '2886729896-2886729896-2995730451,2886731301-2886731301-9248274392';
START SLAVE;

The replica is now catching up from the primary.
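
To follow the catch-up, the standard replication status query is enough (Seconds_Behind_Master should trend back to zero):

-- tools-db-5
SHOW SLAVE STATUS\G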

GTIDs in the replica are now more readable, but there is still one domain that can be removed:

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+-------------------------------------------------------------------+
| Variable_name           | Value                                                             |
+-------------------------+-------------------------------------------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-2995762075                                  |
| gtid_binlog_state       | 2886729896-2886729896-2995762075                                  |
| gtid_cleanup_batch_size | 64                                                                |
| gtid_current_pos        | 2886729896-2886729896-2995762075,2886731301-2886731301-9248274392 |
| gtid_domain_id          | 2886730325                                                        |
| gtid_ignore_duplicates  | OFF                                                               |
| gtid_pos_auto_engines   |                                                                   |
| gtid_slave_pos          | 2886729896-2886729896-2995762075,2886731301-2886731301-9248274392 |
| gtid_strict_mode        | OFF                                                               |
+-------------------------+-------------------------------------------------------------------+
9 rows in set (0.002 sec)

I will do this as part of T334947: ToolsDB: discard obsolete GTID domains.
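
For future reference, that cleanup will probably boil down to something like the following on the primary, once no retained binlog still contains events from the obsolete domain (a sketch, assuming 2886731301 is the domain to drop):

-- tools-db-4
FLUSH BINARY LOGS DELETE_DOMAIN_ID = (2886731301);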

fnegri moved this task from In Progress to Done on the Toolforge (Toolforge iteration 20) board.

The replica is back in sync.

Screenshot 2025-05-15 at 14.41.27.png (89 KB)