
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-06-21
Closed, Resolved · Public

Description

Replication is taking days to complete the following transaction (an UPDATE on a table with 73 million rows and no index). The table no longer exists in the primary database.

root@tools-db-1:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position 21942871 log.097083 |less
[...]
#Q> update phash_dhash set dhashv=NULL
#240621 17:07:33 server id 2886731301  end_log_pos 21943057 CRC32 0xf3743fa7    Table_map: `s55462__imagehashpublic_p`.`phash_dhash` mapped to number 123368073
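
For context, a transaction stuck like this typically shows up on the replica as an SQL thread that sits on the same event while the lag keeps growing; a hypothetical way to observe it (illustrative, not taken from the incident):

-- tools-db-3 (replica); illustrative only
SHOW SLAVE STATUS\G        -- Seconds_Behind_Master keeps growing while Exec_Master_Log_Pos stays on the stuck transaction
SHOW FULL PROCESSLIST;     -- the "system user" threads show the row events being applied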

Event Timeline

fnegri changed the task status from Open to In Progress. Jun 24 2024, 9:38 AM
fnegri triaged this task as High priority.

After 5 days, replication is still stuck on the same transaction. I will try to apply the transaction manually, with the following commands on the replica:

-- tools-db-3
STOP SLAVE;
SET sql_log_bin = 0;  -- keep the manual fix out of the replica's own binlog
USE s55462__imagehashpublic_p;
UPDATE phash_dhash SET dhashv=NULL;  -- same statement the stuck transaction is applying
SET sql_log_bin = 1;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;  -- then skip the (now redundant) stuck transaction
START SLAVE;

Other options if this does not work:

  • Skipping the transaction without applying it. This would likely leave the replica inconsistent: I checked the primary binlog, and the same table is used by other queries after the one that is currently stuck.
  • Rebuilding the replica from scratch. We have to do that anyway to upgrade from Bullseye to Bookworm.

Both STOP SLAVE and systemctl stop mariadb were hanging, so I had to use kill -9 :/

The manual UPDATE query is currently running. I expect it to take a while, but less than 30 minutes, because it was not logged in the slow query log of the primary (where we log all queries taking longer than 30 minutes).
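
That 30-minute threshold corresponds to the primary's slow-query-log settings, roughly like this (values assumed for illustration, not taken from the actual config):

-- Hypothetical equivalent of the primary's slow-log configuration
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1800;  -- seconds: only log queries slower than 30 minutes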

My theory is that running the query manually is much faster than replicating it from the primary: we use row-based replication (RBR), which translates that single statement into millions of row events in the binary log, and the replica has to apply each of those events individually, which is much slower.
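
A quick way to confirm a cluster is logging row events rather than statements (a hypothetical check; the task only states that RBR is in use):

-- 'ROW' means every modified row becomes its own binlog event, so one
-- UPDATE touching 73 million rows is replayed as millions of row changes.
SHOW GLOBAL VARIABLES LIKE 'binlog_format';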

My theory was apparently correct: the UPDATE query (UPDATE phash_dhash SET dhashv=NULL;) completed in just 11 minutes.

SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; did not work:

MariaDB [s55462__imagehashpublic_p]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter can not be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position
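
As the error message suggests, the way forward is to move @@gtid_slave_pos past the stuck transaction. A hypothetical look at the current position before changing it:

SELECT @@GLOBAL.gtid_slave_pos;  -- comma-separated per-domain positions, each formatted domain-server_id-seq_no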

I had to check the next GTID in the primary binlog:

root@tools-db-1:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position 21942800 log.097083 |less

And set it in the replica:

MariaDB [s55462__imagehashpublic_p]> SET GLOBAL gtid_slave_pos="0-2886731673-33522724637,2886731301-2886731301-6672708315,2886731673-2886731673-4887243158";
MariaDB [s55462__imagehashpublic_p]> START SLAVE;
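
A hypothetical sanity check that the skip worked after restarting the slave threads:

SHOW SLAVE STATUS\G              -- Slave_SQL_Running: Yes; Seconds_Behind_Master should start shrinking
SELECT @@GLOBAL.gtid_slave_pos;  -- should now advance past the manually set position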

Replication has now resumed.

Replication is finally back in sync.

(Attached screenshot: Screenshot 2024-06-28 at 10.14.21.png, 90 KB)

I am resolving this task. I have added some new ideas to the parent task T357624.