
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-06
Closed, Resolved · Public

Description

The replica is lagging because it's taking a long time to replicate this transaction:

MariaDB [(none)]> SHOW ENGINE INNODB STATUS\G

[...]
 
---TRANSACTION 150059057945, ACTIVE 12886 sec fetching rows
mysql tables in use 1, locked 1
9598 lock struct(s), heap size 1089656, 1077588 row lock(s), undo log entries 7064
MariaDB thread id 1475843, OS thread handle 139836453676736, query id 1244989745 Delete_rows_log_event::find_row(-1) on table `qid_rank`
DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000

It's the usual situation where a query deletes many records (100,000 in this case), and row-based replication translates that query into multiple Delete_rows_log_event events.

The qid_rank table does not have an index on the qid column, which makes it even slower.
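
A quick way to confirm both the lag and the missing index on the replica is something along these lines (a rough sketch, using the database and table names from this task):

SHOW REPLICA STATUS\G                          -- Seconds_Behind_Master shows the lag
SHOW INDEX FROM s56470__misssearch.qid_rank;   -- expect no entry for qid here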

Event Timeline

fnegri changed the task status from Open to In Progress.
fnegri triaged this task as High priority.

@Naorleizer your miss-search tool is causing some performance issues for ToolsDB. The problematic query was:

DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000;

I will fix the issue, but I would recommend adding an index on the qid column in the qid_rank table, especially if you plan to do mass deletes from that table in the future.
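
For reference, the index could be added with something like this (the index name idx_qid is just a suggestion):

-- adding the index will take a while on a large table, but deletes ordered by qid stop scanning the whole table
ALTER TABLE qid_rank ADD INDEX idx_qid (qid);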

I'm going to stop replication, manually run the DELETE query on the replica host, and then restart replication.

Plan:

STOP REPLICA;                                          -- stop both replication threads
SET sql_log_bin = 0;                                   -- don't write the manual fix to this replica's own binlog
USE s56470__misssearch;
DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000;   -- apply the stuck statement by hand
SET GLOBAL sql_slave_skip_counter = 1;                 -- skip the already-applied transaction when replication resumes
SET sql_log_bin = 1;                                   -- re-enable binary logging for this session
START REPLICA;

This will take a while, because STOP REPLICA has to roll back the partially applied transaction by playing back its undo log.
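
While that runs, the open transaction can be watched with something like this (a rough sketch; the thread id is the one from the status output above):

-- trx_state shows ROLLING BACK while the undo log is being applied
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_rows_modified
FROM information_schema.INNODB_TRX
WHERE trx_mysql_thread_id = 1475843;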

STOP REPLICA is not enough, because it waits for all active transactions to complete:

STOP REPLICA waits until any current replication event group affecting one or more non-transactional tables has finished executing (if there is any such replication group), or until the user issues a KILL QUERY or KILL CONNECTION statement.
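
The replica SQL thread to kill can be spotted in the process list (replication threads run as 'system user'), for example:

-- the id matches the thread id in SHOW ENGINE INNODB STATUS (1475843 here)
SELECT id, command, time, state
FROM information_schema.PROCESSLIST
WHERE user = 'system user';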

I tried KILL CONNECTION on the replica thread that is running the long transaction (KILL 1475843;). That did something, because the ACTIVE counter in SHOW ENGINE INNODB STATUS was reset:

---TRANSACTION 150059057949, ACTIVE 1768 sec fetching rows
mysql tables in use 1, locked 1
9598 lock struct(s), heap size 1089656, 1077588 row lock(s), undo log entries 962
MariaDB thread id 1475843, OS thread handle 139836453676736, query id 1244995702 Delete_rows_log_event::find_row(-1) on table `qid_rank`
DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000

But I'm not sure if it actually did what I wanted, which was to abort and roll back that transaction.

My understanding is that after my KILL, the replication thread just retried replicating the same transaction. I'm not sure if there's a way to properly stop replication in this situation. Related MariaDB issue: MDEV-31671.

I will proceed as I did last month for T393766: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-05-09, shutting down MariaDB completely on the replica, and using kill -9 if systemctl stop mariadb is not enough.

systemctl stop mariadb was stuck, so after 20 minutes I killed the process with kill -9.

Even after restarting MariaDB, STOP REPLICA was hanging, so I had to use KILL on the replication thread. That did not work on the first attempt, but issuing KILL on the same thread a second time did stop replication.

After that I could apply the plan as detailed above:

STOP REPLICA;
SET sql_log_bin = 0;
USE s56470__misssearch;
DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000;
SET GLOBAL sql_slave_skip_counter = 1;
SET sql_log_bin = 1;
START REPLICA;

After that, replication resumed, but it got stuck again a few seconds later on a new transaction with the exact same query (DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000;).

I think these queries were an attempt by the tool maintainer to empty that table incrementally. This is what I could find in the binlogs of the primary server:

root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose log.04505* |grep "DELETE FROM qid_rank"
#Q> DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000
#Q> DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000
#Q> DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000
#Q> DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000
#Q> DELETE FROM qid_rank ORDER BY qid DESC LIMIT 100000
#Q> DELETE FROM qid_rank

I had to repeat the plan above 6 times in total, once for each of the 6 queries, and replication finally seems to be flowing again.
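
To double-check, something like this on the replica should now show no lag:

-- Slave_IO_Running and Slave_SQL_Running should be Yes, Seconds_Behind_Master should be 0
SHOW REPLICA STATUS\G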

fnegri moved this task from Backlog to Done on the cloud-services-team (FY2024/2025-Q3-Q4) board.

Replication is back in sync!

Screenshot 2025-06-06 at 17.56.14.png (82 KB)