It has been a while since the last time this happened (see the parent task T357624 for a list).
The replica has been lagging frequently during the past week.
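(For reference, the lag can be checked directly on the replica with the standard SHOW SLAVE STATUS command; this is a generic check, not output captured from this incident.)

```
# Run on the replica: Seconds_Behind_Master is the current lag in seconds,
# Slave_SQL_Running_State shows what the SQL thread is doing right now.
mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Seconds_Behind_Master|Slave_SQL_Running_State'
```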
I believe some of the spikes were caused by repeated crashes of the primary (tracked in T385900: [toolsdb] mariadb crashing repeatedly (innodb_fatal_semaphore_wait_threshold)).
But today the replica is stuck replicating a big transaction, as we have seen many times in the past:
#250212 6:51:25 server id 2886729896 end_log_pos 1370378 CRC32 0x98d8bb5e Annotate_rows:
#Q> DELETE FROM `store_edit` WHERE (`store_edit`.`batch_id` = 1867223 AND `store_edit`.`newrevid` < 2075780282)
#250212 6:51:25 server id 2886729896 end_log_pos 1370475 CRC32 0xfab66eac Table_map: `s53685__editgroups`.`store_edit` mapped to number 1710992
This query was not logged in the slow query log of the primary, where we log all queries that take more than 30 minutes to complete. But as we have seen in the past, with row-based replication this kind of query can take much longer on the replica than it took on the primary.
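(For reference, a rough sketch of how the row events of the stuck transaction can be decoded on the replica; the relay log file name and position below are placeholders, not the actual values from this incident.)

```
# Decode the relay log around the position reported by SHOW SLAVE STATUS
# (Relay_Log_File / Relay_Log_Pos). DECODE-ROWS prints the individual row
# changes that a single DELETE on the primary expands into with row-based
# replication. File name and position are placeholders.
mysqlbinlog --verbose --base64-output=DECODE-ROWS \
    --start-position=1370378 relay-bin.000123 | head -n 60
```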
Update 2025-02-13: the query above completed after a few hours, but replication is now stuck on a different query, on a different table (see comments for details).
Update 2025-02-17: replication keeps getting stuck on different queries. I discovered that capturing a backtrace with gdb has the side effect of getting it unstuck (see comments for details).
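(For reference, a backtrace can be captured along these lines; the process name and output path here are assumptions, adjust them as needed.)

```
# Attach gdb to the running MariaDB server and dump backtraces of all threads.
# "mariadbd" is the usual process name on recent MariaDB versions; older
# installs may still run as "mysqld". The output path is just an example.
gdb --batch -p "$(pidof mariadbd)" -ex 'thread apply all bt' > /tmp/mariadb-backtrace.txt
```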
I can easily recreate the replica from scratch, but I would really like to understand what's going on.
Side note: we should really migrate the s53685__editgroups database to a dedicated Trove instance, as it's currently the third-largest database on ToolsDB, using about 180GB. See also: T291782: Migrate largest ToolsDB users to Trove.
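(For reference, per-database sizes can be estimated with a generic information_schema query like the one below; this is not necessarily how the 180GB figure was obtained.)

```
# Estimate per-database size (data + indexes) to find the largest ToolsDB users.
mysql -e "
  SELECT table_schema,
         ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gib
  FROM information_schema.tables
  GROUP BY table_schema
  ORDER BY size_gib DESC
  LIMIT 10;"
```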