Page MenuHomePhabricator

[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-01-19
Closed, ResolvedPublic

Description

The replica tools-db-2 is currently lagging 18 hours behind the primary tools-db-1 (Grafana chart). This is (correctly) triggering the alert ToolsToolsDBReplicationLagIsTooHigh.

SHOW SLAVE STATUS\g in the replica shows the replication is active, but it's lag is increasing (little by little now, started suddenly).

This happened before (T343819, T341891 and T338031). Both times the replica caught up after a couple of days, without any intervention. I will check if the database causing this issue is the same or a new one.

Event Timeline

dcaro changed the task status from Open to In Progress.Jan 19 2024, 11:07 AM
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

@dcaro let me know if the runbook is not clear enough on how to find the database and query that is currently blocking the replication.

Got a cookbook running for it :)

the output:

dcaro@urcuchillay$ wmcs-cookbooks wmcs.toolforge.toolsdb.get_current_replica_transaction
Got matching cookbooks wmcs.toolforge.toolsdb.get_current_replica_transaction
START - Cookbook wmcs.toolforge.toolsdb.get_current_replica_transaction
Suspicious tables:
    `s55593__PAGES`.`link_map`
Suspicious queries:
    #Q> DELETE FROM link_map
Raw logs:
------------
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 28815896
#240118 21:03:07 server id 2886731301  end_log_pos 28815938 CRC32 0xc056258f    GTID 2886731301-2886731301-4304047644 trans
/*!100101 SET @@session.skip_parallel_replication=0*//*!*/;
/*!100001 SET @@session.gtid_domain_id=2886731301*//*!*/;
/*!100001 SET @@session.server_id=2886731301*//*!*/;
/*!100001 SET @@session.gtid_seq_no=4304047644*//*!*/;
START TRANSACTION
/*!*/;
# at 28815938
# at 28815981
#240118 21:03:07 server id 2886731301  end_log_pos 28815981 CRC32 0xad9a461c    Annotate_rows:
#Q> DELETE FROM link_map
#240118 21:03:07 server id 2886731301  end_log_pos 28816049 CRC32 0xd5a8a40a    Table_map: `s55593__PAGES`.`link_map` mapped to number 21006987
# at 28816049
# at 28824238
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
------------
END (PASS) - Cookbook wmcs.toolforge.toolsdb.get_current_replica_transaction (exit_code=0)

waiting for a delete on link_map

The query is now unstuck, the replica started processing other queries, should start picking up soon

dcaro moved this task from Doing to Done on the User-dcaro board.

The replica has picked up the slack already :)

image.png (1×2 px, 53 KB)