The replica tools-db-2 is currently lagging 21 hours behind the primary tools-db-1 (Grafana chart). This is (correctly) triggering the alert ToolsToolsDBReplicationLagIsTooHigh.
SHOW SLAVE STATUS\G on the replica shows that replication is active, but it's taking hours to process a single transaction. Note that Slave_IO_Running and Slave_SQL_Running are both Yes, but Exec_Master_Log_Pos (indicating the last transaction that was successfully replicated from the primary) is not moving.
This happened before (T338031), and Slave_SQL_Running_State: Delete_rows_log_event::find_row(-1) seems to indicate a similar problem, where a big DELETE query on the primary is taking hours to apply on the replica.
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: log.023109
Read_Master_Log_Pos: 28120756
Relay_Log_File: tools-db-2-relay-bin.027178
Relay_Log_Pos: 18624967
Relay_Master_Log_File: log.023018
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 18624674
Relay_Log_Space: 9771520530
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 91203
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2886731301
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-1338530527
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Delete_rows_log_event::find_row(-1)
Slave_DDL_Groups: 432265
Slave_Non_Transactional_Groups: 156747552
Slave_Transactional_Groups: 409816287
1 row in set (0.000 sec)
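
The check described above (both replication threads running while Exec_Master_Log_Pos stays put) can be automated by comparing two snapshots of the output taken a few minutes apart. The following is only a minimal sketch: the helper names parse_slave_status and sql_thread_stalled are hypothetical, and it assumes the vertical \G output has been saved as plain text.

```python
def parse_slave_status(text: str) -> dict[str, str]:
    """Parse the vertical output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "Delete_rows_log_event::find_row(-1)" are kept intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        # Skip the prompt line, the row banner and the "1 row in set" footer.
        if not sep or not key or " " in key:
            continue
        status[key] = value.strip()
    return status


def sql_thread_stalled(before: dict[str, str], after: dict[str, str]) -> bool:
    """Return True if both threads run but the executed position did not move.

    `before` and `after` are parsed snapshots taken some time apart
    (the interval is up to the caller; this is an assumption of the sketch).
    """
    running = (
        after.get("Slave_IO_Running") == "Yes"
        and after.get("Slave_SQL_Running") == "Yes"
    )
    same_position = (
        before.get("Relay_Master_Log_File") == after.get("Relay_Master_Log_File")
        and before.get("Exec_Master_Log_Pos") == after.get("Exec_Master_Log_Pos")
    )
    return running and same_position
```

Calling sql_thread_stalled on two parsed snapshots returns True in the situation shown in the output above, and False as soon as Exec_Master_Log_Pos or Relay_Master_Log_File advances.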