
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-07-13
Closed, ResolvedPublic

Description

The replica tools-db-2 is currently lagging 21 hours behind the primary tools-db-1 (Grafana chart). This is (correctly) triggering the alert ToolsToolsDBReplicationLagIsTooHigh.

SHOW SLAVE STATUS\G on the replica shows that replication is active, but it is taking hours to process a single transaction. Notice how Slave_IO_Running: Yes and Slave_SQL_Running: Yes, yet Exec_Master_Log_Pos (the position in the primary's binlog up to which transactions have been applied on the replica) is not moving.

This happened before (T338031), and Slave_SQL_Running_State: Delete_rows_log_event::find_row(-1) points to a similar problem: a big DELETE query on the primary is taking hours to apply on the replica.

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: log.023109
           Read_Master_Log_Pos: 28120756
                Relay_Log_File: tools-db-2-relay-bin.027178
                 Relay_Log_Pos: 18624967
         Relay_Master_Log_File: log.023018
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 18624674
               Relay_Log_Space: 9771520530
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 91203
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 2886731301
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-1338530527
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Delete_rows_log_event::find_row(-1)
              Slave_DDL_Groups: 432265
Slave_Non_Transactional_Groups: 156747552
    Slave_Transactional_Groups: 409816287
1 row in set (0.000 sec)
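
A quick cross-check (a sketch, not part of the original report): the same information is visible in the replica's own processlist, where the replication threads run as 'system user' and the STATE column mirrors Slave_SQL_Running_State.

-- Run on the replica (tools-db-2). The 'system user' rows are the
-- replication I/O and SQL threads; a large TIME on the SQL thread with a
-- state like Delete_rows_log_event::find_row(-1) confirms it is busy
-- applying one huge transaction rather than stopped.
SELECT ID, USER, TIME, STATE, INFO
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE USER = 'system user';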

Event Timeline

fnegri changed the task status from Open to In Progress. Jul 14 2023, 4:19 PM
fnegri triaged this task as High priority.
fnegri moved this task from Backlog to ToolsDB on the Data-Services board.
fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.

I found the query that is taking hours to replicate by reading the binlog on the primary, using the file name and position given by Relay_Master_Log_File and Exec_Master_Log_Pos in the output posted above:

root@tools-db-1:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position 18624674 log.023018 |less

# at 18624674
#230713 14:46:23 server id 2886731301  end_log_pos 18624716 CRC32 0xb919d15a    GTID 2886731301-2886731301-1326133803 trans
/*!100101 SET @@session.skip_parallel_replication=0*//*!*/;
/*!100001 SET @@session.gtid_domain_id=2886731301*//*!*/;
/*!100001 SET @@session.server_id=2886731301*//*!*/;
/*!100001 SET @@session.gtid_seq_no=1326133803*//*!*/;
START TRANSACTION
/*!*/;
# at 18624716
# at 18629144
#230713 14:46:23 server id 2886731301  end_log_pos 18629144 CRC32 0x968f2bf1    Annotate_rows:
#Q> # 11min
#Q> DELETE FROM cnd_tmp WHERE catalog IN
#Q> (1,2,6, ...... [...omitting long list of ids...]
#230713 14:46:23 server id 2886731301  end_log_pos 18629216 CRC32 0xd2bdfb00    Table_map: `s51434__mixnmatch_p`.`cnd_tmp` mapped to number 95109516
# at 18629216
# at 18637403

[...omitting thousands of similar rows...]

# at 102011075
#230713 14:46:23 server id 2886731301  end_log_pos 18637403 CRC32 0x5503acae    Delete_rows: table id 95109516

[...omitting thousands of similar rows...]

#230713 14:46:23 server id 2886731301  end_log_pos 102018521 CRC32 0xe9130088   Delete_rows: table id 95109516 flags: STMT_END_F
### DELETE FROM `s51434__mixnmatch_p`.`cnd_tmp`
### WHERE
###   @1=57559613
###   @2='-0134--0086'
###   @3='Midi Jin'
###   @4=1640

[...omitting thousands of similar DELETE blocks...]

# Number of rows: 2085486
# at 102018521
#230713 14:46:23 server id 2886731301  end_log_pos 102018552 CRC32 0xedbcbfba   Xid = 12979510678
COMMIT/*!*/;
# at 102018552
#230713 14:46:31 server id 2886731301  end_log_pos 102018596 CRC32 0x721c4d2c   GTID 2886731301-2886731301-1326133804 cid=12979521725 trans
/*!100001 SET @@session.gtid_seq_no=1326133804*//*!*/;

As the binlog above shows, the query is DELETE FROM cnd_tmp WHERE catalog IN (....) on the table s51434__mixnmatch_p.cnd_tmp, used by the Mix'n'match tool.

On the primary as well, there is a very long-running query currently active against that table (it looks like a big data import):

MariaDB [s51434__mixnmatch_p]> SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE DB='s51434__mixnmatch_p' AND TIME > 3000\G
*************************** 1. row ***************************
             ID: 406286017
           USER: s51434
           HOST: 172.16.1.142:41354
             DB: s51434__mixnmatch_p
        COMMAND: Query
           TIME: 63865
          STATE: Waiting for table metadata lock
           INFO: SELECT engine, table_type FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE() AND table_name = 'cnd_tmp'
        TIME_MS: 63865317.858
          STAGE: 0
      MAX_STAGE: 0
       PROGRESS: 0.000
    MEMORY_USED: 191888
MAX_MEMORY_USED: 411504
  EXAMINED_ROWS: 0
       QUERY_ID: 13013098119
    INFO_BINARY: SELECT engine, table_type FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE() AND table_name = 'cnd_tmp'
            TID: 3446681
*************************** 2. row ***************************
             ID: 401717440
           USER: s51434
           HOST: 172.16.6.95:38588
             DB: s51434__mixnmatch_p
        COMMAND: Query
           TIME: 89342
          STATE: Sending data
           INFO: CREATE TABLE cnd_tmp
SELECT entry_id,concat(year_born,'-',year_died) AS dates,ext_name,catalog
FROM person_dates USE INDEX(`is_matched`),entry USE INDEX(`q`)
WHERE year_born!='' AND year_died!='' AND is_matched=0
AND entry_id=entry.id AND (entry.q IS NULL OR entry.user=0)
#AND catalog IN (SELECT id FROM catalog WHERE active=1 AND has_person_date='yes' AND id NOT IN (70))
        TIME_MS: 89342083.799
          STAGE: 0
      MAX_STAGE: 0
       PROGRESS: 0.000
    MEMORY_USED: 380304
MAX_MEMORY_USED: 17308864
  EXAMINED_ROWS: 0
       QUERY_ID: 12985181073
    INFO_BINARY: CREATE TABLE cnd_tmp
SELECT entry_id,concat(year_born,'-',year_died) AS dates,ext_name,catalog
FROM person_dates USE INDEX(`is_matched`),entry USE INDEX(`q`)
WHERE year_born!='' AND year_died!='' AND is_matched=0
AND entry_id=entry.id AND (entry.q IS NULL OR entry.user=0)
#AND catalog IN (SELECT id FROM catalog WHERE active=1 AND has_person_date='yes' AND id NOT IN (70))
            TID: 62647
2 rows in set (0.003 sec)
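
Row 1 above is waiting for a table metadata lock, almost certainly held by the CREATE TABLE ... SELECT in row 2. If the metadata_lock_info plugin is loaded (an assumption, I have not verified it on ToolsDB), the lock holders can be listed directly:

-- Requires the metadata_lock_info plugin; without it this table does not
-- exist. Lists every session holding a metadata lock on cnd_tmp.
SELECT THREAD_ID, LOCK_MODE, LOCK_TYPE, TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.METADATA_LOCK_INFO
WHERE TABLE_SCHEMA = 's51434__mixnmatch_p' AND TABLE_NAME = 'cnd_tmp';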

Tagging @Magnus as the Mix'n'match maintainer. I believe all queries will eventually complete, but it looks like there are a few things that could be optimized. This Stack Overflow thread suggests that having an index on id would make a big difference (at least for replication). Splitting big deletes and imports into smaller chunks could be another improvement; a rough sketch follows.
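
To illustrate both ideas, a minimal sketch (not the tool's actual code; the column choice is an assumption): give cnd_tmp a key, which CREATE TABLE ... SELECT does not create, and delete in bounded batches so that each replicated transaction stays small.

-- Sketch only. Assumes entry_id is unique in cnd_tmp; if it is not, an
-- AUTO_INCREMENT id column would serve the same purpose. With a key, the
-- replica can locate each deleted row without a full table scan per row.
ALTER TABLE cnd_tmp ADD PRIMARY KEY (entry_id);

-- Delete in batches instead of one multi-million-row transaction, repeating
-- until ROW_COUNT() returns 0. The catalog list here is a placeholder.
DELETE FROM cnd_tmp
WHERE catalog IN (1, 2, 6)
ORDER BY entry_id
LIMIT 10000;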

fnegri renamed this task from [toolsdb] High replication lag caused by big DELETE query to [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-07-13. Jul 14 2023, 5:32 PM
fnegri updated the task description.

Replication lag is back to normal:

Screenshot 2023-07-17 at 10.59.16.png (616×1 px, 111 KB)