
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-06-02
Closed, Resolved · Public

Description

It seems that the replication lag between the toolsdb master and the replica is too high.

Looking at the replica, the slave process is running:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: log.014138
           Read_Master_Log_Pos: 42579366
                Relay_Log_File: tools-db-2-relay-bin.021825
                 Relay_Log_Pos: 16572178
         Relay_Master_Log_File: log.014015
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...

But the master says that it has sent everything already:

root@tools-db-1:~# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 221909594
Server version: 10.4.28-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW PROCESSLIST;
+-----------+-----------------+--------------------+---------------------------------+-------------+---------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id        | User            | Host               | db                              | Command     | Time    | State                                                                 | Info                                                                                                 | Progress |
+-----------+-----------------+--------------------+---------------------------------+-------------+---------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
...
|  84920738 | repl            | 172.16.4.121:34792 | NULL                            | Binlog Dump | 3378008 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL                                                                                                 |    0.000 |
...
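As a quick cross-check (a minimal sketch, not verbatim from this session), the primary's current binlog coordinates can be compared with the replica's Master_Log_File / Read_Master_Log_Pos above (log.014138 / 42579366): if they match, the I/O thread is fully caught up and the lag is entirely in the SQL (applier) thread.

-- On the primary (sketch): compare File/Position in the output with the
-- replica's Master_Log_File / Read_Master_Log_Pos shown above.
SHOW MASTER STATUS;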

Tried to start the slave (START SLAVE;), but it did nothing, so I restarted the mariadb process on the replica, and that seemed to re-bootstrap the slave:

root@tools-db-2:~# systemctl restart mariadb
root@tools-db-2:~# journalctl -u mariadb -n 10000 -f
...
Jun 02 13:15:56 tools-db-2 mysqld[411785]: 2023-06-02 13:15:56 11 [Note] Slave I/O thread: connected to master 'repl@tools-db-1.tools.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-759137700'

On the primary side:

MariaDB [(none)]> show processlist;
| 221945597 | repl                 | 172.16.4.121:34794 | NULL                            | Binlog Dump |     373 | Writing to net           | NULL                                                                                                 |    0.000 |

Let's see where that goes.
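To see whether it actually catches up, the lag can be polled on the replica; a minimal sketch, assuming the same mysql session on tools-db-2:

-- On the replica: Seconds_Behind_Master should trend towards 0 and
-- Relay_Master_Log_File should advance towards the primary's current
-- binlog (log.014138 at the time of the check above).
SHOW SLAVE STATUS\G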

Event Timeline

dcaro triaged this task as High priority. Jun 2 2023, 1:22 PM
dcaro created this task.

Replication is active, but it seems to be stuck on one (or more) transactions. SHOW PROCESSLIST shows this query being re-executed every 10 seconds or so:

| 141 | system user     |           | NULL | Slave_SQL |    2 | Delete_rows_log_event::find_row(-1) | -- delete FROM file_metadata where asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdv |    0.000 |
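Delete_rows_log_event::find_row means the SQL thread is locating each row of a row-based DELETE event in its copy of the table, so the applier is making progress, just slowly. The binlog coordinates it is working through can be read from the slave status (a sketch; the values come from the earlier output):

-- On the replica: Relay_Master_Log_File / Exec_Master_Log_Pos give the
-- primary-side binlog file and offset the SQL thread is currently applying
-- (log.014015 here), which is the file inspected on the primary below.
SHOW SLAVE STATUS\G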

I found on the primary the transaction that the replica is currently processing:

root@tools-db-1:/srv/labsdb/binlogs# mysqlbinlog --no-defaults --verbose log.014015 | less

[...]

#230601 18:14:33 server id 2886731301  end_log_pos 16571927 CRC32 0x264a9c65    GTID 2886731301-2886731301-759137701 trans
/*!100001 SET @@session.gtid_seq_no=759137701*//*!*/;
START TRANSACTION
/*!*/;
# at 16571927
# at 16572460
#230601 18:14:33 server id 2886731301  end_log_pos 16572460 CRC32 0xfb60c3df    Annotate_rows:
#Q> -- delete FROM file_metadata where asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdvofqluv5ei5kfu5u1pf8v4xqtl/%' ;
#Q> -- SELECT * FROM file_metadata m JOIN file_metadata_commons_file_names c WHERE c.file_metadata_id = m.id AND asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdvofqluv5ei5kfu5u1pf8v4xqtl/%' ;
#Q> DELETE from file_metadata_commons_file_names WHERE file_metadata_id IN (SELECT id FROM file_metadata WHERE phash = '6dp5qcb22im238nr3wvp0ic7q99w035jmy2iw7i6n43d37jtof
#230601 18:14:33 server id 2886731301  end_log_pos 16572552 CRC32 0xb73f861d    Table_map: `s54113__spacemedia`.`file_metadata_commons_file_names` mapped to number 21100963

If I'm understanding this correctly, it's a single DELETE statement that expands into thousands of row-level deletes. It might just take the replica a long time to process all of them...
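If that's the case, the size of the backlog can be roughly estimated by counting how many child rows the statement touches. A hedged sketch, reusing the join and the asset_url predicate from the annotated queries above (the phash value in the actual DELETE is truncated in the binlog excerpt, so it is not usable here):

-- Rough, illustrative estimate of how many row-level delete events the
-- single DELETE statement expands into under row-based replication.
SELECT COUNT(*)
FROM s54113__spacemedia.file_metadata_commons_file_names c
JOIN s54113__spacemedia.file_metadata m ON c.file_metadata_id = m.id
WHERE m.asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdvofqluv5ei5kfu5u1pf8v4xqtl/%';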

aaaaand... we're back on track :)

Screenshot 2023-06-04 at 19.35.22.png