
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-06-02
Closed, Resolved · Public

Description

It seems that the replication lag between the toolsdb master and the replica is too high.

Looking at the replica, the slave process is running:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: log.014138
           Read_Master_Log_Pos: 42579366
                Relay_Log_File: tools-db-2-relay-bin.021825
                 Relay_Log_Pos: 16572178
         Relay_Master_Log_File: log.014015
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
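Note that Master_Log_File (log.014138) is well ahead of Relay_Master_Log_File (log.014015): the I/O thread has received everything, but the SQL thread is still applying events from a much older binlog, so the lag is on the apply side. As a minimal sketch, these are the SHOW SLAVE STATUS fields to compare (standard output, nothing specific to this host):

-- Where the I/O thread has read to vs. where the SQL thread is executing.
-- A large gap between the two file numbers means the SQL thread is the
-- bottleneck, not the network or the master.
SHOW SLAVE STATUS\G
--   Master_Log_File / Read_Master_Log_Pos        -> I/O thread (received)
--   Relay_Master_Log_File / Exec_Master_Log_Pos  -> SQL thread (applied)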

But the master says that it has sent everything already:

root@tools-db-1:~# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 221909594
Server version: 10.4.28-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SHOW PROCESSLIST;
+-----------+-----------------+--------------------+---------------------------------+-------------+---------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id        | User            | Host               | db                              | Command     | Time    | State                                                                 | Info                                                                                                 | Progress |
+-----------+-----------------+--------------------+---------------------------------+-------------+---------+-----------------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
...
|  84920738 | repl            | 172.16.4.121:34792 | NULL                            | Binlog Dump | 3378008 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL                                                                                                 |    0.000 |
...
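To double-check that from the primary's side, one can compare its current binlog coordinates against the replica's Read_Master_Log_Pos (a sketch; both are standard MariaDB statements):

-- Current binlog file and write position on the primary. If these match
-- Master_Log_File / Read_Master_Log_Pos on the replica, the I/O thread
-- really has received everything.
SHOW MASTER STATUS;
-- All binlog files still on disk, with sizes:
SHOW BINARY LOGS;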

Tried starting the slave threads (START SLAVE;), but that did nothing, so I restarted the mariadb process on the replica, which seemed to re-bootstrap the slave:

root@tools-db-2:~# systemctl restart mariadb
root@tools-db-2:~# journalctl -u mariadb -n 10000 -f
...
Jun 02 13:15:56 tools-db-2 mysqld[411785]: 2023-06-02 13:15:56 11 [Note] Slave I/O thread: connected to master 'repl@tools-db-1.tools.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-759137700'
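The GTID position in that log line can be cross-checked on both sides; as the backlog drains, the replica's applied position should advance toward the primary's binlog position (a sketch using the standard MariaDB GTID variables):

-- On the replica: the GTID position applied so far.
SELECT @@global.gtid_slave_pos;
-- On the primary: the latest GTID written to the binlog.
SELECT @@global.gtid_binlog_pos;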

On the primary side:

MariaDB [(none)]> show processlist;
| 221945597 | repl                 | 172.16.4.121:34794 | NULL                            | Binlog Dump |     373 | Writing to net           | NULL                                                                                                 |    0.000 |

Let's see where that goes.

Event Timeline

dcaro triaged this task as High priority.

Replication is active, but it seems to be stuck on one (or more) transactions. SHOW PROCESSLIST shows this query being re-executed every 10 seconds or so:

| 141 | system user     |           | NULL | Slave_SQL |    2 | Delete_rows_log_event::find_row(-1) | -- delete FROM file_metadata where asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdv |    0.000 |
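Delete_rows_log_event::find_row is the SQL thread locating each row of a row-based delete. If the target table has no primary key or usable index, every lookup degenerates into a table scan, which is a classic cause of this kind of apply-side lag. A sketch of the check, using the table names that appear in this task:

-- Do the tables being deleted from have a primary key / usable index?
-- Without one, find_row() has to scan the table for every deleted row.
SHOW INDEX FROM s54113__spacemedia.file_metadata;
SHOW INDEX FROM s54113__spacemedia.file_metadata_commons_file_names;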

I found the transaction that the replica is currently processing in the primary's binlogs:

root@tools-db-1:/srv/labsdb/binlogs#  mysqlbinlog --no-defaults --verbose log.014015 |less

[...]

#230601 18:14:33 server id 2886731301  end_log_pos 16571927 CRC32 0x264a9c65    GTID 2886731301-2886731301-759137701 trans
/*!100001 SET @@session.gtid_seq_no=759137701*//*!*/;
START TRANSACTION
/*!*/;
# at 16571927
# at 16572460
#230601 18:14:33 server id 2886731301  end_log_pos 16572460 CRC32 0xfb60c3df    Annotate_rows:
#Q> -- delete FROM file_metadata where asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdvofqluv5ei5kfu5u1pf8v4xqtl/%' ;
#Q> -- SELECT * FROM file_metadata m JOIN file_metadata_commons_file_names c WHERE c.file_metadata_id = m.id AND asset_url LIKE 'https://nasa-external-ocomm.app.box.com/s/onrtmdvofqluv5ei5kfu5u1pf8v4xqtl/%' ;
#Q> DELETE from file_metadata_commons_file_names WHERE file_metadata_id IN (SELECT id FROM file_metadata WHERE phash = '6dp5qcb22im238nr3wvp0ic7q99w035jmy2iw7i6n43d37jtof
#230601 18:14:33 server id 2886731301  end_log_pos 16572552 CRC32 0xb73f861d    Table_map: `s54113__spacemedia`.`file_metadata_commons_file_names` mapped to number 21100963
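The same events can also be inspected from inside the server, without shelling out to mysqlbinlog (a sketch; the file name and start position are taken from the excerpt above):

-- Server-side view of the binlog around the stuck transaction.
SHOW BINLOG EVENTS IN 'log.014015' FROM 16571927 LIMIT 10;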

If I'm understanding this correctly, it's a single DELETE that, under row-based replication, expands into thousands of individual row deletes. It might just take a long time for the replica to apply them all...
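For reference, a rough reconstruction of the statement from the truncated annotation above (the phash literal is cut off in the binlog, so it is abbreviated here too):

-- A single statement on the primary; with ROW binlog format it is logged
-- row by row, and the replica's SQL thread has to locate and delete each
-- row individually (the find_row calls seen above).
DELETE FROM file_metadata_commons_file_names
WHERE file_metadata_id IN (
    SELECT id FROM file_metadata
    WHERE phash = '6dp5qcb22im238nr3wvp0ic7q99w035jmy2iw7i6n43d37jtof'  -- truncated in the binlog excerpt
);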

aaaaand... we're back on track :)

Screenshot 2023-06-04 at 19.35.22.png (112 KB)