
[toolsdb] clouddb1002 stopped replicating from clouddb1001 (again)
Closed, Resolved · Public

Description

Similar to what happened in T322039 but this time I don't see any error in the logs. It started on Dec 25, 18:00 UTC.

Screenshot 2023-01-04 at 19.20.32.png (1×2 px, 195 KB)

This replica instance had other issues in the past and is probably not 100% consistent. We're currently planning a migration of both the primary and the replica in T301949, but in the meantime it would still be nice to get the replication working again.

SHOW SLAVE STATUS shows that replication is stuck at relay log position 8722460 (Relay_Log_Pos) and Seconds_Behind_Master is growing steadily.

MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: clouddb1001.clouddb-services.eqiad1.wikimedia.cloud
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: log.308899
          Read_Master_Log_Pos: 66943708
               Relay_Log_File: clouddb1002-relay-bin.005765
                Relay_Log_Pos: 8722460
        Relay_Master_Log_File: log.307637
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%,s54518\_\_mw.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 8722178
              Relay_Log_Space: 143508734485
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 864619
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2886731673
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-2886731673-33522724637,2886731673-2886731673-3277343457
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
1 row in set (0.00 sec)

Event Timeline

fnegri triaged this task as High priority. Jan 4 2023, 6:27 PM
MariaDB [(none)]> SHOW FULL PROCESSLIST;
+-------+-----------------+--------------------+------+-------------+---------+-----------------------------------------------------------------------+-----------------------+----------+
| Id    | User            | Host               | db   | Command     | Time    | State                                                                 | Info                  | Progress |
+-------+-----------------+--------------------+------+-------------+---------+-----------------------------------------------------------------------+-----------------------+----------+
|     2 | event_scheduler | localhost          | NULL | Daemon      | 2460626 | Waiting on empty queue                                                | NULL                  |    0.000 |
|     4 | system user     |                    | NULL | Connect     | 2460626 | Waiting for master to send event                                      | NULL                  |    0.000 |
|     5 | system user     |                    | NULL | Connect     |       3 | Delete_rows_log_event::find_row(-1)                                   | NULL                  |    0.000 |
| 37254 | repl            | 172.16.4.121:56664 | NULL | Binlog Dump | 1335221 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL                  |    0.000 |
| 81426 | root            | localhost          | NULL | Query       |       0 | init                                                                  | SHOW FULL PROCESSLIST |    0.000 |
+-------+-----------------+--------------------+------+-------------+---------+-----------------------------------------------------------------------+-----------------------+----------+
5 rows in set (0.00 sec)

I tried restarting mariadb on the replica, which didn't help. After the restart:

MariaDB [(none)]> SHOW PROCESSLIST;
+----+-----------------+--------------------+------+-------------+------+-----------------------------------------------------------------------+------------------+----------+
| Id | User            | Host               | db   | Command     | Time | State                                                                 | Info             | Progress |
+----+-----------------+--------------------+------+-------------+------+-----------------------------------------------------------------------+------------------+----------+
|  2 | event_scheduler | localhost          | NULL | Daemon      | 1157 | Waiting on empty queue                                                | NULL             |    0.000 |
|  4 | system user     |                    | NULL | Connect     | 1152 | Queueing master event to the relay log                                | NULL             |    0.000 |
|  5 | system user     |                    | NULL | Connect     |    0 | Delete_rows_log_event::find_row(-1)                                   | NULL             |    0.000 |
|  9 | repl            | 172.16.4.121:56668 | NULL | Binlog Dump | 1118 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL             |    0.000 |
| 13 | root            | localhost          | NULL | Query       |    0 | init                                                                  | SHOW PROCESSLIST |    0.000 |
+----+-----------------+--------------------+------+-------------+------+-----------------------------------------------------------------------+------------------+----------+
5 rows in set (0.00 sec)

According to MDEV-15350, Delete_rows_log_event::find_row(-1) could indicate an issue with a DELETE on a table without a primary key.
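
If it's not obvious which table is affected, one way to list tables with no primary key is a query like the following (just a sketch; on ToolsDB, with its many user databases, you would probably want to scope it to a single schema, as a full information_schema scan can be slow):

MariaDB [(none)]> SELECT t.table_schema, t.table_name
      FROM information_schema.tables t
      LEFT JOIN information_schema.table_constraints c
        ON c.table_schema = t.table_schema
       AND c.table_name = t.table_name
       AND c.constraint_type = 'PRIMARY KEY'
     WHERE t.table_type = 'BASE TABLE'
       AND c.constraint_name IS NULL
       AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');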

System load has been stuck at 1 since the replication stopped:

Screenshot 2023-01-05 at 11.39.04.png (1×1 px, 214 KB)

Replication hasn't actually stopped: the SQL thread is busy executing a single transaction on a big table:

MariaDB [(none)]> select count(*) from s52481__stats_global.requests;
+----------+
| count(*) |
+----------+
|  3989342 |
+----------+
1 row in set (2.83 sec)

And the events in that transaction are in fact row deletes like:

### DELETE FROM `s52481__stats_global`.`requests`
### WHERE
###   @1=2022-02-12 06:48:20 /* DATETIME meta=0 nullable=1 is_null=0 */
###   @2='en' /* VARSTRING(40) meta=40 nullable=1 is_null=0 */
###   @3='35db8b91' /* VARSTRING(512) meta=512 nullable=1 is_null=0 */
###   @4='139143de' /* VARSTRING(512) meta=512 nullable=1 is_null=0 */
###   @5='https://citationhunt.toolforge.org/en?id=35db8b91&cat=139143de' /* VARSTRING(3072) meta=3072 nullable=1 is_null=0 */
###   @6=1 /* TINYINT meta=0 nullable=1 is_null=0 */
###   @7=200 /* INT meta=0 nullable=1 is_null=0 */
###   @8='https://citationhunt.toolforge.org/en?id=1012da00&cat=139143de' /* VARSTRING(512) meta=512 nullable=1 is_null=0 */
###   @9=NULL /* VARSTRING(512) meta=512 nullable=1 is_null=1 */
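
For reference, decoded row events like the ones above can be produced with something like the following (the binlog directory is an assumption; the file name and start position come from Relay_Master_Log_File and Exec_Master_Log_Pos above):

root@clouddb1001:~# mysqlbinlog -vv --base64-output=DECODE-ROWS --start-position=8722178 /srv/labsdb/binlogs/log.307637 | less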

That table in fact doesn't have a primary key, which makes applying these row-based deletes a lot slower: the replica has to search the whole table to find each row to delete.

I discussed this with @Marostegui: it's a huge transaction that deleted about 3 million rows from that table.

On the primary:

MariaDB [(none)]> select count(*) from s52481__stats_global.requests;
+----------+
| count(*) |
+----------+
|  1090787 |
+----------+
1 row in set (0.53 sec)

Maybe the replica will eventually catch up, but it's been stuck for more than a week. I'm tempted to just skip that transaction, accepting that the replica will become inconsistent. Other tables in the replica are already inconsistent, and we'll have to recreate the replica anyway, see T301949.
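
For reference, skipping the stuck transaction would look roughly like this (just a sketch, not something I ran; with Using_Gtid: Slave_Pos, advancing @@global.gtid_slave_pos past the offending transaction is another option, and which one applies here would need checking against the MariaDB docs):

MariaDB [(none)]> STOP SLAVE;
MariaDB [(none)]> SET GLOBAL sql_slave_skip_counter = 1;  -- skips the next event group
MariaDB [(none)]> START SLAVE;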

fnegri changed the task status from Open to In Progress. Jan 5 2023, 9:06 PM
fnegri added a subscriber: Ladsgroup.

With some great help from @Ladsgroup I decided on a different approach: I first tried truncating the s52481__stats_global.requests table on the replica, to see if replication would resume, but it failed with:

Could not execute Delete_rows_v1 event on table s52481__stats_global.requests; Can't find record in 'requests', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log log.307637, end_log_pos 8723300

I then decided to add s52481\_\_stats_global.requests to the list of tables excluded from replication (Replicate_Wild_Ignore_Table). With that change replication resumed successfully and Seconds_Behind_Master is now decreasing, hurray! It will take several hours to catch up.
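
Applied dynamically on the replica, the change looks roughly like this (a sketch: MariaDB allows setting replicate_wild_ignore_table with SET GLOBAL while the slave SQL thread is stopped; the exact pattern and the persistent value in the server config are what the Puppet change below takes care of):

MariaDB [(none)]> STOP SLAVE SQL_THREAD;
MariaDB [(none)]> SET GLOBAL replicate_wild_ignore_table = 's51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%,s54518\_\_mw.%,s52481\_\_stats_global.requests';
MariaDB [(none)]> START SLAVE SQL_THREAD;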

I also tried creating a copy of the table's .ibd file from the primary, so that I could later copy it to the replica and remove the table from Replicate_Wild_Ignore_Table. To generate a consistent .ibd file I tried the following:

root@clouddb1001:~# mariabackup --backup --target-dir=/srv/labsdb/backup_T326261_test --user root --databases=s52481__stats_global --tables=requests

But it was taking longer than I expected, and I stopped it with Ctrl+C. While it was running I also realized that restoring the .ibd file generated by mariabackup on the replica would probably not work, because it would be a snapshot taken at a specific point in time, while after removing the table from Replicate_Wild_Ignore_Table replication would resume from a different position, so the table would immediately drift out of sync again.
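
As a note for the future, a single table can also be copied with InnoDB transportable tablespaces instead of mariabackup; a rough sketch (the datadir path, the transfer step and the replica host name are assumptions, and the same point-in-time caveat applies):

-- on the primary: quiesce the table so the .ibd is consistent, then copy it (and the generated .cfg) over
MariaDB [s52481__stats_global]> FLUSH TABLES requests FOR EXPORT;
root@clouddb1001:~# scp /srv/labsdb/data/s52481__stats_global/requests.{ibd,cfg} clouddb1002.clouddb-services.eqiad1.wikimedia.cloud:/tmp/
MariaDB [s52481__stats_global]> UNLOCK TABLES;
-- on the replica: swap the copied tablespace in
MariaDB [s52481__stats_global]> ALTER TABLE requests DISCARD TABLESPACE;
root@clouddb1002:~# cp /tmp/requests.{ibd,cfg} /srv/labsdb/data/s52481__stats_global/
root@clouddb1002:~# chown mysql:mysql /srv/labsdb/data/s52481__stats_global/requests.*
MariaDB [s52481__stats_global]> ALTER TABLE requests IMPORT TABLESPACE;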

For now, I think it's good enough to keep that table in the Replicate_Wild_Ignore_Table exclusion list; we already have other tables that were added there for similar problems.

Change 876011 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] ToolsDB: stop replicating a big problematic table

https://gerrit.wikimedia.org/r/876011

Seconds_Behind_Master is slowly decreasing:

Screenshot 2023-01-06 at 11.30.07.png (1×2 px, 209 KB)
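
A quick way to keep an eye on it from a shell on the replica, assuming the root socket login used in the sessions above:

root@clouddb1002:~# watch -n 60 "mysql -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master"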

I've just set innodb_flush_log_at_trx_commit=2 on the replica; I expect this will reduce the time it takes to catch up with the primary.
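
For the record, this is a dynamic global variable, so no restart is needed; with value 2 the InnoDB redo log is flushed to disk roughly once per second instead of at every commit, trading a bit of durability for replay speed:

MariaDB [(none)]> SET GLOBAL innodb_flush_log_at_trx_commit = 2;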

The replica is back in sync! 🎉

Screenshot 2023-01-07 at 16.45.04.png (1×2 px, 233 KB)

I have reverted innodb_flush_log_at_trx_commit to 1.

I will only resolve this task after this patch is merged.

Change 876011 merged by FNegri:

[operations/puppet@production] ToolsDB: stop replicating a big problematic table

https://gerrit.wikimedia.org/r/876011