
[toolsdb] ToolsDB replication is broken on tools-db-2 (errno 1032) - 2023-08-17
Closed, Resolved (Public)

Description

From alertmanager:

ToolsToolsDBReplication
1
summary: ToolsDB replication is broken on tools-db-2 (errno 1032)
9 hours ago
instance: tools-db-2
job: toolsdb-mariadb
master_host: tools-db-1.tools.eqiad1.wikimedia.cloud
crit
team: wmcs
@cluster: wmcloud.org
@receiver: metricsinfra_cloud-feed
runbook

Looking


Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

So replication is failing while trying to delete rows from a specific table:

dcaro@urcuchillay$ wm-ssh tools-db-2.tools.eqiad1.wikimedia.cloud
...
dcaro@tools-db-2:~$ sudo mariadb
...
MariaDB [(none)]> SHOW SLAVE STATUS\G
...
Last_Error: Could not execute Delete_rows_v1 event on table s51698__yetkin.wanted_items; Can't find record in 'wanted_items', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log log.029381, end_log_pos 81558548
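
For reference, a rough sketch of pulling the affected table, error code and binlog position out of that Last_Error string from a script. The regex is an assumption based only on the single message above (other error types may be worded differently), and `parse_last_error` is a hypothetical helper name:

```python
import re

# Assumption: the pattern is derived from the one Last_Error message in this
# task; it is not a general parser for every replication error.
LAST_ERROR_RE = re.compile(
    r"Could not execute (?P<event>\w+) event on table "
    r"(?P<db>[^.\s]+)\.(?P<table>[^;\s]+);.*?"
    r"Error_code: (?P<errno>\d+);.*?"
    r"master log (?P<log_file>\S+), end_log_pos (?P<log_pos>\d+)"
)


def parse_last_error(text):
    """Return event type, db, table, errno and binlog position, or None."""
    match = LAST_ERROR_RE.search(text)
    if match is None:
        return None
    info = match.groupdict()
    info["errno"] = int(info["errno"])
    info["log_pos"] = int(info["log_pos"])
    return info


sample = (
    "Could not execute Delete_rows_v1 event on table "
    "s51698__yetkin.wanted_items; Can't find record in 'wanted_items', "
    "Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's "
    "master log log.029381, end_log_pos 81558548"
)
print(parse_last_error(sample))
```

Having the database and table as separate fields makes it easier to run the row-count comparison on both hosts.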

Looking at that table on the replica, it is indeed empty:

MariaDB [s51698__yetkin]> select count(*) from wanted_items;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.004 sec)

The primary still has many left:

MariaDB [s51698__yetkin]> select count(*) from wanted_items;
+----------+
| count(*) |
+----------+
|  1119210 |
+----------+
1 row in set (0.000 sec)

Hmm... this makes me think that we somehow missed the data on the replica, and even if we skip this transaction, replication will probably break again as soon as another delete hits that table :/

I'll try skipping that transaction, but we might want to rebuild the replica from the primary.

I added that table to the skip list (on the replica):

MariaDB [mysql]> set global Replicate_Wild_Ignore_Table='s51698\_\_yetkin.wanted\_items';

and started the slave:

MariaDB [mysql]> start slave;

and things seem to be moving forward:

MariaDB [mysql]> show slave status \G
*************************** 1. row ***************************
                Slave_IO_State: Queueing master event to the relay log
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
...
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
         Seconds_Behind_Master: 37828  <- this is going down
    Slave_Transactional_Groups: 810393803
...
1 row in set (0.001 sec)
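
A minimal sketch of checking the same fields from a script, assuming the vertical-format output has been captured as text. Both function names are hypothetical, and the sample only contains the lines shown above (the elided lines and the inline annotation are left out):

```python
def parse_vertical_status(text):
    """Parse vertical-format SHOW SLAVE STATUS output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Skip the "*** 1. row ***" separator and anything without a key.
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replication_healthy(fields):
    """True only when both the IO and SQL threads report 'Yes'."""
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
    )


sample = """\
*************************** 1. row ***************************
                Slave_IO_State: Queueing master event to the relay log
                   Master_Host: tools-db-1.tools.eqiad1.wikimedia.cloud
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
         Seconds_Behind_Master: 37828
"""

fields = parse_vertical_status(sample)
# Seconds_Behind_Master can be NULL while the SQL thread is stopped, so only
# convert it to int once the threads are known to be running.
if replication_healthy(fields):
    print("lag (s):", int(fields["Seconds_Behind_Master"]))
```

Running this a few times in a row is enough to confirm that the lag is actually shrinking.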

Will start working on repopulating the table.
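
A note on the escaping in the Replicate_Wild_Ignore_Table value above: the setting takes LIKE-style patterns, so `_` and `%` are wildcards and need a backslash to match literally. A minimal sketch of building such a pattern (the helper names are hypothetical, and it only handles those two wildcard characters):

```python
def escape_wild(name):
    """Backslash-escape the LIKE-style wildcards % and _ in a name."""
    return name.replace("%", r"\%").replace("_", r"\_")


def wild_ignore_pattern(db, table):
    """Build a db.table pattern that matches exactly one table."""
    return f"{escape_wild(db)}.{escape_wild(table)}"


print(wild_ignore_pattern("s51698__yetkin", "wanted_items"))
```

Without the escaping, the pattern would also match any other table whose name differs only in the positions of the underscores.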

Change 949854 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolsdb: add skipped table to the config

https://gerrit.wikimedia.org/r/949854

Change 949854 merged by FNegri:

[operations/puppet@production] toolsdb: add skipped table to the config

https://gerrit.wikimedia.org/r/949854

I think this is the perfect motivation to test the procedure to create a new replica: T344717.


I think this is already sorted out, please reopen if I forgot something.

dcaro moved this task from Doing to Done on the User-dcaro board.