
labsdb1004 replication broken for linkwatcher_linklog table
Closed, ResolvedPublic

Description

Replication on labsdb1004 has been broken for 24h (reported on Tendril).
It needs fixing.

181204 16:31:17 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table s51230__linkwatcher.linkwatcher_linklog; Can't find record in 'linkwatcher_linklog', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.235250, end_log_pos 515088053, Gtid 0-171975945-18029787259, Internal MariaDB error code: 1032
181204 16:34:10 [Warning] Slave: Can't find record in 'linkwatcher_linklog' Error_code: 1032
181204 16:34:10 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log.235250' position 74401248; GTID position '0-171975945-18029787258'
181204 16:34:10 [Note] Slave SQL thread exiting, replication stopped in log 'log.235250' at position 74401248; GTID position '0-171975945-18029787258'

Event Timeline

I can't find the corresponding event in the master binary log.

The output of SHOW SLAVE STATUS on labsdb1004 is the following:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: labsdb1005.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: log.235477
          Read_Master_Log_Pos: 54667211
               Relay_Log_File: labsdb1004-relay-bin.009814
                Relay_Log_Pos: 74401529
        Relay_Master_Log_File: log.235250
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%
                   Last_Errno: 1032
                   Last_Error: Could not execute Delete_rows_v1 event on table s51230__linkwatcher.linkwatcher_linklog; Can't find record in 'linkwatcher_linklog', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.235250, end_log_pos 515088053
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 74401248
              Relay_Log_Space: 30110616666
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Delete_rows_v1 event on table s51230__linkwatcher.linkwatcher_linklog; Can't find record in 'linkwatcher_linklog', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.235250, end_log_pos 515088053
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171975945
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171975945-18043035169
1 row in set (0.00 sec)

On labsdb1005:

MariaDB [(none)]> show binlog events in 'log.235250' FROM 74401248 LIMIT 1\G;
*************************** 1. row ***************************
   Log_name: log.235250
        Pos: 74401248
 Event_type: Gtid
  Server_id: 171975945
End_log_pos: 74401286
       Info: BEGIN GTID 0-171975945-18029787259
1 row in set (0.00 sec)

ERROR: No query specified

MariaDB [(none)]> quit
Bye
root@labsdb1005:~/work# mysqlbinlog log.235250 --start-position 74401248 --stop-position 74401286    -v --base64-output=DECODE-ROWS
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 74401248
#181204 13:28:55 server id 171975945  end_log_pos 74401286 	GTID 0-171975945-18029787259
/*!100001 SET @@session.gtid_domain_id=0*//*!*/;
/*!100001 SET @@session.server_id=171975945*//*!*/;
/*!100001 SET @@session.gtid_seq_no=18029787259*//*!*/;
BEGIN
/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

From what I can see it is part of a huge transaction. Just ignore that table in the replication filters, let it replicate until it is in sync again, and then reimport the table from the master.
We have a script for that; check reimport_from_master.sh in my home directory on labsdb1004.
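
Roughly, such a dump-and-reload could look like this (a sketch only; the actual reimport_from_master.sh may work differently):

# Sketch: dump the table from the master (labsdb1005) and reload it on the replica
# (labsdb1004) without writing the load to the replica's own binlog.
mysqldump -h labsdb1005.eqiad.wmnet --single-transaction \
    s51230__linkwatcher linkwatcher_linklog > linkwatcher_linklog.sql
( echo "SET SESSION sql_log_bin=0;"; cat linkwatcher_linklog.sql ) \
    | mysql -h labsdb1004.eqiad.wmnet s51230__linkwatcher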

I am ignoring the table with:

SET SESSION SQL_LOG_BIN=0; 
SET GLOBAL Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%,s51230\_\_linkwatcher.linkwatcher\_linklog';

and then start the slave.
Once it catches up I'll do the reimport.
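
For the record, starting the slave and checking that the filter is active would be something like this on labsdb1004 (a sketch):

# Sketch: start replication and verify the SQL thread and the new filter.
mysql -e "START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_SQL_Running|Replicate_Wild_Ignore_Table|Seconds_Behind_Master'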

Banyek lowered the priority of this task from High to Medium. Dec 5 2018, 7:28 PM

Replication has caught up, but I'll hold off on the reimport because I want to clarify a few things beforehand.

The operation would be:

  • reimport the data with the /home/marostegui/reimport_from_master.sh script. (It will take time as the table is ~400G)
  • when it's done, I'll remove the s51230__linkwatcher.linkwatcher_linklog table from the replication filter with

SET SESSION SQL_LOG_BIN=0; SET GLOBAL Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%';

  • restart the slave with START SLAVE\G

Hm, I'll leave this as-is, as I am not sure which user to use.

You should use root to run the script.
If the table is that big you can either reimport it entirely or just diff the table and fix the inconsistencies manually.
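
For example, a coarse diff could be a checksum comparison on both hosts (a sketch; on a table this size it is slow as well):

# Sketch: compare CHECKSUM TABLE output between master and replica.
for host in labsdb1005.eqiad.wmnet labsdb1004.eqiad.wmnet; do
  echo "$host"
  mysql -h "$host" s51230__linkwatcher -e "CHECKSUM TABLE linkwatcher_linklog;"
done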

Marostegui raised the priority of this task from Medium to High. Dec 6 2018, 6:52 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-07T09:40:51Z] <banyek> importing back linkwatcher_linklog into database s51230__linkwatcher on host labsdb1004.eqiad.wmnet. - T211210

The table is too big to find the inconsistencies easily (a SELECT COUNT(*) ... took so long that I stopped it), so I decided to reimport everything. The script is running on labsdb1004 in a screen session named T211210. I'll keep an eye on it.

You could've used compare.py, it might have taken several hours though.

So far I have only used it to see whether there are diffs, not to find them.

That is the point, no? If you find the diffs you can fix them :)
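
As an illustration, a single missing row found that way could be copied over from the master without binlogging it on the replica (the row id below is made up):

# Hypothetical example: copy one missing row (id 123456789 is a placeholder) from master to replica.
mysqldump -h labsdb1005.eqiad.wmnet --no-create-info --where="id=123456789" \
    s51230__linkwatcher linkwatcher_linklog > row.sql
( echo "SET SESSION sql_log_bin=0;"; cat row.sql ) \
    | mysql -h labsdb1004.eqiad.wmnet s51230__linkwatcher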

Importing the table back is still in progress after ~1 day. The file is ~137G so far, while on the source it is 384G.
The pv tool estimates another 23 hours. I'll check on it again later.
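
The estimate comes from streaming the file through pv, presumably something like this (a sketch):

# Sketch: pv reports throughput and an ETA while the dump is streamed into the replica.
pv linkwatcher_linklog.sql \
    | mysql --init-command="SET SESSION sql_log_bin=0" -h labsdb1004.eqiad.wmnet s51230__linkwatcher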

Mentioned in SAL (#wikimedia-operations) [2018-12-10T17:59:31Z] <banyek> restarting mysql instance on labsdb1004 to restore replication filters to the original state - T211210

And after I restarted the instance, I got:

Last_SQL_Error: Could not execute Write_rows_v1 event on table s51230__linkwatcher.linkwatcher_linklog; Duplicate entry '573509232' for key 'ID', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log log.235715, end_log_pos 90204102

I added the table to the ignore list again and restarted replication; we'll see what to do after it has caught up.

That shouldn't have been done. It took 3 days to reimport the table; it should have been checked first whether the duplicate entry was real or not.
I have stopped replication, but we may now have another gap of missed transactions.
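
Checking whether the duplicate was real could have been as simple as comparing that row on both hosts (a sketch):

# Sketch: does id 573509232 exist on both master and replica, and is the row identical?
for host in labsdb1005.eqiad.wmnet labsdb1004.eqiad.wmnet; do
  echo "$host"
  mysql -h "$host" s51230__linkwatcher -e "SELECT * FROM linkwatcher_linklog WHERE id = 573509232\G"
done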

I was thinking we should let it catch up and then redo the import, but with mydumper instead of mysqldump, as we had discussed.
Also, it's weird that it happened again.

It is weird, but why it happened should have been checked rather than just ignoring the table and trying again. If this were a small table, maybe... but given that it will take 3 days again, it could fail again after those 3 days.
As I said earlier on IRC, mydumper/myloader might speed things up, but it won't give us a lightning-fast import, as these hosts are old and have HDDs.
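
A possible mydumper/myloader run for this single table would be something along these lines (a sketch; thread count, output path and credentials are assumptions):

# Sketch: parallel dump of the single table from the master...
mydumper -h labsdb1005.eqiad.wmnet -B s51230__linkwatcher -T linkwatcher_linklog \
    -t 4 -o /srv/tmp/linkwatcher_dump
# ...and parallel load on the replica, overwriting the existing table
# (myloader does not binlog the restore unless -e is passed).
myloader -h labsdb1004.eqiad.wmnet -B s51230__linkwatcher -d /srv/tmp/linkwatcher_dump \
    -t 4 -o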

I am going to see if I can replay the transactions that were skipped.
So far I see skipped transactions in log.235718 through log.235721; the rest of the binlogs up to log.235728 (the one I stopped replication at) look clean.

I think I have replayed all the transactions that were skipped during that time. Of course, going through binlogs is hard and tedious and I might have missed some, but replication is flowing again for now.
We'll see over the next few hours/days once DELETEs or UPDATEs arrive to purge old rows.
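
For reference, replaying a skipped range looked roughly like this (a sketch; the positions are placeholders and the real ranges came from inspecting each binlog):

# Sketch: extract the events for the affected database from a skipped binlog range
# (run where the master binlogs live, e.g. labsdb1005) and apply them on the replica
# without binlogging them there.
mysqlbinlog log.235718 --database=s51230__linkwatcher \
    --start-position=<start> --stop-position=<stop> \
    | mysql --init-command="SET SESSION sql_log_bin=0" -h labsdb1004.eqiad.wmnet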

Marostegui claimed this task.

labsdb1004 finally caught up and so far everything is fine.
The table seems to be in sync.

for i in labsdb1004 labsdb1005; do echo $i; mysql.py -h$i s51230__linkwatcher -e " select max(id) from linkwatcher_linklog;";done
labsdb1004
+-----------+
| max(id)   |
+-----------+
| 574664355 |
+-----------+
labsdb1005
+-----------+
| max(id)   |
+-----------+
| 574664355 |
+-----------+

Closing this for now.