
db1100 replication broken
Closed, Resolved, Public

Description

root@PRODUCTION s5 slave[(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db1070.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db1070-bin.001487
          Read_Master_Log_Pos: 519354783
               Relay_Log_File: db1100-relay-bin.000031
                Relay_Log_Pos: 487021949
        Relay_Master_Log_File: db1070-bin.001487
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Could not execute Update_rows_v1 event on table wikidatawiki.tag_summary; Duplicate entry '355179201' for key 'tag_summary_rev_id', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log db1070-bin.001487, end_log_pos 487021887
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 487021661
              Relay_Log_Space: 519355413
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1062
               Last_SQL_Error: Could not execute Update_rows_v1 event on table wikidatawiki.tag_summary; Duplicate entry '355179201' for key 'tag_summary_rev_id', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log db1070-bin.001487, end_log_pos 487021887
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171978777
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 180359179-180359179-96523837,0-180359179-5734605861,171978777-171978777-14515902,171974884-171974884-1473084269,171970704-171970704-351094624,171978768-171978768-202416
1 row in set (0.00 sec)

root@PRODUCTION s5 slave[(none)]>
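For anyone triaging a similar paste, here is a minimal sketch of how the vertical (`\G`) output above could be parsed to spot a stopped replication thread. It is only an illustration, not a tool used on this host: the function names `parse_slave_status` and `replication_problem` are hypothetical, and `SAMPLE` is a shortened excerpt of the output in this task.

```python
import re

# Shortened excerpt of the SHOW SLAVE STATUS\G output pasted in this task.
SAMPLE = """\
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1062
               Last_SQL_Error: Could not execute Update_rows_v1 event on table wikidatawiki.tag_summary; Duplicate entry '355179201' for key 'tag_summary_rev_id', Error_code: 1062
1 row in set (0.00 sec)
"""


def parse_slave_status(text):
    """Parse vertical (\\G) SHOW SLAVE STATUS output into a dict (hypothetical helper)."""
    status = {}
    for line in text.splitlines():
        # Field lines look like "   Slave_SQL_Running: No". Only the first
        # colon separates name from value, so error text may contain colons.
        match = re.match(r"^\s*(\w+):\s?(.*)$", line)
        if match:
            status[match.group(1)] = match.group(2).strip()
    return status


def replication_problem(status):
    """Return a short description of the failure, or None if both threads run."""
    if status.get("Slave_IO_Running") != "Yes":
        return "IO thread stopped: " + status.get("Last_IO_Error", "")
    if status.get("Slave_SQL_Running") != "Yes":
        return "SQL thread stopped (errno {}): {}".format(
            status.get("Last_SQL_Errno", "?"), status.get("Last_SQL_Error", "")
        )
    return None


if __name__ == "__main__":
    print(replication_problem(parse_slave_status(SAMPLE)))
```

On this output the IO thread is still running and only the SQL thread has stopped, which matches `Seconds_Behind_Master: NULL` together with the 1062 duplicate-key error.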

Event Timeline

This server has already crashed once (T175973), so I believe a full rebuild is probably a good idea.
As I have to build a couple of hosts for s5 on Monday, I will rebuild this one too.

Row deleted:

+----------+-----------+-----------+-----------+----------------+
| ts_id    | ts_rc_id  | ts_log_id | ts_rev_id | ts_tags        |
+----------+-----------+-----------+-----------+----------------+
| 69608570 | 368769825 |      NULL | 355179201 | OAuth CID: 378 |
+----------+-----------+-----------+-----------+----------------+
1 row in set (0.00 sec)

The same happened on db2023. tag_summary is not a reliably replicated table: it has differences on every wiki we checked (see the other tickets checking inconsistencies between servers). That is most likely a query issue rather than a server issue.
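As a rough illustration of what "differences" between servers means here, the sketch below compares two sets of `tag_summary` rows keyed by `ts_id`. It is an assumption-laden example, not the method used in the other tickets: `diff_rows` is a hypothetical helper, the rows are assumed to have been fetched from each host already, and the replica row's `ts_tags` value is made up purely to demonstrate a mismatch.

```python
def diff_rows(rows_a, rows_b, key="ts_id"):
    """Compare two lists of row dicts and report keys that are missing or differ."""
    by_key_a = {row[key]: row for row in rows_a}
    by_key_b = {row[key]: row for row in rows_b}
    shared = set(by_key_a) & set(by_key_b)
    return {
        "only_in_a": sorted(set(by_key_a) - set(by_key_b)),
        "only_in_b": sorted(set(by_key_b) - set(by_key_a)),
        "different": sorted(k for k in shared if by_key_a[k] != by_key_b[k]),
    }


# Row as shown in this task (NULL represented as None).
master_rows = [
    {"ts_id": 69608570, "ts_rc_id": 368769825, "ts_log_id": None,
     "ts_rev_id": 355179201, "ts_tags": "OAuth CID: 378"},
]
# Hypothetical replica copy: the ts_tags value is invented for the example.
replica_rows = [
    {"ts_id": 69608570, "ts_rc_id": 368769825, "ts_log_id": None,
     "ts_rev_id": 355179201, "ts_tags": "OAuth CID: 378,example-tag"},
]

if __name__ == "__main__":
    print(diff_rows(master_rows, replica_rows))
```

A comparison like this only shows that the hosts have drifted. It does not explain why, which is consistent with the point above that the cause is more likely in the queries than in any single server.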

Marostegui assigned this task to jcrespo.

Resolving for now; we'll see if it happens again on other servers too.

Mentioned in SAL (#wikimedia-operations) [2017-11-20T06:25:12Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore original weights for db1100 and db1071 - T180917 (duration: 00m 49s)