
db2057 replication broken due to delete on echo_notification and echo_event
Closed, Resolved (Public)

Description

Creating this task to track the issue:
Replication on db2057 broke several times overnight on deletes against testwiki.echo_notification:

root@PRODUCTION s3[(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db2018.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db2018-bin.003183
          Read_Master_Log_Pos: 820062111
               Relay_Log_File: db2057-relay-bin.000533
                Relay_Log_Pos: 540151115
        Relay_Master_Log_File: db2018-bin.003182
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1146
                   Last_Error: Error 'Table 'testwiki.echo_notification' doesn't exist' on query. Default database: 'testwiki'. Query: 'delete from echo_notification where notification_event =2293'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 540150827
              Relay_Log_Space: 1868643849
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1146
               Last_SQL_Error: Error 'Table 'testwiki.echo_notification' doesn't exist' on query. Default database: 'testwiki'. Query: 'delete from echo_notification where notification_event =2293'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180359174
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171966669-4032579870,180359174-180359174-94123433,171966669-171966669-215805333
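
For reference, the check done by eye on the output above (SQL thread stopped, error 1146) can be scripted. This is only a minimal sketch: the helper names `parse_slave_status` and `sql_thread_broken_by_missing_table` are hypothetical, and it assumes the vertical (`\G`) output has been captured as plain text.

```python
def parse_slave_status(text):
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "*** 1. row ***" separator.
        if not line or line.startswith("*"):
            continue
        # Split on the first colon only; error messages contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def sql_thread_broken_by_missing_table(status):
    """True when the SQL thread is stopped with error 1146 (table doesn't exist)."""
    return (
        status.get("Slave_SQL_Running") == "No"
        and status.get("Last_SQL_Errno") == "1146"
    )


# Excerpt of the output pasted in this task.
sample = """
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
               Last_SQL_Errno: 1146
               Last_SQL_Error: Error 'Table 'testwiki.echo_notification' doesn't exist' on query. Default database: 'testwiki'. Query: 'delete from echo_notification where notification_event =2293'
"""
status = parse_slave_status(sample)
print(sql_thread_broken_by_missing_table(status))
```

The IO thread is still running here, so only the SQL thread state and error number are checked.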

I skipped that query, and replication broke again on the same kind of query, this time against the echo_event table, which also doesn't exist on db2057.

I checked the binlogs on the master and found the following (starting at around 21:30:22 UTC):

delete from echo_notification where notification_event=53922
delete from echo_event where event_id=53922
delete from echo_notification where notification_event =2293
delete from echo_event where event_id=2293
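
The pattern in these statements (each event id deleted from both tables) can be confirmed with a small script. This is a rough sketch, assuming the statements have already been extracted from the binlog as text; the function name `tables_by_event_id` and the regular expression are illustrative, not part of any existing tooling.

```python
import re
from collections import defaultdict

# Matches statements of the form seen in the binlog, e.g.
# "delete from echo_event where event_id=2293" (spaces around "=" optional).
DELETE_RE = re.compile(
    r"delete\s+from\s+(\w+)\s+where\s+(\w+)\s*=\s*(\d+)", re.IGNORECASE
)


def tables_by_event_id(statements):
    """Group the tables touched by each DELETE under the id it filters on."""
    ids = defaultdict(list)
    for stmt in statements:
        match = DELETE_RE.search(stmt)
        if match:
            table, _column, value = match.groups()
            ids[int(value)].append(table)
    return dict(ids)


# The four statements found in the master's binlog.
statements = [
    "delete from echo_notification where notification_event=53922",
    "delete from echo_event where event_id=53922",
    "delete from echo_notification where notification_event =2293",
    "delete from echo_event where event_id=2293",
]
grouped = tables_by_event_id(statements)
for event_id, tables in grouped.items():
    print(event_id, tables)
```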

So I created those two tables on db2057 (they exist on all the other s3 hosts).

Those lowercase deletes look like they may have been run manually. I checked SAL, Gerrit and Phabricator but couldn't find anything related to them.

Event Timeline

Marostegui claimed this task.

I am resolving this because replication is no longer broken.