db2094 s3 replication broke
Closed, Resolved · Public

Description

Update_rows_v1 event on table cawikimedia.archive: Duplicate entry '1291' for key 'ar_revid_uniq', Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the event's master log db2074-bin.001934, end_log_pos 695355661

It broke last night at db2074-bin.001934:695355241.
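
For triage, the error line above can be split into its parts mechanically. This is a minimal Python sketch, assuming the message keeps the shape quoted in this task. `parse_duplicate_key_error` and the regular expression are illustrative names, not part of any MariaDB or Wikimedia tooling.

```python
import re
from typing import Dict, Optional

# Pattern written against the single error line quoted in this task.
# Other server versions may word the message differently (assumption).
ERROR_RE = re.compile(
    r"event on table (?P<table>[\w.]+): "
    r"Duplicate entry '(?P<entry>[^']*)' for key '(?P<key>[^']*)'"
    r".*?master log (?P<log_file>[\w.\-]+), end_log_pos (?P<end_log_pos>\d+)"
)


def parse_duplicate_key_error(message: str) -> Optional[Dict[str, str]]:
    """Return table, duplicate entry, key name, binlog file and end position,
    or None when the message does not match the expected shape."""
    match = ERROR_RE.search(message)
    return match.groupdict() if match else None


line = (
    "Update_rows_v1 event on table cawikimedia.archive: Duplicate entry '1291' "
    "for key 'ar_revid_uniq', Error_code: 1062: handler error "
    "HA_ERR_FOUND_DUPP_KEY: the event's master log db2074-bin.001934, "
    "end_log_pos 695355661"
)
info = parse_duplicate_key_error(line)
```

`end_log_pos` is the end of the failing event, which would explain why it is later than the position at which replication stopped.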

Maybe all of s3 in codfw has this issue, but it only shows up on the host using ROW-based replication? Alternatively, it only occurs here because this is the only host using replication filters and 900+ wikis.

Restricted Application added a subscriber: Aklapper. · Nov 2 2018, 8:12 AM
jcrespo triaged this task as High priority. · Nov 2 2018, 8:12 AM
jcrespo moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2018-11-02T11:43:19Z] <jynus> ignoring cawikimedia.archive replication on db2094:s3 until a reimport happens T208565

Mentioned in SAL (#wikimedia-operations) [2018-11-02T15:00:55Z] <jynus> stopping replication on db2074 to fix db2094:s3 T208565

Mentioned in SAL (#wikimedia-operations) [2018-11-02T15:12:18Z] <jynus> restarting replication @ db2074 after db2094:s3 table fix T208565

jcrespo closed this task as Resolved. · Nov 2 2018, 3:14 PM

This is technically fixed, but we should do a deeper check on the causes of this. There could be some drift on this or other nearby dbs that only manifests under ROW-based replication.
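
Drift like this can stay hidden under statement-based replication, where the replica re-executes the statement against whatever rows it has. A row-based event carries the row images, so a missing row or a unique-key collision stops the replica. As a rough sketch of the deeper check, assuming both hosts' rows for a table have already been fetched into dictionaries keyed by the unique column, a comparison could look like this in Python. The function name `diff_by_key` and the sample rows are hypothetical, not data from the incident.

```python
from typing import Dict, Hashable, List, Tuple

Row = Tuple  # one table row as a tuple of column values


def diff_by_key(
    primary_rows: Dict[Hashable, Row],
    replica_rows: Dict[Hashable, Row],
) -> Dict[str, List[Hashable]]:
    """Compare two {unique_key: row} snapshots and list the keys that drifted."""
    missing = sorted(k for k in primary_rows if k not in replica_rows)
    extra = sorted(k for k in replica_rows if k not in primary_rows)
    different = sorted(
        k
        for k in primary_rows
        if k in replica_rows and primary_rows[k] != replica_rows[k]
    )
    return {
        "missing_on_replica": missing,
        "extra_on_replica": extra,
        "different": different,
    }


# Hypothetical snapshots keyed by a unique revision id (made-up values).
primary = {1290: ("Page_A", "t1"), 1291: ("Page_B", "t2")}
replica = {1290: ("Page_A", "t1"), 1291: ("Page_C", "t2"), 1292: ("Page_D", "t3")}

report = diff_by_key(primary, replica)
```

On real tables this would need chunking rather than loading everything into memory. The sketch only shows the comparison itself.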