
Duplicate key on several s8 replicas breaking replication
Closed, ResolvedPublic

Description

[2018-11-04 23:26:58] SERVICE ALERT: db1109;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]
[2018-11-04 23:26:54] SERVICE ALERT: db1099;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]
[2018-11-04 23:26:33] SERVICE ALERT: db1104;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]
[2018-11-04 23:26:28] SERVICE ALERT: db1116;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]
[2018-11-04 23:26:23] SERVICE ALERT: db1092;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]
[2018-11-04 23:26:23] SERVICE ALERT: db2045;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]
[2018-11-04 23:25:43] SERVICE ALERT: db1101;MariaDB Slave SQL: s8;CRITICAL;HARD;3;CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error 'Duplicate entry '745452474-1295751' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. [Query snipped]

Event Timeline

The table is 'revision_comment_temp'

Now I am running

./compare.py wikidatawiki revision_comment_temp revcomment_rev db1071.eqiad.wmnet db1104.eqiad.wmnet

to see if there are any more differences in that table between the master and one of the broken slaves.
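For readers unfamiliar with this kind of check, here is a minimal sketch of a chunked table comparison keyed on one column. It is not the code of compare.py; it only illustrates the general idea. SQLite stands in for MariaDB so the snippet is self-contained, and the two-column table layout, `make_db`, `fetch_chunk` and `compare_tables` are all assumptions made for illustration.

```python
import sqlite3

# Assumed, simplified layout of the table (illustration only).
SCHEMA = """
CREATE TABLE revision_comment_temp (
    revcomment_rev INTEGER NOT NULL,
    revcomment_comment_id INTEGER NOT NULL,
    PRIMARY KEY (revcomment_rev, revcomment_comment_id)
)
"""


def make_db(rows):
    """Build an in-memory stand-in for one database host."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO revision_comment_temp VALUES (?, ?)", rows)
    return conn


def fetch_chunk(conn, table, key, start, size):
    """Return all rows with start <= key < start + size, ordered by key."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {key} >= ? AND {key} < ? ORDER BY {key}",
        (start, start + size),
    )
    return cur.fetchall()


def compare_tables(conn_a, conn_b, table, key, chunk_size=1000):
    """Return the (first_key, last_key) ranges whose rows differ between hosts."""
    bounds = []
    for conn in (conn_a, conn_b):
        lo, hi = conn.execute(f"SELECT MIN({key}), MAX({key}) FROM {table}").fetchone()
        if lo is not None:
            bounds.append((lo, hi))
    if not bounds:
        return []  # both tables are empty
    lo = min(b[0] for b in bounds)
    hi = max(b[1] for b in bounds)
    differing = []
    for start in range(lo, hi + 1, chunk_size):
        rows_a = fetch_chunk(conn_a, table, key, start, chunk_size)
        rows_b = fetch_chunk(conn_b, table, key, start, chunk_size)
        if rows_a != rows_b:
            differing.append((start, start + chunk_size - 1))
    return differing
```

An empty result means no chunk differed at the moment of the check; it says nothing about rows that differed earlier and were already fixed or overwritten.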

Banyek moved this task from Triage to In progress on the DBA board.
Addshore moved this task from incoming to monitoring on the Wikidata board.
Addshore subscribed.

the comparison says there are no differences

Banyek lowered the priority of this task from High to Medium. Nov 5 2018, 12:19 PM

as the comparison says the table is ok, I triage it as 'normal'

The corresponding revision-table row is timestamped 2018-09-13T09:08:17Z. $wgCommentTableSchemaMigrationStage should have been WRITE_BOTH since February 2018, so the revision_comment_temp row should already have existed and the migration script shouldn't have needed to insert it in the first place.

I note that 2018-09-13T09:08:17Z is right at the beginning of the range mentioned in T206743#4658274. Perhaps it got missed somehow when cleaning up after T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared")?

it got missed somehow

How can that be? The table was empty or clean at that time according to my notes, and this was verified by 2 people independently (Manuel and me) on 2 different servers, as I have in my notes from 16 Oct and later at T206743#4691048, unless I am making a huge mistake or missing something. And even if it wasn't, and somehow 2 people independently skipped it, the different edits added more than 1 comment, but apparently only 1 row was different?

Could an interaction with archive create issues, e.g. the undeletion of a revision, so that it gets moved from archive to revision, given that archive is known to have been unreliable in the past?

Could an interaction with archive create issues, e.g. the undeletion of a revision, so that it gets moved from archive to revision, given that archive is known to have been unreliable in the past?

The page in question has no entries in the logging table; if it had been deleted and undeleted, I'd expect to see corresponding entries there. It could be something weird from Wikibase, but even then, why only the one row and not any others before or after?

Q123507, specifically this revision.

What happened is that one row in the revision_comment_temp table was missing on the master (db1071, I believe) but was present on several replicas. And apparently it was only the one row.
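To illustrate why that state stops replication, here is a minimal sketch, with SQLite standing in for MariaDB. The simplified table layout and the `make_db` and `replay` helpers are assumptions for illustration; only the key values are taken from the alert above. The same INSERT succeeds on the host that lacks the row and fails with a duplicate-key error on a host that already has it.

```python
import sqlite3

# Assumed, simplified layout of the table (illustration only).
SCHEMA = """
CREATE TABLE revision_comment_temp (
    revcomment_rev INTEGER NOT NULL,
    revcomment_comment_id INTEGER NOT NULL,
    PRIMARY KEY (revcomment_rev, revcomment_comment_id)
)
"""

INSERT = "INSERT INTO revision_comment_temp VALUES (?, ?)"
ROW = (745452474, 1295751)  # key values from the 'Duplicate entry' alert


def make_db(rows):
    """Build an in-memory stand-in for one database host."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(INSERT, rows)
    return conn


def replay(conn, statement, params):
    """Apply one replicated write; report whether it applied or hit a duplicate key."""
    try:
        conn.execute(statement, params)
        return "applied"
    except sqlite3.IntegrityError as exc:
        # On MariaDB this corresponds to Errno 1062, and the SQL thread stops.
        return f"duplicate key, replication would stop: {exc}"


master = make_db([])      # the row is missing on the master
replica = make_db([ROW])  # but already present on the replica

master_result = replay(master, INSERT, ROW)    # succeeds and would be written to the binlog
replica_result = replay(replica, INSERT, ROW)  # fails when the same write is replayed
```

This also shows why a comparison run after the incident can report no differences: once the master has inserted the row, both sides hold the same data, even though the replicas could not apply the replicated write.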

Marostegui subscribed.

I am going to consider this resolved.