Recent duplicate entries on change_tag on sanitarium hosts
Closed, Resolved (Public)

Description

A few days ago I saw a duplicate-entry replication error on one of the sanitarium hosts (reminder: they use row-based replication) for the frwiki database (s6), which matched the time when the new writes to change_tag happened.
I thought it was just a one-time thing and I rebuilt the table for that one.

However, today I have seen it happen again on bgwiki (s2), on both sanitarium hosts in eqiad and codfw, and even for different entries:

On db1125: Last_Error: Could not execute Update_rows_v1 event on table bgwiki.change_tag; Duplicate entry '6114880-visualeditor' for key 'change_tag_rev_tag', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log db1074-bin.003024, end_log_pos 818204967

And on db2095: Last_SQL_Error: Could not execute Update_rows_v1 event on table bgwiki.change_tag; Duplicate entry '3964232-CategoryMaster' for key 'change_tag_log_tag', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log db2063-bin.004170, end_log_pos 638399139

I am pretty sure we fixed all the inconsistencies on the change_tag table (T161510 T160509), but I could be wrong. However, it is very suspicious that this shows up right after we started writing more to the change_tag table everywhere (https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/446804/), as it had been like this for more than a year without any issues // cc @Ladsgroup

We should go ahead and do a general check on the change_tag table across all the wikis again.

As bgwiki.change_tag was pretty small, I have rebuilt it on eqiad to make sure the labs hosts are not affected for long. For the codfw host, however, we can take our time and check what's wrong.

Update

Looks like there were some inconsistencies left on bgwiki.change_tag and frwiki.change_tag that were not fixed when we did the massive clean up: for some time we were using pt-table-checksum, which relies on a PK, and change_tag didn't have a PK at the time, so until Jaime developed compare.py we excluded change_tag for some sections. We should do a quick round of compare.py on change_tag across all sections.

Wikis with inconsistencies detected and pending to be fixed on change_tag table:

  • s1 is fine - no differences detected
  • bgwiki (s2) T154485
  • cswiki (s2)
  • enwiktionary (s2)
  • eowiki (s2)
  • fiwiki (s2)
  • idwiki (s2)
  • itwiki (s2)
  • nlwiki (s2)
  • nowiki (s2)
  • plwiki (s2)
  • ptwiki (s2)
  • svwiki (s2)
  • thwiki (s2)
  • trwiki (s2)
  • zhwiki (s2)
  • ckbwiki (s3)
  • dawiki (s3)
  • elwiki (s3)
  • eswikivoyage (s3)
  • euwiki (s3)
  • fawikivoyage (s3)
  • glwiki (s3)
  • itwikiversity (s3)
  • jvwiki (s3)
  • kshwiki (s3)
  • mediawikiwiki (s3)
  • mswiki (s3)
  • orwiki (s3)
  • outreachwiki (s3)
  • ruwiktionary (s3)
  • skwiki (s3)
  • specieswiki (s3)
  • uawikimedia (s3)
  • urwiki (s3)
  • zhwikisource (s3)
  • commonswiki (s4)
  • s5 is fine - no differences detected
  • s7 is fine - no differences detected
  • s8 is fine - no differences detected

Event Timeline

@Ladsgroup is it possible to disable this feature until we better understand what's going on?

Mentioned in SAL (#wikimedia-operations) [2018-07-20T07:46:01Z] <marostegui> Stop replicaiton on s8 codfw master - T200061

As per my IRC chat with Amir, I am checking consistency on wikidata.
He will elaborate a bit more once he gets to the office on what the possible causes may be and on why only frwiki and bgwiki have shown this so far.

Mentioned in SAL (#wikimedia-operations) [2018-07-20T08:47:11Z] <Amir1> stopping populateChangeTagDef.php (T200061)

Mentioned in SAL (#wikimedia-operations) [2018-07-20T08:50:38Z] <marostegui> Stop replication on s7 codfw master to check arwiki.change_tag - T200061

I started the script to back-populate change_tag_def from all.dblist and it was going in alphabetical order (until I stopped it five minutes ago, on bowiki); several days ago I also ran it on three groups, 'wikisource', 'wikivoyage' and 'wikibooks', plus 'frwiki'. OTOH, this feature is enabled everywhere, so if there were a problem in the feature itself it should show up on, for example, wikidatawiki (which is pretty big); so I think it's the migration script and not the feature.
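For context: populateChangeTagDef.php backfills the new change_tag_def table and then sets ct_tag_id on existing change_tag rows, which is exactly the kind of UPDATE visible in the failing row events below. A conceptual sketch of that second step (assuming the ctd_id/ctd_name column names from the MediaWiki schema; the real script batches these writes through the MediaWiki DB layer rather than issuing raw SQL like this):

-- Conceptual sketch only, not the script's literal SQL: link each change_tag row
-- to its change_tag_def entry by tag name, one batch of ct_id values at a time.
UPDATE change_tag
JOIN change_tag_def ON ctd_name = ct_tag
SET ct_tag_id = ctd_id
WHERE ct_tag_id IS NULL
  AND ct_id BETWEEN 185000 AND 186000; -- hypothetical batch boundaries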

Timeline:

frwiki.change_tag broke ONLY on the codfw sanitarium, which is weird as they have the same data (codfw populated the eqiad ones, actually).
bgwiki.change_tag broke on both eqiad and codfw, even with different keys (check the task description).
wikidatawiki.change_tag on codfw showed no differences when comparing tables (the script didn't run here yet).
arwiki.change_tag on codfw showed no differences when comparing tables (the script did run here, but the table is tiny, 500MB).

Could it be some sort of race condition on the script?

Regarding bgwiki, I'm pretty sure it's because of the script: revision '6114880' in bgwiki was made in 2014 (and we don't add new tags for old revisions, even though it's technically possible) and the logging action 3964232 was made in 2016.

For db2095:

This is what was being UPDATED and failed:

Last_SQL_Error: Could not execute Update_rows_v1 event on table bgwiki.change_tag; Duplicate entry '3964232-CategoryMaster' for key 'change_tag_log_tag', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log db2063-bin.004170, end_log_pos 638399139
UPDATE `bgwiki`.`change_tag`
### WHERE
###   @1=185514 /* INT meta=0 nullable=0 is_null=0 */
###   @2=13111279 /* INT meta=0 nullable=1 is_null=0 */
###   @3=3964232 /* INT meta=0 nullable=1 is_null=0 */
###   @4=NULL /* INT meta=0 nullable=1 is_null=1 */
###   @5='CategoryMaster' /* VARSTRING(255) meta=255 nullable=0 is_null=0 */
###   @6=NULL /* BLOB/TEXT meta=2 nullable=1 is_null=1 */
###   @7=NULL /* INT meta=0 nullable=1 is_null=1 */
### SET
###   @1=185514 /* INT meta=0 nullable=0 is_null=0 */
###   @2=13111279 /* INT meta=0 nullable=1 is_null=0 */
###   @3=3964232 /* INT meta=0 nullable=1 is_null=0 */
###   @4=NULL /* INT meta=0 nullable=1 is_null=1 */
###   @5='CategoryMaster' /* VARSTRING(255) meta=255 nullable=0 is_null=0 */
###   @6=NULL /* BLOB/TEXT meta=2 nullable=1 is_null=1 */
###   @7=1 /* INT meta=0 nullable=1 is_null=0 */
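A hedged reconstruction of what that row event amounts to in plain SQL (not the statement the master actually executed); @1..@7 follow the change_tag column order ct_id, ct_rc_id, ct_log_id, ct_rev_id, ct_tag, ct_params, ct_tag_id:

-- Hedged reconstruction of the Update_rows event above.
UPDATE change_tag
SET ct_tag_id = 1
WHERE ct_id = 185514
  AND ct_rc_id = 13111279
  AND ct_log_id = 3964232
  AND ct_rev_id IS NULL
  AND ct_tag = 'CategoryMaster'
  AND ct_params IS NULL
  AND ct_tag_id IS NULL;

The replica presumably locates the row by the primary key ct_id alone and then applies the after-image, which is what makes the swapped rows shown below collide with the unique key.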

This is what that row looks like:

root@db2095.codfw.wmnet[bgwiki]> select * from change_tag where ct_log_id=3964232;
+--------+----------+-----------+-----------+----------------+-----------+-----------+
| ct_id  | ct_rc_id | ct_log_id | ct_rev_id | ct_tag         | ct_params | ct_tag_id |
+--------+----------+-----------+-----------+----------------+-----------+-----------+
| 185515 | 13111279 |   3964232 |      NULL | CategoryMaster | NULL      |      NULL |
+--------+----------+-----------+-----------+----------------+-----------+-----------+
1 row in set (0.04 sec)

And we have this UNIQUE key on the table:

UNIQUE KEY `change_tag_log_tag` (`ct_log_id`,`ct_tag`),

Yet only one row matches both columns of that key:
root@db2095.codfw.wmnet[bgwiki]> select * from change_tag where ct_log_id=3964232 and ct_tag='CategoryMaster';
+--------+----------+-----------+-----------+----------------+-----------+-----------+
| ct_id  | ct_rc_id | ct_log_id | ct_rev_id | ct_tag         | ct_params | ct_tag_id |
+--------+----------+-----------+-----------+----------------+-----------+-----------+
| 185515 | 13111279 |   3964232 |      NULL | CategoryMaster | NULL      |      NULL |
+--------+----------+-----------+-----------+----------------+-----------+-----------+
1 row in set (0.04 sec)

This is extremely weird:

wikiadmin@10.64.48.34(bgwiki)> select * from change_tag where ct_log_id = 3964232;
+--------+----------+-----------+-----------+----------------+-----------+-----------+
| ct_id  | ct_rc_id | ct_log_id | ct_rev_id | ct_tag         | ct_params | ct_tag_id |
+--------+----------+-----------+-----------+----------------+-----------+-----------+
| 185515 | 13111279 |   3964232 |      NULL | CategoryMaster | NULL      |        10 |
+--------+----------+-----------+-----------+----------------+-----------+-----------+
1 row in set (0.00 sec)
wikiadmin@10.64.48.34(bgwiki)> select * from change_tag where ct_id = 185514;
+--------+----------+-----------+-----------+--------------+-----------+-----------+
| ct_id  | ct_rc_id | ct_log_id | ct_rev_id | ct_tag       | ct_params | ct_tag_id |
+--------+----------+-----------+-----------+--------------+-----------+-----------+
| 185514 | 13111280 |      NULL |   7240978 | visualeditor | NULL      |         1 |
+--------+----------+-----------+-----------+--------------+-----------+-----------+
1 row in set (0.00 sec)

The rows basically got swapped, so replication tries to update the wrong row :/
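To make that concrete, here is a minimal sketch of the failure mode on a scratch table (columns trimmed and values made up where not shown above; this was not run against production):

-- Minimal sketch of the failure mode; not production data.
CREATE TABLE ct_demo (
  ct_id INT PRIMARY KEY,
  ct_log_id INT,
  ct_rev_id INT,
  ct_tag VARBINARY(255) NOT NULL,
  ct_tag_id INT,
  UNIQUE KEY change_tag_log_tag (ct_log_id, ct_tag)
);
-- Replica state: the two rows carry swapped ct_id values compared to the master.
INSERT INTO ct_demo VALUES
  (185514, NULL, 7240978, 'visualeditor', NULL),
  (185515, 3964232, NULL, 'CategoryMaster', NULL);
-- The replicated event targets ct_id = 185514, which on the master is the
-- CategoryMaster row; applying its after-image here would create a second
-- (3964232, 'CategoryMaster') pair, so it fails with ERROR 1062 on
-- change_tag_log_tag, just like the replication error in the description.
UPDATE ct_demo
SET ct_log_id = 3964232, ct_rev_id = NULL, ct_tag = 'CategoryMaster', ct_tag_id = 1
WHERE ct_id = 185514;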

Mentioned in SAL (#wikimedia-operations) [2018-07-20T09:46:24Z] <marostegui> Stop replication on db2063 to fix db2095 - T200061

Maybe that is a leftover from all the work done on T154485? It wouldn't be strange, as we checksummed thousands of tables, so stuff could have slipped through and the script hit it?

/home/marostegui/section s2 | grep -v db2095  | while read host port; do echo "===$host:$port==="; mysql.py -h$host -P$port bgwiki -e "select * from change_tag where ct_log_id = 3964232;" -BN ; done
===labsdb1011.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===labsdb1010.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===labsdb1009.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===dbstore2002.codfw.wmnet:3312===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===dbstore2001.codfw.wmnet:3312===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===dbstore1002.eqiad.wmnet:3306===
185430	13111279	3964232	NULL	CategoryMaster	NULL	10
===db2091.codfw.wmnet:3312===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db2088.codfw.wmnet:3312===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db2063.codfw.wmnet:3306===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db2056.codfw.wmnet:3306===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db2049.codfw.wmnet:3306===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db2041.codfw.wmnet:3306===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db2035.codfw.wmnet:3306===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db1125.eqiad.wmnet:3312===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===db1122.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===db1105.eqiad.wmnet:3312===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db1103.eqiad.wmnet:3312===
185514	13111279	3964232	NULL	CategoryMaster	NULL	1
===db1090.eqiad.wmnet:3312===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===db1076.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===db1074.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10
===db1066.eqiad.wmnet:3306===
185515	13111279	3964232	NULL	CategoryMaster	NULL	10

As this table is really small for bgwiki, I am going to work on it to fix the inconsistencies and will do the same with frwiki (as it is small there too).

Change 447050 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1090:3312 and db1103:3312

https://gerrit.wikimedia.org/r/447050

Change 447050 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1090:3312 and db1103:3312

https://gerrit.wikimedia.org/r/447050

Mentioned in SAL (#wikimedia-operations) [2018-07-20T10:14:26Z] <marostegui> Stop replication in sync on db1090:3312 and db1103:3312 - T200061

Mentioned in SAL (#wikimedia-operations) [2018-07-20T10:17:46Z] <marostegui> Stop replication in sync on db1090:3312 and db2035 - T200061

Change 447051 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1103:3312, depool db1105:3312

https://gerrit.wikimedia.org/r/447051

Change 447051 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1103:3312, depool db1105:3312

https://gerrit.wikimedia.org/r/447051

Mentioned in SAL (#wikimedia-operations) [2018-07-20T10:30:35Z] <marostegui> Stop replication in sync on db1090:3312 and db1105:3312 - T200061

So this is what we have in the backups for that particular entry (this is from dbstore2002, as that is the backup source for this wiki):

backup from: 2018-07-04

(185514,13111279,3964232,NULL,"CategoryMaster",NULL,NULL),
(185515,13111280,NULL,7240978,"visualeditor",NULL,NULL),

backup from: 2018-07-10

(185514,13111279,3964232,NULL,"CategoryMaster",NULL,NULL),
(185515,13111280,NULL,7240978,"visualeditor",NULL,NULL),

backup from: 2018-07-18

(185514,13111279,3964232,NULL,"CategoryMaster",NULL,NULL),
(185515,13111280,NULL,7240978,"visualeditor",NULL,NULL),

So the rows were the same for at least a month, so maybe this is a leftover from the inconsistencies clean up for s2 (T154485) and the script hit it.
I have fixed bgwiki.change_tag already.

And I will start with frwiki.change_tag, the other wiki where the script ran into this.

We should probably do a quick check across all the wikis to make sure there are not many change_tag leftovers to clean up before re-running the script. Luckily with compare.py that shouldn't be too hard.
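For anyone doing that check by hand, the idea is simply to run the same query over the same ct_id range on every host of a section and diff the output; a rough illustrative stand-in for what compare.py automates chunk by chunk (not its actual query):

-- Run independently on each host of the section and compare the output.
-- The ct_id range is just an example chunk; QUOTE() keeps NULLs distinguishable.
SELECT COUNT(*) AS row_count,
       BIT_XOR(CRC32(CONCAT_WS('#',
         ct_id, QUOTE(ct_rc_id), QUOTE(ct_log_id), QUOTE(ct_rev_id),
         QUOTE(ct_tag), QUOTE(ct_params), QUOTE(ct_tag_id)))) AS chunk_checksum
FROM change_tag
WHERE ct_id BETWEEN 185000 AND 186000;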

Change 447053 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1090:3312, db1105:3312

https://gerrit.wikimedia.org/r/447053

Change 447053 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1090:3312, db1105:3312

https://gerrit.wikimedia.org/r/447053

I think I know the history behind those inconsistencies on s2 and s6.
Those two sections were done when we were relying on pt-table-checksum, which cannot work without a PK on the table. And change_tag didn't have a PK at the time, so it was excluded.

Change 447061 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3316,db1085

https://gerrit.wikimedia.org/r/447061

Change 447061 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3316,db1085

https://gerrit.wikimedia.org/r/447061

Change 447062 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1096:3316, depool db1098:3316

https://gerrit.wikimedia.org/r/447062

Change 447062 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1096:3316, depool db1098:3316

https://gerrit.wikimedia.org/r/447062

Change 447069 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Repool db1098:3316, depool db1113:3316

https://gerrit.wikimedia.org/r/447069

Change 447069 merged by Marostegui:
[operations/mediawiki-config@master] db-eqiad.php: Repool db1098:3316, depool db1113:3316

https://gerrit.wikimedia.org/r/447069

@Ladsgroup feel free to run the script on the rest of the s6 wikis (jawiki, ruwiki); I have fixed the table there (and also fixed frwiki, although the script already ran for that wiki).

On Monday I will run a check on all the wikis in s2

Maybe that is a leftover from all the work done on T154485? It wouldn't be strange, as we checksummed thousands of tables, so stuff could have slipped through and the script hit it?

pt-table-checksum doesn't work on ROW-based replication, because it replicates the master's checksum result instead of having each replica compute its own, possibly different, one. If change_tag didn't have a primary key at the time (or even if it had one, but unsafe statements were run), it is quite possible that differences were generated that were never really detected.
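For context, pt-table-checksum works by injecting its checksum statements into the replication stream, roughly along these lines (a simplified sketch, not the tool's exact query or the full percona.checksums schema):

-- Simplified sketch of the pattern: the REPLACE ... SELECT is meant to be
-- re-executed on every replica so this_cnt/this_crc reflect local data.
REPLACE INTO percona.checksums (db, tbl, chunk, this_cnt, this_crc)
SELECT 'bgwiki', 'change_tag', 1, COUNT(*),
       BIT_XOR(CRC32(CONCAT_WS('#', ct_id, ct_tag)))
FROM bgwiki.change_tag
WHERE ct_id BETWEEN 1 AND 10000; -- hypothetical chunk boundaries
-- Under ROW-based replication the replica just receives the master's result row
-- for percona.checksums instead of re-running the SELECT, so a replica whose
-- change_tag rows differ still ends up storing the master's checksum.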

When we ran pt-table-checksum we excluded tables without a PK, like change_tag at the time.

@Ladsgroup the following sections have been checked and have reported no differences, so you can go ahead and run the script for them if you wish:

s1
s5
s6 (already fixed)
s7
s8

I will fix the differences on s2, s3 (just a bunch of wikis) and s4.

Change 447388 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1074, db1105:3312

https://gerrit.wikimedia.org/r/447388

Change 447388 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1074, db1105:3312

https://gerrit.wikimedia.org/r/447388

@Ladsgroup s2 is also fixed, you can run it there too

Change 447419 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081, db1097:3314

https://gerrit.wikimedia.org/r/447419

Change 447419 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081, db1097:3314

https://gerrit.wikimedia.org/r/447419

Change 447574 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081, db1121

https://gerrit.wikimedia.org/r/447574

Change 447574 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081, db1121

https://gerrit.wikimedia.org/r/447574

@Ladsgroup All the affected wikis on s3 are fixed. You may run the script on s3 now.

@Ladsgroup s4 is also fixed.
That means all the reported wikis on the above list are now fixed!