mariadb multi-source replication glitch with site_identifiers
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Springle
	Jul 23 2015, 6:06 AM

Description

Both dbstore200[12] boxes in codfw saw replication for s7 stop with duplicate key error:

Error 'Duplicate entry 'interwiki-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites  */  INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa')'

It is genuine problem. Table is InnoDB and key does indeed exist, so engine is justified in throwing the error. However, none of the other InnoDB slaves were affected, nor the dbstore100[12] boxes in eqiad, which also use multi-source replication. We have seen similar issues before, but with TokuDB bugs, not InnoDB. Misdiagnosis previously? Else, wtf?

DBSiteStore::saveSites job does a large DELETE first:

DELETE /* DBSiteStore::saveSites  */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9', ... '874','875','876','877','878','882','890','879','883');

... then INSERTs records one-by-one:

INSERT /* DBSiteStore::saveSites  */  INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa');

Theories/questions:

Somehow DELETE failed or was skipped? Yet only specific records were affected, not all ~800, so transaction must have executed. Single-threaded slaves were ok too, so maybe multi-source related? But why dbstore100[12] not affected? Check through list https://mariadb.atlassian.net/browse/MDEV-253?jql=text%20~%20%22multi-source%22

dbstore200[12] are 10.0 multi-source slaves that also replicate from 10.0 master (dbstore100[12] are 10.0 but replicate from eqiad 5.5 masters). Interestingly, labsdb100x boxes are the only other 10.0 multi-source slaves with 10.0 master db1069, and they have seen replication issues that were previously put down to TokuDB. Not so?

dbstore s7 has showed strange problems recently (T104471) where data simply stopped replicating, though SQL/IO threads continued to run without error. That seemed related to replication rules on s7 which are "special" and include centralauth.%. Coincidence?

Related Objects
Search...

Status	Assigned	Task
Resolved	• jcrespo	T85266 Look into Maria 10 parallel-replication
Resolved	• jcrespo	T105135 Implement mariadb 10.0 masters
Resolved	• jcrespo	T106647 mariadb multi-source replication glitch with site_identifiers

Event Timeline

• Springle created this task.Jul 23 2015, 6:06 AM

• Springle claimed this task.

• Springle raised the priority of this task from to Needs Triage.

• Springle updated the task description. (Show Details)

• Springle added projects: DBA, acl*sre-team.

• Springle added subscribers: • Springle, • jcrespo.

Restricted Application added subscribers: Matanya, Aklapper. · View Herald TranscriptJul 23 2015, 6:06 AM

dbstore s7 has showed strange problems recently (T104471)

That was not a "strange problem", that was a misconfiguration error due to wrong filtering.

Also, dbstore100[12] issues where different, where a stop slave; start slave; fixed replication because false duplicate key errors.

Possibilities:

If the delete happened at the time of the previous import, it could be just a re-population error
If the delete happened just before, maybe there is a b) another misconfiguration (replication filtering) b) a bug (default parallel replication)/upgrade error.

Are we sure this only happens on multi-source hosts? I would say T105879 is a better related error (duplicate key) on a host that was recently updated/restarted.

P1047 Replication error on dbstore2001

1	Actual error found by icinga
2	============================
3	Error 'Duplicate entry 'equivalent-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')'
4
5
6	Error log
7	=========
8	/* original error */
9	150722 19:07:51 [ERROR] Master 's7': Slave SQL: Error 'Duplicate entry 'equivalent-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')', Gtid 0-101622-888355486, Internal MariaDB error code: 1062
10	150722 19:07:51 [Warning] Master 's7': Slave: Duplicate entry 'equivalent-fa' for key 'site_ids_type' Error_code: 1062
11	150722 19:07:51 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2029-bin.000524' position 267965354
12
13	/* This is me trying a START SLAVE, as that sometimes fixed the problem with tokudb tables (supposedly because a bug) */
14	150722 19:30:13 [Note] Master 's7': Slave SQL thread initialized, starting replication in log 'db2029-bin.000524' at position 267965354, relay log './dbstore2001-relay-bin-s7.000098' position: 267965642
15
16	/* but we get the same result */
17	150722 19:30:13 [ERROR] Master 's7': Slave SQL: Error 'Duplicate entry 'equivalent-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')', Gtid 0-101622-888355486, Internal MariaDB error code: 1062
18	150722 19:30:13 [Warning] Master 's7': Slave: Duplicate entry 'equivalent-fa' for key 'site_ids_type' Error_code: 1062
19	150722 19:30:13 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2029-bin.000524' position 267965354
20
21	/* this is me executing default_master_connection='s7'; SQL_SLAVE_SKIP_COUNTER=1; */
22	150722 19:32:34 [Note] Master 's7': Slave SQL thread initialized, starting replication in log 'db2029-bin.000524' at position 267965354, relay log './dbstore2001-relay-bin-s7.000098' position: 267965642
23	150722 19:32:34 [Note] Master 's7': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./dbstore2001-relay-bin-s7.000098', relay_log_pos='267965642', master_log_name='db2029-bin.000524', master_log_pos='267965354' and new position at relay_log_file='./dbstore2001-relay-bin-s7.000098', relay_log_pos='268555717', master_log_name='db2029-bin.000524', master_log_pos='268555429'
24
25	/* Works for almost 3 hours, then it fails again */
26	150722 22:18:33 [ERROR] Master 's7': Slave SQL: Error 'Duplicate entry 'interwiki-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa')', Gtid 0-101622-888773852, Internal MariaDB error code: 1062
27	150722 22:18:33 [Warning] Master 's7': Slave: Duplicate entry 'interwiki-fa' for key 'site_ids_type' Error_code: 1062
28	150722 22:18:33 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2029-bin.000524' position 576743119
29
30	Entries from the relay log (dbstore2001-relay-bin-s7.000098)
31	============================================================
32	/* The following block is without --database="hewiki" and skipping to 267965642 */
33
34	# at 267965642
35	#150722 19:07:50 server id 101622 end_log_pos 267965392 GTID 0-101622-888355486
36	/!100001 SET @@session.gtid_domain_id=0//!/;
37	/!100001 SET @@session.server_id=101622//!/;
38	/!100001 SET @@session.gtid_seq_no=888355486//!/;
39	BEGIN
40	/!/;
41	# at 267965680
42	#150722 19:07:49 server id 101622 end_log_pos 267965916 Query thread_id=775736833 exec_time=1 error_code=0
43	use `hewiki`/!/;
44	SET TIMESTAMP=1437592069/!/;
45	SET @@session.pseudo_thread_id=775736833/!/;
46	SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/!/;
47	SET @@session.sql_mode=0/!/;
48	SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/!/;
49	/!\C binary //!/;
50	SET @@session.character_set_client=63,@@session.collation_connection=63,@@session.collation_server=63/!/;
51	SET @@session.lc_time_names=0/!/;
52	SET @@session.collation_database=DEFAULT/!/;
53
54	/* now the real changes */
55
56	# at 267965642
57	#150722 19:07:50 server id 101622 end_log_pos 267965392 GTID 0-101622-888355486
58	/!100001 SET @@session.gtid_seq_no=888355486//!/;
59	BEGIN
60	/!/;
61	# at 267965680
62	#150722 19:07:49 server id 101622 end_log_pos 267965916 Query thread_id=775736833 exec_time=1 error_code=0
63	SET TIMESTAMP=1437592069/!/;
64	UPDATE /* DBSiteStore::saveSites */ `sites` SET site_id = '1',site_global_key = 'aawiki',site_type = 'mediawiki',site_group = 'wikipedia',site_source = 'local',site_language = 'aa',site_protocol = 'https',site_domain = 'gro.aidepikiw.aa.',site_data = 'a:1:{s:5:\"paths\";a:2:{s:9:\"file_path\";s:29:\"https://aa.wikipedia.org/w/$1\";s:9:\"page_path\";s:32:\"https://aa.wikipedia.org/wiki/$1\";}}',site_forward = '',site_config = 'a:0:{}' WHERE site_id = '1'
65	/!/;
66	.
67	.
68	.
69	/* MORE UPDATES */
70	.
71	.
72	.
73	# at 268443255
74	#150722 19:07:50 server id 101622 end_log_pos 268448349 Query thread_id=775736833 exec_time=1 error_code=0
75	SET TIMESTAMP=1437592070/!/;
76	DELETE /* DBSiteStore::saveSites */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','783','40','41','42','43','44','45','46','47','48','49','50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120','121','122','123','124','125','126','127','128','129','130','131','132','133','134','135','136','137','138','139','140','141','142','143','144','145','146','147','148','149','150','151','152','153','154','155','156','157','158','159','160','161','162','163','164','165','166','167','168','169','170','791','171','172','173','174','175','176','177','178','179','180','181','182','183','184','185','186','187','188','189','190','191','192','784','193','194','195','196','197','198','199','200','201','202','203','204','205','206','207','208','884','209','210','211','212','213','214','215','216','217','218','219','220','221','222','223','224','225','226','227','228','229','230','231','232','233','234','235','236','237','238','239','240','241','242','243','244','245','246','247','248','249','250','251','252','253','894','254','255','256','257','258','259','260','261','262','263','264','265','266','267','268','269','270','271','272','792','273','274','275','276','277','278','279','280','281','282','283','284','285','286','287','288','289','290','291','292','293','294','295','296','297','298','299','300','301','302','303','304','305','306','307','308','309','310','311','312','313','314','315','316','317','318','319','320','321','322','323','324','325','326','327','328','329','330','331','332','333','334','335','336','337','338','339','340','341','342','343','344','345','346','347','348','349','350','351','352','353','354','355','356','357','358','359','360','361','362','363','364','365','366','367','368','369','370','371','372','373','785','374','375','376','377','378','379','380','381','382','383','384','385','386','387','388','389','390','391','392','393','394','395','396','397','398','399','400','401','402','403','404','405','406','407','408','409','410','411','412','413','414','415','416','417','418','419','892','420','421','422','423','424','425','426','427','428','888','429','430','431','432','433','434','435','436','437','438','439','786','440','441','442','443','444','445','446','447','448','449','450','451','452','453','454','455','456','457','458','459','460','461','462','463','464','465','466','467','468','469','470','471','472','473','474','475','476','477','478','479','480','481','482','483','484','485','486','487','488','489','490','491','492','493','494','495','496','497','498','499','500','501','502','503','504','505','506','507','508','509','510','511','512','513','514','515','516','517','887','518','519','520','521','522','523','524','525','526','527','528','529','530','531','532','533','534','535','536','787','537','538','539','540','541','542','543','544','545','546','547','548','549','550','788','551','552','553','554','555','556','557','558','559','560','561','562','563','564','565','566','789','567','568','569','570','571','572','573','574','575','576','577','578','579','580','581','582','583','790','584','585','586','587','588','589','590','591','592','593','594','595','596','597','598','599','600','601','602','603','604','605','606','607','608','609','610','611','612','613','614','615','616','617','618','619','620','621','622','623','624','625','626','627','628','629','630','631','632','633','634','635','636','637','638','639','640','641','642','643','644','645','646','647','648','649','650','651','652','653','654','655','656','657','658','659','660','661','662','663','664','665','666','667','668','669','670','671','672','673','674','675','676','677','678','679','680','681','682','683','684','685','686','687','688','689','690','691','692','693','694','695','696','697','698','699','700','701','702','703','704','705','706','707','708','709','796','710','711','712','713','714','715','716','717','718','719','720','793','721','722','723','724','725','726','727','728','729','730','794','731','732','733','734','735','736','737','797','738','739','740','741','742','743','744','745','746','747','748','749','750','751','752','753','754','755','756','757','758','759','760','761','762','763','764','765','766','767','768','769','770','771','772','880','773','774','775','776','777','778','779','780','781','782','798','799','800','801','802','803','804','805','806','807','808','809','810','885','811','812','813','891','814','815','816','817','818','819','820','821','822','823','824','825','826','827','828','886','881','829','830','831','832','833','834','835','836','837','838','839','840','841','842','843','844','845','846','847','848','849','850','851','852','853','854','855','856','857','858','859','860','861','862','863','864','865','867','868','795','869','870','871','872','873','874','875','876','877','878','882','893','879','883')
77	/!/;
78	# at 268448637
79	#150722 19:07:50 server id 101622 end_log_pos 268448531 Query thread_id=775736833 exec_time=1 error_code=0
80	SET TIMESTAMP=1437592070/!/;
81	INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('1','equivalent','aa')
82	.
83	.
84	.
85	/* MORE INSERTS */
86	.
87	.
88	.
89	#150722 19:07:50 server id 101622 end_log_pos 268473862 Query thread_id=775736833 exec_time=1 error_code=0
90	SET TIMESTAMP=1437592070/!/;
91	INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')
92	/!/;
93	.
94	.
95	.
96	/* it does not fail once, but twice */
97	.
98	.
99	.
100	# at 577221025
101	#150722 22:18:31 server id 101622 end_log_pos 577226125 Query thread_id=776764027 exec_time=1 error_code=0
102	SET TIMESTAMP=1437603511/!/;
103	DELETE /* DBSiteStore::saveSites */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23'...,'780',...,'883')
104	.
105	.
106	.
107	# at 577333283
108	#150722 22:18:32 server id 101622 end_log_pos 577333178 Query thread_id=776764027 exec_time=0 error_code=0
109	SET TIMESTAMP=1437603512/!/;
110	INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('780','interwiki','zu')
111	/!/;
112
113	Table structure
114	===============
115
116	CREATE TABLE `site_identifiers` (
117	`si_site` int(10) unsigned NOT NULL,
118	`si_type` varbinary(32) NOT NULL,
119	`si_key` varbinary(32) NOT NULL,
120	UNIQUE KEY `site_ids_type` (`si_type`,`si_key`),
121	KEY `site_ids_site` (`si_site`),
122	KEY `site_ids_key` (`si_key`)
123	) ENGINE=InnoDB DEFAULT CHARSET=binary;
124

@Springle Did you do any out-of-band changes to the table after the second replication error? (I do not care if you did, *I did out of band changes to correct the first issue*). Because if you confirm me that you didn't, I can confirm that the DELETE was not executed or it had no effect despite being on the relay log. Current status:

SELECT count(*) FROM `site_identifiers` WHERE si_site IN ('1','2','3','4',...,'879','883') and si_site > '203';
+----------+
| count(*) |
+----------+
|      440 |
+----------+
1 row in set (0.00 sec)

I think we can discard the population issues, at least for this error. Random things that should work but that I throw here as potential offenders:

filtering (because it happened in the past)
multisource (because lol)
strings literals used for in fields (not a reason)
Package size/max number of "in" items (again, something that should fail, not ignore things silently)

• jcrespo added a parent task: T105135: Implement mariadb 10.0 masters.Jul 23 2015, 10:48 AM

@jcrespo, no, I did not out-of-band change or use skip counter. I found the machine exactly as you described on IRC, and only did research by dumping logs to get the query examples shown in ticket description.

Good point, I had forgotten T105879. The last working theory we had for db1022 on IRC was related to replication starting before mysql_upgrade back on June 29, or something else dto o with the skip counter usage on that day. If we could disprove those theories and definitively link these two issues...

It's strange that this is happening suddenly despite 10.0.16 packages running for months. If there is a replication bug, multi-source is most suspicious ...and the checksum tests become very important priority.

• jcrespo triaged this task as High priority.Jul 24 2015, 6:20 AM

• jcrespo moved this task from Triage to In progress on the DBA board.

Tests I would do:

What happens if we do a CHANGE MASTER to the most recent DELETE and START REPLICATION UNTIL the following event id? We could even activate the binary log/general log to check it. But we need to do that soon before the binlogs on the master disappear.
Can we try again with full disabled parallel replication?
What are all differences in configuration between these 2 hosts an the others? (pt-config-diff or manually)
Can we try again with: GTID, filtering disabled, innodb-table-based replication control, ROW replication (that last is problematic, as the master executed the events long time ago).

It's strange that this is happening suddenly despite 10.0.16 packages running for months.

I agree, I restart these 3 machines not a long time ago and then start having issues? My main thesis is conf-related. Replication issues with these hosts started when you restarted these 2 for maintenance the last time, too. Its master was also recently restarted, I think.

checksum tests become very important priority

I couldn't agree more.

Reloaded dbstore2001 s7 from db2040.

Importing now dbstore2002, all connections on the server will have lag while this happens.

Import on dbstore2002 finished. We have a 3-day lag on dbstore2002, but I prefer that than performing another manual import because it will be less error-prone and this way it will validate both its correctness and performance.

Things that were not solved on this ticket that will be tracked separately:

Drift detection: checked by doing checksums T104459
Drift solution: done by migrating to ROW based replication T109179
TokuDB issues: T109069
Filtering issues: not a concern right now
Multisource replication issues: not a concern right now

• jcrespo closed this task as Resolved.Aug 15 2015, 2:49 PM

mariadb multi-source replication glitch with site_identifiersClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

mariadb multi-source replication glitch with site_identifiers
Closed, ResolvedPublic
Actions

Related Objects
Search...