Page MenuHomePhabricator

mariadb multi-source replication glitch with site_identifiers
Closed, ResolvedPublic

Description

Both dbstore200[12] boxes in codfw saw replication for s7 stop with duplicate key error:

Error 'Duplicate entry 'interwiki-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites  */  INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa')'

It is genuine problem. Table is InnoDB and key does indeed exist, so engine is justified in throwing the error. However, none of the other InnoDB slaves were affected, nor the dbstore100[12] boxes in eqiad, which also use multi-source replication. We have seen similar issues before, but with TokuDB bugs, not InnoDB. Misdiagnosis previously? Else, wtf?

DBSiteStore::saveSites job does a large DELETE first:

DELETE /* DBSiteStore::saveSites  */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9', ... '874','875','876','877','878','882','890','879','883');

... then INSERTs records one-by-one:

INSERT /* DBSiteStore::saveSites  */  INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa');

Theories/questions:

  • dbstore200[12] are 10.0 multi-source slaves that also replicate from 10.0 master (dbstore100[12] are 10.0 but replicate from eqiad 5.5 masters). Interestingly, labsdb100x boxes are the only other 10.0 multi-source slaves with 10.0 master db1069, and they have seen replication issues that were previously put down to TokuDB. Not so?
  • dbstore s7 has showed strange problems recently (T104471) where data simply stopped replicating, though SQL/IO threads continued to run without error. That seemed related to replication rules on s7 which are "special" and include centralauth.%. Coincidence?

Event Timeline

Springle claimed this task.
Springle raised the priority of this task from to Needs Triage.
Springle updated the task description. (Show Details)
Springle added projects: DBA, acl*sre-team.

dbstore s7 has showed strange problems recently (T104471)

That was not a "strange problem", that was a misconfiguration error due to wrong filtering.

Also, dbstore100[12] issues where different, where a stop slave; start slave; fixed replication because false duplicate key errors.

Possibilities:

  • If the delete happened at the time of the previous import, it could be just a re-population error
  • If the delete happened just before, maybe there is a b) another misconfiguration (replication filtering) b) a bug (default parallel replication)/upgrade error.

Are we sure this only happens on multi-source hosts? I would say T105879 is a better related error (duplicate key) on a host that was recently updated/restarted.

1Actual error found by icinga
2============================
3Error 'Duplicate entry 'equivalent-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')'
4
5
6Error log
7=========
8/* original error */
9150722 19:07:51 [ERROR] Master 's7': Slave SQL: Error 'Duplicate entry 'equivalent-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')', Gtid 0-101622-888355486, Internal MariaDB error code: 1062
10150722 19:07:51 [Warning] Master 's7': Slave: Duplicate entry 'equivalent-fa' for key 'site_ids_type' Error_code: 1062
11150722 19:07:51 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2029-bin.000524' position 267965354
12
13/* This is me trying a START SLAVE, as that sometimes fixed the problem with tokudb tables (supposedly because a bug) */
14150722 19:30:13 [Note] Master 's7': Slave SQL thread initialized, starting replication in log 'db2029-bin.000524' at position 267965354, relay log './dbstore2001-relay-bin-s7.000098' position: 267965642
15
16/* but we get the same result */
17150722 19:30:13 [ERROR] Master 's7': Slave SQL: Error 'Duplicate entry 'equivalent-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')', Gtid 0-101622-888355486, Internal MariaDB error code: 1062
18150722 19:30:13 [Warning] Master 's7': Slave: Duplicate entry 'equivalent-fa' for key 'site_ids_type' Error_code: 1062
19150722 19:30:13 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2029-bin.000524' position 267965354
20
21/* this is me executing default_master_connection='s7'; SQL_SLAVE_SKIP_COUNTER=1; */
22150722 19:32:34 [Note] Master 's7': Slave SQL thread initialized, starting replication in log 'db2029-bin.000524' at position 267965354, relay log './dbstore2001-relay-bin-s7.000098' position: 267965642
23150722 19:32:34 [Note] Master 's7': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./dbstore2001-relay-bin-s7.000098', relay_log_pos='267965642', master_log_name='db2029-bin.000524', master_log_pos='267965354' and new position at relay_log_file='./dbstore2001-relay-bin-s7.000098', relay_log_pos='268555717', master_log_name='db2029-bin.000524', master_log_pos='268555429'
24
25/* Works for almost 3 hours, then it fails again */
26150722 22:18:33 [ERROR] Master 's7': Slave SQL: Error 'Duplicate entry 'interwiki-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa')', Gtid 0-101622-888773852, Internal MariaDB error code: 1062
27150722 22:18:33 [Warning] Master 's7': Slave: Duplicate entry 'interwiki-fa' for key 'site_ids_type' Error_code: 1062
28150722 22:18:33 [ERROR] Master 's7': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db2029-bin.000524' position 576743119
29
30Entries from the relay log (dbstore2001-relay-bin-s7.000098)
31============================================================
32/* The following block is without --database="hewiki" and skipping to 267965642 */
33
34# at 267965642
35#150722 19:07:50 server id 101622 end_log_pos 267965392 GTID 0-101622-888355486
36/*!100001 SET @@session.gtid_domain_id=0*//*!*/;
37/*!100001 SET @@session.server_id=101622*//*!*/;
38/*!100001 SET @@session.gtid_seq_no=888355486*//*!*/;
39BEGIN
40/*!*/;
41# at 267965680
42#150722 19:07:49 server id 101622 end_log_pos 267965916 Query thread_id=775736833 exec_time=1 error_code=0
43use `hewiki`/*!*/;
44SET TIMESTAMP=1437592069/*!*/;
45SET @@session.pseudo_thread_id=775736833/*!*/;
46SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
47SET @@session.sql_mode=0/*!*/;
48SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
49/*!\C binary *//*!*/;
50SET @@session.character_set_client=63,@@session.collation_connection=63,@@session.collation_server=63/*!*/;
51SET @@session.lc_time_names=0/*!*/;
52SET @@session.collation_database=DEFAULT/*!*/;
53
54/* now the real changes */
55
56# at 267965642
57#150722 19:07:50 server id 101622 end_log_pos 267965392 GTID 0-101622-888355486
58/*!100001 SET @@session.gtid_seq_no=888355486*//*!*/;
59BEGIN
60/*!*/;
61# at 267965680
62#150722 19:07:49 server id 101622 end_log_pos 267965916 Query thread_id=775736833 exec_time=1 error_code=0
63SET TIMESTAMP=1437592069/*!*/;
64UPDATE /* DBSiteStore::saveSites */ `sites` SET site_id = '1',site_global_key = 'aawiki',site_type = 'mediawiki',site_group = 'wikipedia',site_source = 'local',site_language = 'aa',site_protocol = 'https',site_domain = 'gro.aidepikiw.aa.',site_data = 'a:1:{s:5:\"paths\";a:2:{s:9:\"file_path\";s:29:\"https://aa.wikipedia.org/w/$1\";s:9:\"page_path\";s:32:\"https://aa.wikipedia.org/wiki/$1\";}}',site_forward = '',site_config = 'a:0:{}' WHERE site_id = '1'
65/*!*/;
66.
67.
68.
69/* MORE UPDATES */
70.
71.
72.
73# at 268443255
74#150722 19:07:50 server id 101622 end_log_pos 268448349 Query thread_id=775736833 exec_time=1 error_code=0
75SET TIMESTAMP=1437592070/*!*/;
76DELETE /* DBSiteStore::saveSites */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','783','40','41','42','43','44','45','46','47','48','49','50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','101','102','103','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120','121','122','123','124','125','126','127','128','129','130','131','132','133','134','135','136','137','138','139','140','141','142','143','144','145','146','147','148','149','150','151','152','153','154','155','156','157','158','159','160','161','162','163','164','165','166','167','168','169','170','791','171','172','173','174','175','176','177','178','179','180','181','182','183','184','185','186','187','188','189','190','191','192','784','193','194','195','196','197','198','199','200','201','202','203','204','205','206','207','208','884','209','210','211','212','213','214','215','216','217','218','219','220','221','222','223','224','225','226','227','228','229','230','231','232','233','234','235','236','237','238','239','240','241','242','243','244','245','246','247','248','249','250','251','252','253','894','254','255','256','257','258','259','260','261','262','263','264','265','266','267','268','269','270','271','272','792','273','274','275','276','277','278','279','280','281','282','283','284','285','286','287','288','289','290','291','292','293','294','295','296','297','298','299','300','301','302','303','304','305','306','307','308','309','310','311','312','313','314','315','316','317','318','319','320','321','322','323','324','325','326','327','328','329','330','331','332','333','334','335','336','337','338','339','340','341','342','343','344','345','346','347','348','349','350','351','352','353','354','355','356','357','358','359','360','361','362','363','364','365','366','367','368','369','370','371','372','373','785','374','375','376','377','378','379','380','381','382','383','384','385','386','387','388','389','390','391','392','393','394','395','396','397','398','399','400','401','402','403','404','405','406','407','408','409','410','411','412','413','414','415','416','417','418','419','892','420','421','422','423','424','425','426','427','428','888','429','430','431','432','433','434','435','436','437','438','439','786','440','441','442','443','444','445','446','447','448','449','450','451','452','453','454','455','456','457','458','459','460','461','462','463','464','465','466','467','468','469','470','471','472','473','474','475','476','477','478','479','480','481','482','483','484','485','486','487','488','489','490','491','492','493','494','495','496','497','498','499','500','501','502','503','504','505','506','507','508','509','510','511','512','513','514','515','516','517','887','518','519','520','521','522','523','524','525','526','527','528','529','530','531','532','533','534','535','536','787','537','538','539','540','541','542','543','544','545','546','547','548','549','550','788','551','552','553','554','555','556','557','558','559','560','561','562','563','564','565','566','789','567','568','569','570','571','572','573','574','575','576','577','578','579','580','581','582','583','790','584','585','586','587','588','589','590','591','592','593','594','595','596','597','598','599','600','601','602','603','604','605','606','607','608','609','610','611','612','613','614','615','616','617','618','619','620','621','622','623','624','625','626','627','628','629','630','631','632','633','634','635','636','637','638','639','640','641','642','643','644','645','646','647','648','649','650','651','652','653','654','655','656','657','658','659','660','661','662','663','664','665','666','667','668','669','670','671','672','673','674','675','676','677','678','679','680','681','682','683','684','685','686','687','688','689','690','691','692','693','694','695','696','697','698','699','700','701','702','703','704','705','706','707','708','709','796','710','711','712','713','714','715','716','717','718','719','720','793','721','722','723','724','725','726','727','728','729','730','794','731','732','733','734','735','736','737','797','738','739','740','741','742','743','744','745','746','747','748','749','750','751','752','753','754','755','756','757','758','759','760','761','762','763','764','765','766','767','768','769','770','771','772','880','773','774','775','776','777','778','779','780','781','782','798','799','800','801','802','803','804','805','806','807','808','809','810','885','811','812','813','891','814','815','816','817','818','819','820','821','822','823','824','825','826','827','828','886','881','829','830','831','832','833','834','835','836','837','838','839','840','841','842','843','844','845','846','847','848','849','850','851','852','853','854','855','856','857','858','859','860','861','862','863','864','865','867','868','795','869','870','871','872','873','874','875','876','877','878','882','893','879','883')
77/*!*/;
78# at 268448637
79#150722 19:07:50 server id 101622 end_log_pos 268448531 Query thread_id=775736833 exec_time=1 error_code=0
80SET TIMESTAMP=1437592070/*!*/;
81INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('1','equivalent','aa')
82.
83.
84.
85/* MORE INSERTS */
86.
87.
88.
89#150722 19:07:50 server id 101622 end_log_pos 268473862 Query thread_id=775736833 exec_time=1 error_code=0
90SET TIMESTAMP=1437592070/*!*/;
91INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','equivalent','fa')
92/*!*/;
93.
94.
95.
96/* it does not fail once, but twice */
97.
98.
99.
100# at 577221025
101#150722 22:18:31 server id 101622 end_log_pos 577226125 Query thread_id=776764027 exec_time=1 error_code=0
102SET TIMESTAMP=1437603511/*!*/;
103DELETE /* DBSiteStore::saveSites */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23'...,'780',...,'883')
104.
105.
106.
107# at 577333283
108#150722 22:18:32 server id 101622 end_log_pos 577333178 Query thread_id=776764027 exec_time=0 error_code=0
109SET TIMESTAMP=1437603512/*!*/;
110INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('780','interwiki','zu')
111/*!*/;
112
113Table structure
114===============
115
116CREATE TABLE `site_identifiers` (
117 `si_site` int(10) unsigned NOT NULL,
118 `si_type` varbinary(32) NOT NULL,
119 `si_key` varbinary(32) NOT NULL,
120 UNIQUE KEY `site_ids_type` (`si_type`,`si_key`),
121 KEY `site_ids_site` (`si_site`),
122 KEY `site_ids_key` (`si_key`)
123) ENGINE=InnoDB DEFAULT CHARSET=binary;
124

@Springle Did you do any out-of-band changes to the table after the second replication error? (I do not care if you did, *I did out of band changes to correct the first issue*). Because if you confirm me that you didn't, I can confirm that the DELETE was not executed or it had no effect despite being on the relay log. Current status:

SELECT count(*) FROM `site_identifiers` WHERE si_site IN ('1','2','3','4',...,'879','883') and si_site > '203';
+----------+
| count(*) |
+----------+
|      440 |
+----------+
1 row in set (0.00 sec)

I think we can discard the population issues, at least for this error. Random things that should work but that I throw here as potential offenders:

  • filtering (because it happened in the past)
  • multisource (because lol)
  • strings literals used for in fields (not a reason)
  • Package size/max number of "in" items (again, something that should fail, not ignore things silently)

@jcrespo, no, I did not out-of-band change or use skip counter. I found the machine exactly as you described on IRC, and only did research by dumping logs to get the query examples shown in ticket description.

Good point, I had forgotten T105879. The last working theory we had for db1022 on IRC was related to replication starting before mysql_upgrade back on June 29, or something else dto o with the skip counter usage on that day. If we could disprove those theories and definitively link these two issues...

It's strange that this is happening suddenly despite 10.0.16 packages running for months. If there is a replication bug, multi-source is most suspicious ...and the checksum tests become very important priority.

jcrespo moved this task from Triage to In progress on the DBA board.

Tests I would do:

  • What happens if we do a CHANGE MASTER to the most recent DELETE and START REPLICATION UNTIL the following event id? We could even activate the binary log/general log to check it. But we need to do that soon before the binlogs on the master disappear.
  • Can we try again with full disabled parallel replication?
  • What are all differences in configuration between these 2 hosts an the others? (pt-config-diff or manually)
  • Can we try again with: GTID, filtering disabled, innodb-table-based replication control, ROW replication (that last is problematic, as the master executed the events long time ago).

It's strange that this is happening suddenly despite 10.0.16 packages running for months.

I agree, I restart these 3 machines not a long time ago and then start having issues? My main thesis is conf-related. Replication issues with these hosts started when you restarted these 2 for maintenance the last time, too. Its master was also recently restarted, I think.

checksum tests become very important priority

I couldn't agree more.

Reloaded dbstore2001 s7 from db2040.

Importing now dbstore2002, all connections on the server will have lag while this happens.

Import on dbstore2002 finished. We have a 3-day lag on dbstore2002, but I prefer that than performing another manual import because it will be less error-prone and this way it will validate both its correctness and performance.

Things that were not solved on this ticket that will be tracked separately:

  • Drift detection: checked by doing checksums T104459
  • Drift solution: done by migrating to ROW based replication T109179
  • TokuDB issues: T109069
  • Filtering issues: not a concern right now
  • Multisource replication issues: not a concern right now