Both dbstore200[12] boxes in codfw saw replication for s7 stop with a duplicate key error:
Error 'Duplicate entry 'interwiki-fa' for key 'site_ids_type'' on query. Default database: 'hewiki'. Query: 'INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa')'
It is a genuine problem. The table is InnoDB and the key does indeed exist, so the engine is justified in throwing the error. However, none of the other InnoDB slaves were affected, nor were the dbstore100[12] boxes in eqiad, which also use multi-source replication. We have seen similar issues before, but with TokuDB bugs, not InnoDB. Misdiagnosis previously? Else, wtf?
The DBSiteStore::saveSites job does a large DELETE first:
DELETE /* DBSiteStore::saveSites */ FROM `site_identifiers` WHERE si_site IN ('1','2','3','4','5','6','7','8','9', ... '874','875','876','877','878','882','890','879','883');
... then INSERTs records one-by-one:
INSERT /* DBSiteStore::saveSites */ INTO `site_identifiers` (si_site,si_type,si_key) VALUES ('203','interwiki','fa');
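A rough way to see what the DELETE left behind is to compare the rows on an affected slave against a healthy s7 slave. This is only a sketch: it assumes the unique key `site_ids_type` covers (si_type, si_key), which is inferred from the 'interwiki-fa' entry in the error rather than checked against the schema.

```sql
-- Run on a dbstore200[12] box and on an unaffected s7 slave, then compare.

-- Which row currently holds the key the INSERT collided with?
-- If its si_site is not in the DELETE's IN (...) list, the DELETE
-- would never have removed it.
SELECT si_site, si_type, si_key
FROM hewiki.site_identifiers
WHERE si_type = 'interwiki'
  AND si_key = 'fa';

-- Did anything survive for the site that was being re-inserted?
SELECT si_site, si_type, si_key
FROM hewiki.site_identifiers
WHERE si_site = 203;
```

If the first query returns a row on the affected slaves only, the next question is which si_site owns it and whether that id appears in the DELETE.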
Theories/questions:
- Somehow the DELETE failed or was skipped? Yet only specific records were affected, not all ~800, so the transaction must have executed. Single-threaded slaves were ok too, so maybe multi-source related? But then why were dbstore100[12] not affected? Check through the list: https://mariadb.atlassian.net/browse/MDEV-253?jql=text%20~%20%22multi-source%22
- dbstore200[12] are 10.0 multi-source slaves that also replicate from a 10.0 master (dbstore100[12] are 10.0 but replicate from eqiad 5.5 masters). Interestingly, the labsdb100x boxes are the only other 10.0 multi-source slaves with a 10.0 master (db1069), and they have seen replication issues that were previously put down to TokuDB. Not so?
- dbstore s7 has shown strange problems recently (T104471) where data simply stopped replicating, though the SQL/IO threads continued to run without error. That seemed related to the replication rules on s7, which are "special" and include centralauth.%. Coincidence?
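To check the multi-source and replication-rule theories, the per-connection state and filters can be read straight off the slave. A minimal sketch follows; the connection name 's7' is an assumption, so substitute whatever name the dbstore boxes actually use.

```sql
-- MariaDB 10.0 multi-source: one row per replication connection.
-- Compare Connection_name, Slave_IO_Running, Slave_SQL_Running,
-- Last_SQL_Errno and Last_SQL_Error across connections.
SHOW ALL SLAVES STATUS;

-- Status for a single named connection ('s7' is an assumed name).
-- Replicate_Wild_Do_Table shows the filter rules in effect for it,
-- which is where the centralauth.% rule should appear.
SHOW SLAVE 's7' STATUS;
```

Running the same two statements on dbstore100[12] would show whether the filter rules or connection setup differ between the eqiad and codfw boxes.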