Page MenuHomePhabricator

Civicrm mariadb replication breaks on foreign key constraint
Closed, ResolvedPublic

Description

Yesterday and today Mariadb replication failed on a foreign key constraint error, on a query that appears to be a user-triggered Civicrm dedupe operation.

Query:

UPDATE civicrm_mailing_recipients SET contact_id = 1594978 WHERE contact_id = 55695209

Error:

Could not execute Update_rows_v1 event on table civicrm.civicrm_mailing_recipients; Cannot add or update a child row: a foreign key constraint fails (`civicrm`.`civicrm_mailing_recipients`, CONSTRAINT `FK_civicrm_mailing_recipients_email_id` FOREIGN KEY (`email_id`) REFERENCES `civicrm_email` (`id`) ON DELETE SET NULL), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log frdb1004-bin.010258, end_log_pos 9142698

Schema:

CREATE TABLE `civicrm_mailing_recipients` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `mailing_id` int(10) unsigned NOT NULL COMMENT 'The ID of the mailing this Job will send.',
  `contact_id` int(10) unsigned NOT NULL COMMENT 'FK to Contact',
  `email_id` int(10) unsigned DEFAULT NULL,
  `phone_id` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK_civicrm_mailing_recipients_mailing_id` (`mailing_id`),
  KEY `FK_civicrm_mailing_recipients_contact_id` (`contact_id`),
  KEY `FK_civicrm_mailing_recipients_email_id` (`email_id`),
  KEY `FK_civicrm_mailing_recipients_phone_id` (`phone_id`),
  CONSTRAINT `FK_civicrm_mailing_recipients_contact_id` FOREIGN KEY (`contact_id`) REFERENCES `civicrm_contact` (`id`) ON DELETE CASCADE,
  CONSTRAINT `FK_civicrm_mailing_recipients_email_id` FOREIGN KEY (`email_id`) REFERENCES `civicrm_email` (`id`) ON DELETE SET NULL,
  CONSTRAINT `FK_civicrm_mailing_recipients_mailing_id` FOREIGN KEY (`mailing_id`) REFERENCES `civicrm_mailing` (`id`) ON DELETE CASCADE,
  CONSTRAINT `FK_civicrm_mailing_recipients_phone_id` FOREIGN KEY (`phone_id`) REFERENCES `civicrm_phone` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=41576357 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

CREATE TABLE `civicrm_email` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Unique Email ID',
  `contact_id` int(10) unsigned DEFAULT NULL,
  `location_type_id` int(10) unsigned DEFAULT NULL,
  `email` varchar(254) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `is_primary` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'Is this the primary email.',
  `is_billing` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'Is this the billing email.',
  `on_hold` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'Is this address on bounce hold?',
  `is_bulkmail` tinyint(4) NOT NULL DEFAULT 0 COMMENT 'Is this address for bulk mail ?',
  `hold_date` datetime DEFAULT NULL COMMENT 'When the address went on bounce hold',
  `reset_date` datetime DEFAULT NULL COMMENT 'When the address bounce status was last reset',
  `signature_text` mediumtext COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT 'Text formatted signature for the email.',
  `signature_html` mediumtext COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT 'HTML formatted signature for the email.',
  PRIMARY KEY (`id`),
  KEY `UI_email` (`email`),
  KEY `index_location_type` (`location_type_id`),
  KEY `index_is_primary` (`is_primary`),
  KEY `index_is_billing` (`is_billing`),
  KEY `FK_civicrm_email_contact_id` (`contact_id`),
  CONSTRAINT `FK_civicrm_email_contact_id` FOREIGN KEY (`contact_id`) REFERENCES `civicrm_contact` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=56260452 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Notes:

  • Mariadb versions at the time were origin 10.4.24, replicas 10.4.24, 10.4.25 and 10.5.16.
  • All replicas were affected.
  • Foreign key checks are on globally, all tables are innodb.
  • The relevant before-query rows from both tables in frdev1001:/home/jgreen/20220701-foreign_key_fail.txt.
  • The fix was to rerun the query manually on each replica server (no need to disable foreign_key_checks), skip the binlog entry, and restart replication.
  • We upgraded the origin server to 10.4.25 this AM before putting civicrm back online.

Event Timeline

Jgreen triaged this task as High priority.Jul 1 2022, 1:20 PM
Jgreen created this task.

My first question would be if the data is the same in the master than in the replicas? Have you executed a data check lately?
We use compare.py for that which is a tool developed by Jaime which is pretty light and can be run in production with no issues: https://github.com/wikimedia/operations-software-wmfmariadbpy/blob/master/wmfmariadbpy/cli_admin/compare.py

You can also probably enable semi-sync to prevent further issues in the future (after fixing the data drift issue)

Jgreen claimed this task.
Jgreen moved this task from Triage to Done on the fundraising-tech-ops board.

We run with replicas set to read-only, all replicas stopped at the same point in replication, and the queries ran cleanly and had the expected result on all replicas when run manually. So I don't think it's very likely that the issue was replica data corruption. This has not occured since upgrading the origin server to 10.4.25. Closing the task since the problem has stopped. We will reopen if it happens again.