m1 slaves all broke replication due to bacula.DelCandidates temporary table
Closed, Resolved, Public

Description

We need to investigate why that happened. It is not surprising, since temporary tables cause many problems with replication, but there may be an underlying cause (a crash, configuration, etc.).

To mitigate the issue, note that the temporary table is only used for writes, in a loop of:

CREATE TEMPORARY TABLE DelCandidates (JobId INTEGER UNSIGNED NOT NULL, PurgedFiles TINYINT, FileSetId INTEGER UNSIGNED, JobFiles INTEGER UNSIGNED, JobStatus BINARY(1))
CREATE INDEX DelInx1 ON DelCandidates (JobId)
INSERT INTO DelCandidates SELECT JobId,PurgedFiles,FileSetId,JobFiles,JobStatus FROM Job  JOIN Client USING (ClientId)  JOIN Pool ON (Job.PoolId = Pool.PoolId)  WHERE Type IN ('B', 'C', 'M', 'V',  'D', 'R', 'c', 'm', 'g')   AND JobTDate < 1482537599  AND Client.Name = 'bromine.eqiad.wmnet-fd'  AND Pool.Name = 'production'
DROP TEMPORARY TABLE IF EXISTS `bacula`.`DelCandidates`

I have set a replication filter that ignores the table on the 2 slaves. This has not been puppetized, because the filter should go away once the slaves catch up.
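For reference, a minimal sketch of how such a temporary filter might be set and later removed at runtime. This assumes a MariaDB slave, where replicate_ignore_table is a dynamic variable that can only be changed while the slave threads are stopped. The task does not record the exact commands that were used, so treat this as an illustration only.

```sql
-- Assumption: MariaDB slave; run on each affected slave.
STOP SLAVE;
-- Skip replicated statements that touch bacula.DelCandidates.
SET GLOBAL replicate_ignore_table = 'bacula.DelCandidates';
START SLAVE;

-- Later, once the slave has caught up, remove the filter again.
STOP SLAVE;
SET GLOBAL replicate_ignore_table = '';
START SLAVE;
```

Because the variable is set only at runtime and not in the configuration file, the filter does not survive a server restart, which fits a mitigation that is meant to be temporary.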

Maybe we need to transition to ROW-based replication?
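As a rough sketch of what that transition could look like on the master: with row-based logging, changes to temporary tables are generally not written to the binary log, so the slaves would not need to recreate DelCandidates at all. The statements below assume binary logging is already enabled and that the account has the privileges needed to change global variables. Whether ROW format suits the other databases on m1 is not settled in this task.

```sql
-- Run on the master. Check the current format first.
SELECT @@GLOBAL.binlog_format;

-- Switch new connections to row-based logging.
-- Existing sessions keep their current format until they reconnect.
SET GLOBAL binlog_format = 'ROW';

-- To persist across restarts, binlog_format = ROW would also have to be
-- set in the [mysqld] section of the server configuration file.
```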

We probably need to run pt-table-checksum (we do not care about lag on the slaves here), rebuild the slaves from scratch, or both.
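If pt-table-checksum is run, the results could be inspected on each slave with a query along these lines. It assumes the tool wrote to its default percona.checksums table; a different --replicate table would need to be substituted.

```sql
-- Run on each slave after pt-table-checksum has finished on the master.
-- Lists the tables that have at least one chunk differing from the master.
SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
FROM percona.checksums
WHERE master_cnt <> this_cnt
   OR master_crc <> this_crc
   OR ISNULL(master_crc) <> ISNULL(this_crc)
GROUP BY db, tbl;
```

An empty result suggests the slave matches the master for the tables that were checked. Any rows returned would point to tables that need to be resynced or, if there are many, to a rebuild of the slave.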

We also have to enable the replication check as non-critical (this is the most important part). The slaves do not need to page because they do not affect users, but their status should be shown on Icinga.

Event Timeline

jcrespo moved this task from Triage to In progress on the DBA board.

db1001 has been repooled without the replication filter again. I will leave it like that for a while, and if replication is stable, I will remove the filter from the other slave, too.

No replication filters anywhere, things look good.