db1022 duplicate key errors
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Springle
	Jul 15 2015, 1:06 PM

Description

db1022 (the jessie slave in s6) encountered a duplicate key error for replication:

150715  7:58:35 [ERROR] Slave SQL: Error 'Duplicate entry '116393999' for key 'rev_id'' on query. Default database: 'frwiki'. Query: 'INSERT /* Revision::insertOn Speculos */  INTO `revision` (rev_id,rev_page,rev_text_id,rev_comment,rev_minor_edit,rev_user,rev_user_text,rev_timestamp,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_model,rev_content_format) VALUES ('116393999','9284144','116332767','Résumé automatique : contenu remplacé par « {{suppression Immédiate|bac à sable et pb de droit d\'auteur éventuel}} ».','0','37895','Nohky','20150629083036','0','73','0','6sv0jpus8bhmaznoo8q4t3v8782b1ev',NULL,NULL)', Internal MariaDB error code: 1062
150715  7:58:35 [Warning] Slave: Duplicate entry '116393999' for key 'rev_id' Error_code: 1062
150715  7:58:35 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1023-bin.000644' position 779771134
150715  8:08:01 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./db1022-relay-bin.001931', relay_log_pos='779771422', master_log_name='db1023-bin.000644', master_log_pos='779771134' and new position at relay_log_file='./db1022-relay-bin.001931', relay_log_pos='779775490', master_log_name='db1023-bin.000644', master_log_pos='779775202'

The SQL_SLAVE_SKIP_COUNTER was me responding.

The revision record rev_id 116393999 did exist on db1022 but not an any other s6 slave (and so the insert succeeded elsewhere). The db1022 relay log and db1023 master binlog show this was a legitimate revision undelete:

# at 779771577
#150715  7:58:34 server id 101612  end_log_pos 779772139        Query   thread_id=893423917     exec_time=0     error_code=0
SET TIMESTAMP=1436947114/*!*/;
INSERT /* Revision::insertOn Speculos */  INTO `revision` (rev_id,rev_page,rev_text_id,rev_comment,rev_minor_edit,rev_user,rev_user_text,rev_timestamp,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_model,rev_content_format) VALUES ('116393999','9284144','116332767','Résumé automatique : contenu remplacé par « {{suppression Immédiate|bac à sable et pb de droit d\'auteur éventuel}} ».','0','37895','Nohky','20150629083036','0','73','0','6sv0jpus8bhmaznoo8q4t3v8782b1ev',NULL,NULL)
/*!*/;
# at 779772139
#150715  7:58:34 server id 101612  end_log_pos 779772368        Query   thread_id=893423917     exec_time=0     error_code=0
SET TIMESTAMP=1436947114/*!*/;
DELETE /* PageArchive::undeleteRevisions Speculos */ FROM `archive` WHERE ar_namespace = '0' AND ar_title = 'Erreurs_amoureuses' AND ar_timestamp = '20150629083036'

So, the problem occurred sometime on 2015-06-29 where db1022 missed at least one revision deletion, and probably more stuff. The db1022 error log shows another SQL_SLAVE_SKIP_COUNTER on 2015-06-29, about when the box was being reinstalled as Jessie and (iirc) reloaded from a dump:

150629 20:12:11 [ERROR] Slave SQL: Error 'Duplicate entry '493888-0' for key 'PRIMARY'' on query. Default database: 'ruwiki'. Query: 'INSERT /* FlaggableWikiPage::updatePendingList 193.33.122.113 */  INTO `flaggedpage_pending` (fpp_page_id,fpp_quality,fpp_rev_id,fpp_pending_since) VALU
ES ('493888','0','42579174','20120928055510')', Internal MariaDB error code: 1062
...
150629 20:16:31 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./db1022-relay-bin.001845', relay_log_pos='874380', master_log_name='db1023-bin.000615', master_log_pos='1016985208' and new position at relay_log_file='./db1022-relay-bin.001845', relay_log_pos='875923', master_log_name='db1023-bin.000615', master_log_pos='1016986751'

@jcrespo, did anything seem odd back then with the reinstall and reload? Should we reclone this box entirely?

Details

Subject	Repo	Branch	Lines +/-
Setting weight values for s6 to original production values	operations/mediawiki-config	master	+4 -4
Repool db1050 & db1022 after emergency maintenance with low weight	operations/mediawiki-config	master	+2 -2
Reconfiguring db1050 after cloning	operations/puppet	production	+2 -2
Emergency depool of db1050 for cloning to to db1022	operations/mediawiki-config	master	+3 -3

Customize query in gerrit

Related Objects

Mentioned In: T101516: Upgrade db1022, which has an older kernel
T106647: mariadb multi-source replication glitch with site_identifiers
rOMWCefeceb5749fd: depool db1022 T105879

Event Timeline

• Springle created this task.Jul 15 2015, 1:06 PM

• Springle claimed this task.

• Springle raised the priority of this task from to High.

• Springle updated the task description. (Show Details)

• Springle added projects: acl*sre-team, DBA.

• Springle added subscribers: • Springle, • jcrespo.

Restricted Application added subscribers: Matanya, Aklapper. · View Herald TranscriptJul 15 2015, 1:06 PM

For now, I will depool db1022.

Change 224783 had a related patch set uploaded (by Springle):
depool db1022 T105879

https://gerrit.wikimedia.org/r/224783

Change 224783 merged by jenkins-bot:
depool db1022 T105879

https://gerrit.wikimedia.org/r/224783

• Springle mentioned this in rOMWCefeceb5749fd: depool db1022 T105879.Jul 15 2015, 1:11 PM

@Springle db1022 was not exported/imported- it was physically backed up and restarted.

@jcrespo, ouch, so maybe something is broken in our physical backup process (was it xtrabackup?) that required the SQL_SLAVE_SKIP_COUNTER after reinstall on 2015-06-29? Scary...

@Springle I did not even use xtrabackup:
stop slave; service mysql stop; tar; gz; nc; md5sum; reinstall; nc; untar; mysqld_safe; mysql_upgrade

Please note that the SKIP was done once on a complete unrelated table (which already had the value).

I will investigate further by doing checksums.

• jcrespo moved this task from Triage to Backlog on the DBA board.Jul 20 2015, 3:39 PM

• jcrespo mentioned this in T106647: mariadb multi-source replication glitch with site_identifiers.Jul 23 2015, 7:03 AM

• Springle closed this task as Resolved.Aug 28 2015, 4:25 AM

• jcrespo mentioned this in T101516: Upgrade db1022, which has an older kernel.Oct 28 2015, 11:34 AM

• jcrespo reopened this task as Open.Dec 26 2015, 6:26 PM

We need to reimage db1022, something is very wrong with it.

Change 261190 had a related patch set uploaded (by Jcrespo):
Emergency depool of db1050 for cloning to to db1022

https://gerrit.wikimedia.org/r/261190

Change 261191 had a related patch set uploaded (by Jcrespo):
Reconfiguring db1050 after cloning

https://gerrit.wikimedia.org/r/261191

Change 261190 merged by Jcrespo:
Emergency depool of db1050 for cloning to to db1022

https://gerrit.wikimedia.org/r/261190

Change 261191 merged by Jcrespo:
Reconfiguring db1050 after cloning

https://gerrit.wikimedia.org/r/261191

Change 261271 had a related patch set uploaded (by Jcrespo):
Repool db1050 & db1022 after emergency maintenance with low weight

https://gerrit.wikimedia.org/r/261271

Change 261271 merged by Jcrespo:
Repool db1050 & db1022 after emergency maintenance with low weight

https://gerrit.wikimedia.org/r/261271

Change 261331 had a related patch set uploaded (by Jcrespo):
Setting weight values for s6 to original production values

https://gerrit.wikimedia.org/r/261331

Change 261331 merged by Jcrespo:
Setting weight values for s6 to original production values