Page MenuHomePhabricator

db1022 duplicate key errors
Closed, ResolvedPublic

Description

db1022 (the jessie slave in s6) encountered a duplicate key error for replication:

150715  7:58:35 [ERROR] Slave SQL: Error 'Duplicate entry '116393999' for key 'rev_id'' on query. Default database: 'frwiki'. Query: 'INSERT /* Revision::insertOn Speculos */  INTO `revision` (rev_id,rev_page,rev_text_id,rev_comment,rev_minor_edit,rev_user,rev_user_text,rev_timestamp,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_model,rev_content_format) VALUES ('116393999','9284144','116332767','Résumé automatique : contenu remplacé par « {{suppression Immédiate|bac à sable et pb de droit d\'auteur éventuel}} ».','0','37895','Nohky','20150629083036','0','73','0','6sv0jpus8bhmaznoo8q4t3v8782b1ev',NULL,NULL)', Internal MariaDB error code: 1062
150715  7:58:35 [Warning] Slave: Duplicate entry '116393999' for key 'rev_id' Error_code: 1062
150715  7:58:35 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1023-bin.000644' position 779771134
150715  8:08:01 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./db1022-relay-bin.001931', relay_log_pos='779771422', master_log_name='db1023-bin.000644', master_log_pos='779771134' and new position at relay_log_file='./db1022-relay-bin.001931', relay_log_pos='779775490', master_log_name='db1023-bin.000644', master_log_pos='779775202'

The SQL_SLAVE_SKIP_COUNTER was me responding.

The revision record rev_id 116393999 did exist on db1022 but not an any other s6 slave (and so the insert succeeded elsewhere). The db1022 relay log and db1023 master binlog show this was a legitimate revision undelete:

# at 779771577
#150715  7:58:34 server id 101612  end_log_pos 779772139        Query   thread_id=893423917     exec_time=0     error_code=0
SET TIMESTAMP=1436947114/*!*/;
INSERT /* Revision::insertOn Speculos */  INTO `revision` (rev_id,rev_page,rev_text_id,rev_comment,rev_minor_edit,rev_user,rev_user_text,rev_timestamp,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_content_model,rev_content_format) VALUES ('116393999','9284144','116332767','Résumé automatique : contenu remplacé par « {{suppression Immédiate|bac à sable et pb de droit d\'auteur éventuel}} ».','0','37895','Nohky','20150629083036','0','73','0','6sv0jpus8bhmaznoo8q4t3v8782b1ev',NULL,NULL)
/*!*/;
# at 779772139
#150715  7:58:34 server id 101612  end_log_pos 779772368        Query   thread_id=893423917     exec_time=0     error_code=0
SET TIMESTAMP=1436947114/*!*/;
DELETE /* PageArchive::undeleteRevisions Speculos */ FROM `archive` WHERE ar_namespace = '0' AND ar_title = 'Erreurs_amoureuses' AND ar_timestamp = '20150629083036'

So, the problem occurred sometime on 2015-06-29 where db1022 missed at least one revision deletion, and probably more stuff. The db1022 error log shows another SQL_SLAVE_SKIP_COUNTER on 2015-06-29, about when the box was being reinstalled as Jessie and (iirc) reloaded from a dump:

150629 20:12:11 [ERROR] Slave SQL: Error 'Duplicate entry '493888-0' for key 'PRIMARY'' on query. Default database: 'ruwiki'. Query: 'INSERT /* FlaggableWikiPage::updatePendingList 193.33.122.113 */  INTO `flaggedpage_pending` (fpp_page_id,fpp_quality,fpp_rev_id,fpp_pending_since) VALU
ES ('493888','0','42579174','20120928055510')', Internal MariaDB error code: 1062
...
150629 20:16:31 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./db1022-relay-bin.001845', relay_log_pos='874380', master_log_name='db1023-bin.000615', master_log_pos='1016985208' and new position at relay_log_file='./db1022-relay-bin.001845', relay_log_pos='875923', master_log_name='db1023-bin.000615', master_log_pos='1016986751'

@jcrespo, did anything seem odd back then with the reinstall and reload? Should we reclone this box entirely?

Event Timeline

Springle claimed this task.
Springle raised the priority of this task from to High.
Springle updated the task description. (Show Details)
Springle added projects: acl*sre-team, DBA.

Change 224783 had a related patch set uploaded (by Springle):
depool db1022 T105879

https://gerrit.wikimedia.org/r/224783

@Springle db1022 was not exported/imported- it was physically backed up and restarted.

@jcrespo, ouch, so maybe something is broken in our physical backup process (was it xtrabackup?) that required the SQL_SLAVE_SKIP_COUNTER after reinstall on 2015-06-29? Scary...

@Springle I did not even use xtrabackup:
stop slave; service mysql stop; tar; gz; nc; md5sum; reinstall; nc; untar; mysqld_safe; mysql_upgrade

Please note that the SKIP was done once on a complete unrelated table (which already had the value).

I will investigate further by doing checksums.

We need to reimage db1022, something is very wrong with it.

Change 261190 had a related patch set uploaded (by Jcrespo):
Emergency depool of db1050 for cloning to to db1022

https://gerrit.wikimedia.org/r/261190

Change 261191 had a related patch set uploaded (by Jcrespo):
Reconfiguring db1050 after cloning

https://gerrit.wikimedia.org/r/261191

Change 261190 merged by Jcrespo:
Emergency depool of db1050 for cloning to to db1022

https://gerrit.wikimedia.org/r/261190

Change 261191 merged by Jcrespo:
Reconfiguring db1050 after cloning

https://gerrit.wikimedia.org/r/261191

Change 261271 had a related patch set uploaded (by Jcrespo):
Repool db1050 & db1022 after emergency maintenance with low weight

https://gerrit.wikimedia.org/r/261271

Change 261271 merged by Jcrespo:
Repool db1050 & db1022 after emergency maintenance with low weight

https://gerrit.wikimedia.org/r/261271

Change 261331 had a related patch set uploaded (by Jcrespo):
Setting weight values for s6 to original production values

https://gerrit.wikimedia.org/r/261331

Change 261331 merged by Jcrespo:
Setting weight values for s6 to original production values

https://gerrit.wikimedia.org/r/261331