Page MenuHomePhabricator

Duplicate rows error in db2095 replication @s7
Closed, ResolvedPublic

Description

We have an active icinga alert as

CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Update_rows_v1 event on table kowiki.archive: Duplicate entry '89988' for key 'ar_revid_uniq', Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the event's master log db2077-bin.001640, end_log_pos 792366153

The corresponding entry on db2095 is:

*************************** 1. row ***************************
            ar_id: 1011833
     ar_namespace: 0
         ar_title: 수소_폭탄
       ar_comment:
    ar_comment_id: 0
          ar_user: 942
     ar_user_text: ChongDae
         ar_actor: 0
     ar_timestamp: 20050515133948
    ar_minor_edit: 1
        ar_rev_id: 89988
       ar_text_id: 89988
       ar_deleted: 0
           ar_len: 23
       ar_page_id: 19711
     ar_parent_id: 49184
          ar_sha1: cenuot3lw5x0rhfjqeshqnj0llipqw5
 ar_content_model: NULL
ar_content_format: NULL
1 row in set (0.00 sec)

and on db2077 is:

*************************** 1. row ***************************
            ar_id: 1011832
     ar_namespace: 0
         ar_title: 수소_폭탄
       ar_comment:
    ar_comment_id: 3
          ar_user: 942
     ar_user_text: ChongDae
         ar_actor: 0
     ar_timestamp: 20050515133948
    ar_minor_edit: 1
        ar_rev_id: 89988
       ar_text_id: 89988
       ar_deleted: 0
           ar_len: 23
       ar_page_id: 19711
     ar_parent_id: 49184
          ar_sha1: cenuot3lw5x0rhfjqeshqnj0llipqw5
 ar_content_model: NULL
ar_content_format: NULL

And I am not sure how we suposed to deal with errors like this, so I just leave this 'ack'-d as the db2095 is a sanitarium host for codfw - which is not in production now, it can wait until Monday.
@jcrespo do you have any suggestions?

Event Timeline

This is what I did for T208565: stop replication, add db.table to the long list of filters (carefully), let it catch up, then stop the replication on its master (beware of icinga alerts, downtime all possible alerts first), truncate the table without altering its definition and reimport it logically -eg mysqldump- (the triggers should take care of the sanitization), restore the replication filter to the original state and restart replication on the master.

That will fix the immediate issue correctly as its master users ROW based replication, but given this is the second time this happens -the first was T208565- we should identify a root cause and review (compare) all sanitariums on codfw, specially the archive table. This could be a codfw general issue, an application issue, or just a sanitarium import issue- it didn't happen on eqiad sanitariums even if they were identical not a long time ago.

uhh, that's sounds really complex, but doable, I'll work on this tomorrow

the db2095 - db1125 and db2094 - db1124 shouldn't be identical? If so, it wouldn't be good to discard the santiarium host db2095 and clone it from db1125?

Banyek triaged this task as Medium priority.Nov 5 2018, 10:14 AM

hm... replication broken again, now on metawiki.archive

could T208565 and this ticket be related to T208695 either in root cause or in surface reason (writing a lot of rarely-written rows?).

hm... replication broken again, now on metawiki.archive

I will try once the same thing- if it breaks again we should be thinking about a recloning (of a full preventive check on all archive tables). Also please comment there when you do a corrective measure e.g. "I did T208672#4718738 on section X, things went well/badly" so I am up to date with the status of the fix.

Sadly Manuel was the person to setup these so I don't know the details of how it was done.

For the kowiki example, I note the following:

db1062 (current master)
+---------+--------------+---------------+------------+---------------+---------+--------------+----------+----------------+---------------+-----------+------------+------------+--------+------------+--------------+---------------------------------+------------------+-------------------+
| ar_id   | ar_namespace | ar_title      | ar_comment | ar_comment_id | ar_user | ar_user_text | ar_actor | ar_timestamp   | ar_minor_edit | ar_rev_id | ar_text_id | ar_deleted | ar_len | ar_page_id | ar_parent_id | ar_sha1                         | ar_content_model | ar_content_format |
+---------+--------------+---------------+------------+---------------+---------+--------------+----------+----------------+---------------+-----------+------------+------------+--------+------------+--------------+---------------------------------+------------------+-------------------+
| 1011832 |            0 | 수소_폭탄     |            |             3 |     942 | ChongDae     |        0 | 20050515134522 |             0 |     49191 |      49191 |          0 |     24 |      19711 |        89988 | hey8xwlrgqvp1qnq1na64hl4gecmpn6 | NULL             | NULL              |
| 1011833 |            0 | 수소_폭탄     |            |             3 |     942 | ChongDae     |        0 | 20050515133948 |             1 |     89988 |      89988 |          0 |     23 |      19711 |        49184 | cenuot3lw5x0rhfjqeshqnj0llipqw5 | NULL             | NULL              |
+---------+--------------+---------------+------------+---------------+---------+--------------+----------+----------------+---------------+-----------+------------+------------+--------+------------+--------------+---------------------------------+------------------+-------------------+
db2077
+---------+--------------+---------------+------------+---------------+---------+--------------+----------+----------------+---------------+-----------+------------+------------+--------+------------+--------------+---------------------------------+------------------+-------------------+
| ar_id   | ar_namespace | ar_title      | ar_comment | ar_comment_id | ar_user | ar_user_text | ar_actor | ar_timestamp   | ar_minor_edit | ar_rev_id | ar_text_id | ar_deleted | ar_len | ar_page_id | ar_parent_id | ar_sha1                         | ar_content_model | ar_content_format |
+---------+--------------+---------------+------------+---------------+---------+--------------+----------+----------------+---------------+-----------+------------+------------+--------+------------+--------------+---------------------------------+------------------+-------------------+
| 1011832 |            0 | 수소_폭탄     |            |             3 |     942 | ChongDae     |        0 | 20050515133948 |             1 |     89988 |      89988 |          0 |     23 |      19711 |        49184 | cenuot3lw5x0rhfjqeshqnj0llipqw5 | NULL             | NULL              |
| 1011833 |            0 | 수소_폭탄     |            |             3 |     942 | ChongDae     |        0 | 20050515134522 |             0 |     49191 |      49191 |          0 |     24 |      19711 |        89988 | hey8xwlrgqvp1qnq1na64hl4gecmpn6 | NULL             | NULL              |
+---------+--------------+---------------+------------+---------------+---------+--------------+----------+----------------+---------------+-----------+------------+------------+--------+------------+--------------+---------------------------------+------------------+-------------------+

It appears that somehow db2077 has these two rows switched around relative to db1062, as might have happened if they were created by a statement-replicated INSERT SELECT at some point.

As far as I can tell these rows were probably created from the MediaWiki side at 2016-01-06T04:42:12Z, based on the row in the logging table with log_id=958193, which would have been before rMW673371e2c71b: Avoid INSERT..SELECT in doArticleDeleteReal() removed an INSERT SELECT from the deletion logic.

Mentioned in SAL (#wikimedia-operations) [2018-11-06T11:07:31Z] <banyek> stopping replication on db2077 (T208672)

I stopped the replication on db2077.

stop slave

I created the table dumps on db2077 with :

mysqldump metawiki archive --skip-ssl --no-create-info > metawiki.archive.sql
mysqldump kowiki archive --skip-ssl --no-create-info > kowiki.archive.sql

from cumin1001 I copied them over to db2095 with

PYTHONPATH=/home/banyek/wmfmariadbpy/ ./transfer.py db2077.codfw.wmnet:/srv/tabledump/metawiki.archive.sql.gz db2095.codfw.wmnet:/home/banyek/
PYTHONPATH=/home/banyek/wmfmariadbpy/ ./transfer.py db2077.codfw.wmnet:/srv/tabledump/kowiki.archive.sql.gz db2095.codfw.wmnet:/home/banyek/

I truncate the tables on db2095 and then import the tables

mysql --skip-ssl -S /run/mysqld/mysqld.s7.sock
MariaDB > truncate table metawiki.archive;
MariaDB > truncate table kowiki.archive;

mysql --skip-ssl -S /run/mysqld/mysqld.s7.sock metawiki < metawiki.archive.sql
mysql --skip-ssl -S /run/mysqld/mysqld.s7.sock kowiki < kowiki.archive.sql

I restore replication filters on db2095 with

SET GLOBAL replicate_wild_ignore_table = mysql.%,oai.%,advisorswiki.%,arbcom_cswiki.%,arbcom_dewiki.%,arbcom_enwiki.%,arbcom_fiwiki.%,arbcom_nlwiki.%,auditcomwiki.%,boardgovcomwiki.%,boardwiki.%,chairwiki.%,chapcomwiki.%,checkuserwiki.%,collabwiki.%,ecwikimedia.%,electcomwiki.%,execwiki.%,fdcwiki.%,grantswiki.%,id_internalwikimedia.%,iegcomwiki.%,ilwikimedia.%,internalwiki.%,legalteamwiki.%,movementroleswiki.%,noboard_chapterswikimedia.%,officewiki.%,ombudsmenwiki.%,otrs_wikiwiki.%,projectcomwiki.%,searchcomwiki.%,spcomwiki.%,stewardwiki.%,techconductwiki.%,transitionteamwiki.%,wg_enwiki.%,wikimaniateamwiki.%,zerowiki.%,%.__wmf_checksums,%.accountaudit_login,%.arbcom1_vote,%.archive_old,%.blob_orphans,%.blob_tracking,%.bot_passwords,%.bv2009_edits,%.categorylinks_old,%.click_tracking,%.cu_changes,%.cu_log,%.cur,%.echo_email_batch,%.echo_event,%.echo_target_page,%.echo_unread_wikis,%.echo_notification,%.edit_page_tracking,%.email_capture,%.exarchive,%.exrevision,%.filejournal,%.globalnames,%.hidden,%.image_old,%.job,%.linkscc,%.localnames,%.log_search,%.logging_old,%.long_run_profiling,%.migrateuser_medium,%.moodbar_feedback,%.moodbar_feedback_response,%.msg_resource,%.oathauth_users,%.oauth_accepted_consumer,%.oauth_registered_consumer,%.objectcache,%.old_growth,%.oldimage_old,%.optin_survey,%.prefstats,%.prefswitch_survey,%.profiling,%.querycache,%.querycache_info,%.querycache_old,%.querycachetwo,%.reading_list,%.reading_list_entry,%.securepoll_cookie_match,%.securepoll_elections,%.securepoll_entity,%.securepoll_lists,%.securepoll_msgs,%.securepoll_options,%.securepoll_properties,%.securepoll_questions,%.securepoll_strike,%.securepoll_voters,%.securepoll_votes,%.spoofuser,%.text,%.titlekey,%.transcache,%.uploadstash,%.user_newtalk,%.vote_log,%.watchlist

and restart replication on db2077

I re-run the check_private_data.py to see the if reimport was good as the filters/triggers were working

Mentioned in SAL (#wikimedia-operations) [2018-11-09T09:21:02Z] <banyek> stopping replication on db1106 (T208672)

I re-run the check_private_data.py to see the if reimport was good as the filters/triggers were working

What's the status of this? Was this fully done?

sorry I didn't noted it here: yes, the process finished, and found no problems

sorry I didn't noted it here: yes, the process finished, and found no problems

Can this be resolved then? As the follow-ups will happen at T209048: Checking archive tables across the databases

Banyek claimed this task.