Missing row in enwiki.archive on sanitarium
Closed, Resolved · Public

Description

Replication is broken on the sanitarium hosts db1124 and db2094 with the following messages:
db1124:

CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table enwiki.archive: Can't find record in 'archive', Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1106-bin.001274, end_log_pos 586890683

db2094:

CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table enwiki.archive: Can't find record in 'archive', Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db2072-bin.001782, end_log_pos 170138133
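
The alert text mirrors what the replica itself reports. As a minimal sketch, the broken state can be confirmed by hand on the replica (field names as in MariaDB's SHOW SLAVE STATUS output):

-- Run on the broken replica (db1124 / db2094); the fields below
-- correspond to the alert text above.
SHOW SLAVE STATUS\G
--   Slave_SQL_Running: No
--   Last_SQL_Errno: 1032
--   Last_SQL_Error: Could not execute Update_rows_v1 event on table enwiki.archive ...
--   Relay_Master_Log_File / Exec_Master_Log_Pos locate the failing event.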
Banyek created this task. Nov 7 2018, 3:21 PM
Restricted Application added a subscriber: Aklapper. Nov 7 2018, 3:21 PM
Banyek triaged this task as High priority. Nov 7 2018, 3:22 PM

I'll check which row the event tries to update and create a fake row on both hosts to let replication continue. After that is done, I'll run a table comparison on that table between the sanitarium hosts and the sanitarium master.
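
For context, a minimal sketch of that plan, assuming the missing row's column values have already been read out of the decoded binlog event (the ar_id below is a placeholder, not the real value):

-- Insert a placeholder row with the key the stuck UPDATE event expects;
-- in practice every column value is taken from the binlog event itself.
INSERT INTO enwiki.archive (ar_id, ar_rev_id /* , remaining columns */)
VALUES (111111111, 647689559 /* , values from the binlog event */);
-- Then let replication retry the event:
START SLAVE;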

Banyek added a comment (edited). Nov 7 2018, 3:52 PM

After checking the binlogs, the following update cannot be applied on db1124:

[content masked]

The row is indeed missing from the sanitarium host, so I'll create it and restart replication.
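
(For reference, a sketch of how such an event can be located; the coordinates come from the error message above. SHOW BINLOG EVENTS lists row events but not their data, so decoding the actual column values takes mysqlbinlog with -v and --base64-output=DECODE-ROWS on the host itself.)

-- Run on the binlog's host (db1106); FROM must be a valid event start
-- offset a little before the reported end_log_pos 586890683.
SHOW BINLOG EVENTS IN 'db1106-bin.001274' FROM 586890000 LIMIT 20;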

Banyek added a comment (edited). Nov 7 2018, 4:02 PM

The query I'll run is:

INSERT INTO `enwiki`.`archive` VALUES [content_masked];
Banyek added a comment (edited). Nov 7 2018, 4:13 PM

And that does not work:

INSERT INTO `enwiki`.`archive` VALUES [content_masked];
ERROR 1062 (23000): Duplicate entry '647689559' for key 'ar_revid_uniq'
MariaDB [enwiki]> select * from enwiki.archive where ar_rev_id='647689559';
[data_exists]
1 row in set (0.00 sec)
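
So the table has drifted: the sanitarium already holds a row with that ar_rev_id, but not the one production is updating. A quick way to eyeball the divergence (a sketch; run the same query on the sanitarium and its master and diff the output):

-- Compare the conflicting row between the sanitarium and its master.
SELECT ar_id, ar_namespace, ar_title, ar_rev_id
FROM enwiki.archive
WHERE ar_rev_id = 647689559;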
Banyek added a comment. Nov 7 2018, 4:15 PM

So I think the safest approach is to treat this the same way as T208672:
(Truncate the table and reimport from db1106)

Banyek added a comment. Nov 7 2018, 4:24 PM

db1124:

SET GLOBAL replicate_wild_ignore_table ='mysql.%,oai.%,advisorswiki.%,arbcom_cswiki.%,arbcom_dewiki.%,arbcom_enwiki.%,arbcom_fiwiki.%,arbcom_nlwiki.%,auditcomwiki.%,boardgovcomwiki.%,boardwiki.%,chairwiki.%,chapcomwiki.%,checkuserwiki.%,collabwiki.%,ecwikimedia.%,electcomwiki.%,execwiki.%,fdcwiki.%,grantswiki.%,id_internalwikimedia.%,iegcomwiki.%,ilwikimedia.%,internalwiki.%,legalteamwiki.%,movementroleswiki.%,noboard_chapterswikimedia.%,officewiki.%,ombudsmenwiki.%,otrs_wikiwiki.%,projectcomwiki.%,searchcomwiki.%,spcomwiki.%,stewardwiki.%,techconductwiki.%,transitionteamwiki.%,wg_enwiki.%,wikimaniateamwiki.%,zerowiki.%,%.__wmf_checksums,%.accountaudit_login,%.arbcom1_vote,%.archive_old,%.blob_orphans,%.blob_tracking,%.bot_passwords,%.bv2009_edits,%.categorylinks_old,%.click_tracking,%.cu_changes,%.cu_log,%.cur,%.echo_email_batch,%.echo_event,%.echo_target_page,%.echo_unread_wikis,%.echo_notification,%.edit_page_tracking,%.email_capture,%.exarchive,%.exrevision,%.filejournal,%.globalnames,%.hidden,%.image_old,%.job,%.linkscc,%.localnames,%.log_search,%.logging_old,%.long_run_profiling,%.migrateuser_medium,%.moodbar_feedback,%.moodbar_feedback_response,%.msg_resource,%.oathauth_users,%.oauth_accepted_consumer,%.oauth_registered_consumer,%.objectcache,%.old_growth,%.oldimage_old,%.optin_survey,%.prefstats,%.prefswitch_survey,%.profiling,%.querycache,%.querycache_info,%.querycache_old,%.querycachetwo,%.reading_list,%.reading_list_entry,%.securepoll_cookie_match,%.securepoll_elections,%.securepoll_entity,%.securepoll_lists,%.securepoll_msgs,%.securepoll_options,%.securepoll_properties,%.securepoll_questions,%.securepoll_strike,%.securepoll_voters,%.securepoll_votes,%.spoofuser,%.text,%.titlekey,%.transcache,%.uploadstash,%.user_newtalk,%.vote_log,%.watchlist, enwiki.archive'
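
(That is the production filter list with enwiki.archive appended at the end, so the replica skips events for that table while it is rebuilt. A sketch of the full sequence; MariaDB refuses to change this filter while the slave threads are running:)

STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table = '...original filters..., enwiki.archive';
START SLAVE;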
Banyek added a comment. Nov 7 2018, 4:26 PM

I did the same on db2094 too.
Now both slaves are catching up. I'll reimport the data afterwards, first on db2094, and if there are no issues, on db1124 too.

Banyek added a comment. Nov 7 2018, 4:54 PM

Replicas caught up.
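
With replication flowing again, the table comparison mentioned earlier becomes possible. One lightweight, if blunt, option is CHECKSUM TABLE run on both the sanitarium and its master (a sketch; pt-table-checksum would be the more rigorous tool):

-- Run on both hosts while the table is quiet and compare the results;
-- a mismatch means the table needs a full reimport.
CHECKSUM TABLE enwiki.archive;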

Mentioned in SAL (#wikimedia-operations) [2018-11-07T21:06:09Z] <banyek> stopping replication on db2072 (T208954)

Banyek added a subscriber: Bstorm. Nov 7 2018, 11:21 PM

I restored the data on db2094 (the same way as in T208672).
The time between the TRUNCATE TABLE and the end of the load was 82 minutes.
Now I need to do the same on db1124, but I have put that on hold, because that host is the active sanitarium for the labsdb hosts. After talking about this with @Bstorm, I am not sure whether we should notify the users about this.
I'd say we should, but I need a second opinion from @jcrespo.

jcrespo added a comment.

@Anomie I do strongly believe that mediawiki has recently gone unsafe (regression) in the latest releases; this is one of the many replication issues we have had recently. We should search recent commits for unsafe statements or this will continue happening.

There are no actionables for you here, feel free to unsubscribe, but I wanted to give you a heads-up that we do need to do a code audit of SQL statements modified recently (or batch-run recently). Note that the new sanitariums were set up very recently, copied directly from production, and do not carry the issues that were fixed long ago with the old, already-fixed archive INSERT...SELECT.

Change 472405 had a related patch set uploaded (by Banyek; owner: Banyek):
[operations/mediawiki-config@master] mariadb: depool db1106

https://gerrit.wikimedia.org/r/472405

Banyek moved this task from Triage to In progress on the DBA board. Nov 8 2018, 11:49 AM
Anomie added a comment. Nov 8 2018, 2:33 PM

> @Anomie I do strongly believe that mediawiki has recently gone unsafe (regression) in the latest releases; this is one of the many replication issues we have had recently. We should search recent commits for unsafe statements or this will continue happening.
>
> There are no actionables for you here, feel free to unsubscribe, but I wanted to give you a heads-up that we do need to do a code audit of SQL statements modified recently (or batch-run recently). Note that the new sanitariums were set up very recently, copied directly from production, and do not carry the issues that were fixed long ago with the old, already-fixed archive INSERT...SELECT.

Are we sure that all replicas were identical until recently? Or is it possible that the archive table was inconsistent on different replicas due to old INSERT...SELECT issues and we're only finding them now thanks to this maintenance run that's updating all the archive table rows?

The fact that the new sanitariums were recently set up doesn't really matter if they cloned existing inconsistencies.

Looking at enwiki revision 647689559, the relevant archive table rows were most likely created at 2015-02-18T16:06:03Z.

Mentioned in SAL (#wikimedia-operations) [2018-11-09T09:02:50Z] <banyek> depooling db1106 (T208954)

Change 472405 merged by Banyek:
[operations/mediawiki-config@master] mariadb: depool db1106

https://gerrit.wikimedia.org/r/472405

Mentioned in SAL (#wikimedia-operations) [2018-11-09T09:21:24Z] <banyek> stopping replication on db1106 (T208954)

Banyek added a comment. Nov 9 2018, 9:25 AM

Creating a dump of enwiki.archive on db1106 with:

mysqldump enwiki archive --skip-ssl --no-create-info > enwiki.archive.sql

Mentioned in SAL (#wikimedia-operations) [2018-11-09T09:45:50Z] <banyek> truncating enwiki.archive on db1124 and labsdb hosts too (T208954)

Banyek added a comment. Nov 9 2018, 9:46 AM

The table was copied to db1124 with:

transfer.py db1106.eqiad.wmnet:/home/banyek/enwiki.archive.sql db1124.eqiad.wmnet:/home/banyek/

Loading the dump into db1124 with:

mysql -S /run/mysqld/mysqld.s1.sock --skip-ssl enwiki <enwiki.archive.sql

The replication filters were set back to the original ones.
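
Per the SAL entries, the load was bracketed by a TRUNCATE beforehand and the filter restore afterwards; roughly (a sketch, with the statements assumed from the SAL entries rather than quoted from the session):

-- Before the load (see the 09:45 SAL entry above):
TRUNCATE TABLE enwiki.archive;
-- ... the mysql client load shown above runs here ...
-- After the load, restore the original filter and resume replication:
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table = '...original value, without enwiki.archive...';
START SLAVE;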

Mentioned in SAL (#wikimedia-operations) [2018-11-09T11:45:32Z] <banyek> data load finished restarting replication on db1106 (T208954)

Replication caught up.

Mentioned in SAL (#wikimedia-operations) [2018-11-09T12:34:15Z] <banyek> repooling db1106 (T208954)

Banyek closed this task as Resolved. Nov 9 2018, 1:04 PM
Banyek claimed this task.

I re-ran the check_private_data script on db1124 and the sanitarium is clean.
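
(For illustration only, a hand-rolled spot check in the spirit of what check_private_data verifies; the real script covers many more tables and columns, and this pair of columns is just an example from the MediaWiki schema:)

-- On a clean sanitarium these sensitive columns are blanked;
-- any non-zero count would mean private data leaked through.
SELECT COUNT(*) AS leaked
FROM enwiki.user
WHERE user_password != '' OR user_email != '';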