
Run pt-table-checksum on s4 (commonswiki)
Closed, Resolved (Public)

Description

Run pt-table-checksum on s4 - commonswiki.
The following tables do not have a PK and thus need to be excluded:

archive_save
blobs
categorylinks
change_tag
click_tracking
click_tracking_user_properties
cur
edit_page_tracking
hidden
imagelinks
interwiki
iwlinks
l10n_cache
langlinks
linkscc
localisation_file_hash
log_search
logging_pre_1_10
math
module_deps
msg_resource
msg_resource_links
objectcache
oldimage
pagelinks
prefstats
prefswitch_survey
profiling
querycache
querycache_info
querycachetwo
revtag
searchindex
securepoll_lists
securepoll_msgs
securepoll_properties
site_identifiers
site_stats
tag_summary
templatelinks
text
transcache
translate_messageindex
translate_reviews
user_former_groups
user_newtalk
user_properties
uw_campaign_conf
watchlist

The __wmf_checksums table, where pt-table-checksum stores its results, is placed in:

commonswiki.__wmf_checksums
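
For reference, a minimal sketch of how such a run could be invoked with standard Percona Toolkit options. Only --replicate and the ignore list come from this task; the master host placeholder and anything else (chunk sizing, lag limits, recursion method) used in production are not recorded here and are left out:

# Tables without a PK (the list above), comma-joined for --ignore-tables:
IGNORE="archive_save,blobs,categorylinks,change_tag,click_tracking,click_tracking_user_properties,cur,edit_page_tracking,hidden,imagelinks,interwiki,iwlinks,l10n_cache,langlinks,linkscc,localisation_file_hash,log_search,logging_pre_1_10,math,module_deps,msg_resource,msg_resource_links,objectcache,oldimage,pagelinks,prefstats,prefswitch_survey,profiling,querycache,querycache_info,querycachetwo,revtag,searchindex,securepoll_lists,securepoll_msgs,securepoll_properties,site_identifiers,site_stats,tag_summary,templatelinks,text,transcache,translate_messageindex,translate_reviews,user_former_groups,user_newtalk,user_properties,uw_campaign_conf,watchlist"
pt-table-checksum h=<s4-master> \
  --databases=commonswiki \
  --replicate=commonswiki.__wmf_checksums \
  --ignore-tables="$IGNORE"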

Event Timeline

Filters enabled on rc slaves:

Replicate_Wild_Ignore_Table: commonswiki.__wmf_checksums
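
For reference, this is roughly how such a filter looks on a MariaDB replica (a sketch, not the exact change applied here; the dynamic form assumes the variable can be set at runtime, which recent MariaDB versions allow). Note that pt-table-checksum refuses to run when it detects replication filters unless --no-check-replication-filters is passed:

# In my.cnf (static):
#   replicate-wild-ignore-table = commonswiki.__wmf_checksums
# Or at runtime, with the replication SQL thread stopped first:
mysql -e "STOP SLAVE SQL_THREAD"
mysql -e "SET GLOBAL replicate_wild_ignore_table = 'commonswiki.__wmf_checksums'"
mysql -e "START SLAVE SQL_THREAD"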

Mentioned in SAL (#wikimedia-operations) [2017-04-10T12:55:56Z] <marostegui> Run pt-table-checksum on s4 - T162593

This is now running. I am closely monitoring this first run to make sure no lag is generated (I would be surprised if we got any, as we haven't had any on the other shards).

This has finished, and these are the differences found:

db2065: archive, geo_tags, image
db2058: archive, geo_tags, image
db2051: archive, geo_tags, image
db2019: archive, geo_tags, image
db1059: archive, image
db1064: archive, geo_tags, image
db1068: archive, geo_tags, image
dbstore1002: archive, filearchive, geo_tags, image, wbc_entity_usage
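
Per-host lists like the one above are typically obtained by running the standard difference query against the checksum table on each replica (a sketch; the column names follow the Percona Toolkit checksum table schema):

mysql -h <replica> commonswiki -e "
  SELECT db, tbl, chunk, lower_boundary, upper_boundary
  FROM __wmf_checksums
  WHERE master_cnt <> this_cnt
     OR master_crc <> this_crc
     OR ISNULL(master_crc) <> ISNULL(this_crc);"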

That means that at least the following hosts in eqiad are in sync with the current master, but not with the one discussed (as a proposal only) in T162133#3156640:

db1053
db1056
db1081
db1084
db1091

This shard is ready for compare.py to run.

I am performing actual replaces on almost all hosts (the ones mentioned above), using db1053 as reference. The main issue (archive) is not missing or differing updates, just different ids for the same rows.
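
For rows that only differ in their ids, one possible way to fix a given id range (a sketch only: the reference host, target host and range are illustrative, and this is not necessarily the exact procedure used here) is to delete the range on the host being fixed and reload it from the reference:

mysql -h <host-to-fix> commonswiki -e "DELETE FROM archive WHERE ar_id BETWEEN 1 AND 50000"
# Reload the same range from the reference host; the range must cover the divergent
# ids on both sides, otherwise duplicates or gaps would remain:
mysqldump -h db1053 --single-transaction --no-create-info \
  --where="ar_id BETWEEN 1 AND 50000" commonswiki archive | mysql -h <host-to-fix> commonswiki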

Sorry, I meant to say that I used the current master / older master / newer hosts as references (the ones that did not fail last time). archive has now been made consistent on all s4 hosts; continuing with geo_tags.

geo_tags is now done. It was easier, as it only had the odd precision difference between hosts on some ranges, and it is not filtered on labs. Now checking and fixing image.
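
A quick way to eyeball precision differences like these (a sketch; hosts and the id range are illustrative, and the gt_lat/gt_lon column names are assumed from the GeoData extension schema) is to dump the coordinates for the same range on two hosts and diff the output:

for host in db1053 dbstore1002; do
  mysql -h "$host" commonswiki -BN -e \
    "SELECT gt_id, gt_lat, gt_lon FROM geo_tags WHERE gt_id BETWEEN 1000000 AND 1050000 ORDER BY gt_id" \
    > "geo_tags_$host.tsv"
done
diff geo_tags_db1053.tsv geo_tags_dbstore1002.tsv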

image is progressing slowly: it is a large table and it has lots of differences (rows still present on certain hosts for images that have already been deleted).
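
One way to spot those leftover rows (a sketch; hosts and file names are illustrative, img_name is assumed to be the primary key, and on a table this size it would in practice be done per name range rather than in one pass) is to export the key column from the reference host and from the host being checked, then diff the two sets:

mysql -h db1053 commonswiki -BN -e "SELECT img_name FROM image" | sort > image_reference.txt
mysql -h <host-to-check> commonswiki -BN -e "SELECT img_name FROM image" | sort > image_tocheck.txt
# Rows only present on the host being checked are candidates for deletion, after manual review:
comm -13 image_reference.txt image_tocheck.txt > image_extra_rows.txt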

image is done up to the letter C. This may sound like a small portion, but according to the records it is around two thirds of the way.

image has been done 100% for db1059, and at the same time the differences with db1068 (which is identical to db1081, db1084, db1091 and db1097) have been fixed too. That may have corrected all the other errors on the remaining hosts, but there could still be some on different rows, so I will do a quick check of the other hosts to make sure none have been skipped (especially dbstore1002).

I thought it was going to be fast, but db1056 took a whole day. db1053 is next. Hopefully the rest will not take as long.

I am quite confident about image now (I didn't check every host, but codfw, for example, seems to be in a much better state). Checking filearchive next.

All differences mentioned on T162593#3175313 handled.

Now going with

oldimage
text
watchlist

Finished the others, only watchlist left.

Found some differences, but they could be false positives (watchlist is very dynamic, which tends to produce them):

root@neodymium:~$ ./compare.py db1097 dbstore1002 commonswiki watchlist wl_id --step=50000 --group_concat_max_len=100000000
Rows are different WHERE wl_id BETWEEN 53650001 AND 53700000
Rows are different WHERE wl_id BETWEEN 63350001 AND 63400000
Rows are different WHERE wl_id BETWEEN 85450001 AND 85500000
Rows are different WHERE wl_id BETWEEN 85500001 AND 85550000
Rows are different WHERE wl_id BETWEEN 86050001 AND 86100000
Rows are different WHERE wl_id BETWEEN 86100001 AND 86126042
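
One way to tell real differences from churn (a sketch, reusing only the flags already shown above) is to re-run the same comparison with a much smaller step, so that any persistent mismatch is confined to a narrow range that is cheap to re-check by hand:

root@neodymium:~$ ./compare.py db1097 dbstore1002 commonswiki watchlist wl_id --step=1000 --group_concat_max_len=100000000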

I will need more time on Monday to close this ticket.

Those turned out to be false positives; checking the rest of the hosts now.

I am now fairly confident that the main tables on the most relevant servers are the same. I have not checked and fixed every table on every server, but this should be good enough to avoid further replication problems and to continue decommissioning old servers.