
Run pt-table-checksum on s4 (commonswiki)
Closed, Resolved (Public)

Description

Run pt-table-checksum on s4 - commonswiki.
The following tables do not have a PK and thus need to be excluded:

archive_save
blobs
categorylinks
change_tag
click_tracking
click_tracking_user_properties
cur
edit_page_tracking
hidden
imagelinks
interwiki
iwlinks
l10n_cache
langlinks
linkscc
localisation_file_hash
log_search
logging_pre_1_10
math
module_deps
msg_resource
msg_resource_links
objectcache
oldimage
pagelinks
prefstats
prefswitch_survey
profiling
querycache
querycache_info
querycachetwo
revtag
searchindex
securepoll_lists
securepoll_msgs
securepoll_properties
site_identifiers
site_stats
tag_summary
templatelinks
text
transcache
translate_messageindex
translate_reviews
user_former_groups
user_newtalk
user_properties
uw_campaign_conf
watchlist

The __wmf_checksums table, where pt-table-checksum stores its results, is placed in:

commonswiki.__wmf_checksums
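
For reference, a minimal sketch of how such a run could be invoked with standard Percona Toolkit options. Only --replicate and the ignore list come from this task; the master host placeholder and anything else (chunk sizing, lag limits, recursion method) used in production are not recorded here and are left out:

# Tables without a PK (the list above), comma-joined for --ignore-tables:
IGNORE="archive_save,blobs,categorylinks,change_tag,click_tracking,click_tracking_user_properties,cur,edit_page_tracking,hidden,imagelinks,interwiki,iwlinks,l10n_cache,langlinks,linkscc,localisation_file_hash,log_search,logging_pre_1_10,math,module_deps,msg_resource,msg_resource_links,objectcache,oldimage,pagelinks,prefstats,prefswitch_survey,profiling,querycache,querycache_info,querycachetwo,revtag,searchindex,securepoll_lists,securepoll_msgs,securepoll_properties,site_identifiers,site_stats,tag_summary,templatelinks,text,transcache,translate_messageindex,translate_reviews,user_former_groups,user_newtalk,user_properties,uw_campaign_conf,watchlist"
pt-table-checksum h=<s4-master> \
  --databases=commonswiki \
  --replicate=commonswiki.__wmf_checksums \
  --ignore-tables="$IGNORE"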

Event Timeline

Filters enabled on rc slaves:

Replicate_Wild_Ignore_Table: commonswiki.__wmf_checksums
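
For reference, this is roughly how such a filter looks on a MariaDB replica (a sketch, not the exact change applied here; the dynamic form assumes the variable can be set at runtime, which recent MariaDB versions allow). Note that pt-table-checksum refuses to run when it detects replication filters unless --no-check-replication-filters is passed:

# In my.cnf (static):
#   replicate-wild-ignore-table = commonswiki.__wmf_checksums
# Or at runtime, with the replication SQL thread stopped first:
mysql -e "STOP SLAVE SQL_THREAD"
mysql -e "SET GLOBAL replicate_wild_ignore_table = 'commonswiki.__wmf_checksums'"
mysql -e "START SLAVE SQL_THREAD"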

Mentioned in SAL (#wikimedia-operations) [2017-04-10T12:55:56Z] <marostegui> Run pt-table-checksum on s4 - T162593

This is now running. I am closely monitoring this first run to make sure no lag is generated (I would be surprised if we got any, as we haven't had any on the other shards).

This has finished, and these are the differences found:

db2065: archive, geo_tags, image
db2058: archive, geo_tags, image
db2051: archive, geo_tags, image
db2019: archive, geo_tags, image
db1059: archive, image
db1064: archive, geo_tags, image
db1068: archive, geo_tags, image
dbstore1002: archive, filearchive, geo_tags, image, wbc_entity_usage
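
Per-host lists like the one above are typically obtained by running the standard difference query against the checksum table on each replica (a sketch; the column names follow the Percona Toolkit checksum table schema):

mysql -h <replica> commonswiki -e "
  SELECT db, tbl, chunk, lower_boundary, upper_boundary
  FROM __wmf_checksums
  WHERE master_cnt <> this_cnt
     OR master_crc <> this_crc
     OR ISNULL(master_crc) <> ISNULL(this_crc);"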

That means that at least the following hosts in eqiad are in sync with the current master, but not with the one discussed (as a proposal only) in T162133#3156640:

db1053
db1056
db1081
db1084
db1091

This shard is ready for compare.py to run.

I am performing actual replaces on almost all hosts (the ones mentioned above), using db1053 as reference. The main issue (archive) is not missing or differing updates, just different ids for the same rows.
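
For rows that only differ in their ids, one possible way to fix a given id range (a sketch only: the reference host, target host and range are illustrative, and this is not necessarily the exact procedure used here) is to delete the range on the host being fixed and reload it from the reference:

mysql -h <host-to-fix> commonswiki -e "DELETE FROM archive WHERE ar_id BETWEEN 1 AND 50000"
# Reload the same range from the reference host; the range must cover the divergent
# ids on both sides, otherwise duplicates or gaps would remain:
mysqldump -h db1053 --single-transaction --no-create-info \
  --where="ar_id BETWEEN 1 AND 50000" commonswiki archive | mysql -h <host-to-fix> commonswiki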

Sorry, I meant to say that I used the current master / older master / newer hosts as references (the ones that did not fail last time). archive has now been made consistent on all s4 hosts; continuing with geo_tags.

geo_tags is now done. It was easier, as it only had the odd precision difference between hosts on some ranges, and it is not filtered on labs. Now checking and fixing image.
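
A quick way to eyeball precision differences like these (a sketch; hosts and the id range are illustrative, and the gt_lat/gt_lon column names are assumed from the GeoData extension schema) is to dump the coordinates for the same range on two hosts and diff the output:

for host in db1053 dbstore1002; do
  mysql -h "$host" commonswiki -BN -e \
    "SELECT gt_id, gt_lat, gt_lon FROM geo_tags WHERE gt_id BETWEEN 1000000 AND 1050000 ORDER BY gt_id" \
    > "geo_tags_$host.tsv"
done
diff geo_tags_db1053.tsv geo_tags_dbstore1002.tsv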

image is progressing slowly: it is a large table and it has lots of differences (rows still present on certain hosts for images that have already been deleted).
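
One way to spot those leftover rows (a sketch; hosts and file names are illustrative, img_name is assumed to be the primary key, and on a table this size it would in practice be done per name range rather than in one pass) is to export the key column from the reference host and from the host being checked, then diff the two sets:

mysql -h db1053 commonswiki -BN -e "SELECT img_name FROM image" | sort > image_reference.txt
mysql -h <host-to-check> commonswiki -BN -e "SELECT img_name FROM image" | sort > image_tocheck.txt
# Rows only present on the host being checked are candidates for deletion, after manual review:
comm -13 image_reference.txt image_tocheck.txt > image_extra_rows.txt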

image is done up to the letter C. This may sound like a small portion, but according to the records it is around two thirds of the way.

image has been done 100% for db1059, and at the same time the differences with db1068 (which is identical to db1081, db1084, db1091 and db1097) have been fixed too. That may have corrected all the other errors on the remaining hosts, but there could still be some on different rows, so I will do a quick check of the other hosts to make sure none have been skipped (especially dbstore1002).

I thought it was going to be fast, but db1056 took a whole day. db1053 is next. Hopefully the rest will not take as long.

I am quite confident about image now (I didn't check every host, but codfw, for example, seems to be in a much better state). Checking filearchive next.

All differences mentioned on T162593#3175313 handled.

Now going with

oldimage
text
watchlist

Finished the others, only watchlist left.

Found some differences, but they could be false positives (watchlist is very dynamic, which tends to produce them):

root@neodymium:~$ ./compare.py db1097 dbstore1002 commonswiki watchlist wl_id --step=50000 --group_concat_max_len=100000000
Rows are different WHERE wl_id BETWEEN 53650001 AND 53700000
Rows are different WHERE wl_id BETWEEN 63350001 AND 63400000
Rows are different WHERE wl_id BETWEEN 85450001 AND 85500000
Rows are different WHERE wl_id BETWEEN 85500001 AND 85550000
Rows are different WHERE wl_id BETWEEN 86050001 AND 86100000
Rows are different WHERE wl_id BETWEEN 86100001 AND 86126042
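
One way to tell real differences from churn (a sketch, reusing only the flags already shown above) is to re-run the same comparison with a much smaller step, so that any persistent mismatch is confined to a narrow range that is cheap to re-check by hand:

root@neodymium:~$ ./compare.py db1097 dbstore1002 commonswiki watchlist wl_id --step=1000 --group_concat_max_len=100000000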

I will need more time on Monday to close this ticket.

Those turned out to be false positives; checking the rest of the hosts now.

I am now fairly confident that the main tables on the most relevant servers are the same. I have not checked and fixed every table on every server, but this should be good enough to avoid further replication problems and to continue decommissioning old servers.