Run pt-table-checksum on s6
Closed, Resolved (Public)

Description

The last database of s2 is being checksummed on T154485 so it is time to move on to another shard.

s6 has the following servers that need to be decommissioned: db1023,db1022,db1030,db1037.
Let's checksum their data and see if we can decommission them easily.
The __wmf_checksums table is placed on every database of the shard. That means:

frwiki.__wmf_checksums
jawiki.__wmf_checksums
ruwiki.__wmf_checksums

This will be a good test of running pt-table-checksum on slightly larger tables than those on s2. s2 has given no problems so far, but its largest table was a 40G revision table on itwiki.

Event Timeline

The following tables per database do not have a PK and will need to be excluded:

frwiki:

archive_save
categorylinks
change_tag
click_tracking
click_tracking_user_properties
cur
edit_page_tracking
flaggedrevs_tracking
hidden
imagelinks
interwiki
iwlinks
l10n_cache
langlinks
linkscc
localisation_file_hash
log_search
logging_pre_1_10
math
module_deps
msg_resource
msg_resource_links
objectcache
oldimage
pagelinks
prefstats
prefswitch_survey
profiling
querycache
querycache_info
querycachetwo
searchindex
securepoll_lists
securepoll_msgs
securepoll_properties
site_identifiers
site_stats
tag_summary
templatelinks
text
transcache
user_former_groups
user_newtalk
user_properties
watchlist

jawiki:

categorylinks
change_tag
click_tracking
click_tracking_user_properties
cur
edit_page_tracking
hidden
imagelinks
interwiki
iwlinks
l10n_cache
langlinks
linkscc
localisation_file_hash
log_search
logging_pre_1_10
math
module_deps
msg_resource
msg_resource_links
objectcache
oldimage
pagelinks
prefstats
prefswitch_survey
querycache
querycache_info
querycachetwo
searchindex
securepoll_lists
securepoll_msgs
securepoll_properties
site_identifiers
site_stats
tag_summary
templatelinks
text
transcache
user_former_groups
user_newtalk
user_properties
watchlist

ruwiki:

categorylinks
change_tag
click_tracking
click_tracking_user_properties
cur
edit_page_tracking
ep_users_per_course
hidden
imagelinks
interwiki
iwlinks
l10n_cache
langlinks
linkscc
log_search
logging_pre_1_10
math
module_deps
msg_resource
msg_resource_links
objectcache
oldimage
pagelinks
prefstats
prefswitch_survey
querycache
querycache_info
querycachetwo
searchindex
securepoll_lists
securepoll_msgs
securepoll_properties
site_identifiers
site_stats
tag_summary
templatelinks
text
transcache
user_former_groups
user_newtalk
user_properties
watchlist
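
As a rough illustration of how lists like the ones above could be turned into an exclusion argument, here is a minimal Python sketch. It assumes the tables are handed to pt-table-checksum through its `--ignore-tables` option as database-qualified names. The helper name `build_ignore_tables` and the shortened input are hypothetical, and this is not necessarily how the exclusion was done on this task.

```python
def build_ignore_tables(tables_without_pk):
    """Return a comma-separated list of db.table names.

    tables_without_pk maps a database name to the tables in it that lack
    a primary key (hypothetical input, e.g. pasted from the lists above).
    """
    qualified = []
    for db in sorted(tables_without_pk):
        # set() drops accidental duplicates; sorted() keeps output stable.
        for table in sorted(set(tables_without_pk[db])):
            qualified.append(f"{db}.{table}")
    return ",".join(qualified)


# Shortened example input; the real lists are much longer.
example = {
    "frwiki": ["archive_save", "categorylinks"],
    "jawiki": ["categorylinks"],
}
ignore_arg = "--ignore-tables=" + build_ignore_tables(example)
print(ignore_arg)
```

Qualifying each name with its database keeps the exclusion per wiki, which matters here because the three lists are not identical (for example, ep_users_per_course only appears under ruwiki).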

The dsns_s6 table looks good, and I have added a replication filter on the rc slaves (db1037 and db2039) to ignore the __wmf_checksums table, to avoid the issue we already had: T154485#3050716

Replicate_Wild_Ignore_Table: ops.__wmf_checksums

Mentioned in SAL (#wikimedia-operations) [2017-03-16T07:08:48Z] <marostegui> Starting pt-table-checksum on s6 (frwiki) - T160509

I have just started the check on frwiki.

Finished running pt-table-checksum on frwiki.
Differences found on:

Differences on db1030
frwiki.archive
Differences on dbstore1002
frwiki.archive
frwiki.page_props
frwiki.wbc_entity_usage

Mentioned in SAL (#wikimedia-operations) [2017-03-21T07:22:16Z] <marostegui> Run pt-table-checksum on s6 (jawiki) - T160509

jawiki has been finished and there are no differences! \o/

ruwiki finished too, so the whole shard has been checksummed.
ruwiki has a few differences:

Differences on db2046
TABLE            CHUNK  CNT_DIFF  CRC_DIFF  CHUNK_INDEX  LOWER_BOUNDARY  UPPER_BOUNDARY
ruwiki.geo_tags  1      0         1         PRIMARY      1               81696384

Differences on db1030
TABLE           CHUNK  CNT_DIFF  CRC_DIFF  CHUNK_INDEX  LOWER_BOUNDARY  UPPER_BOUNDARY
ruwiki.archive  51330  0         1         PRIMARY      6612569         6612675
ruwiki.archive  54562  0         1         PRIMARY      7105225         7105398
ruwiki.archive  54563  0         1         PRIMARY      7105399         7105510

And dbstore1002 has a lot more.
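
When a host reports many differing chunks, it can help to load the report into structured records before deciding what to fix. The following is a minimal sketch that parses rows in the seven-column layout shown above. The names `ChunkDiff` and `parse_diff_lines` are hypothetical, and it assumes the boundary values contain no whitespace, which holds for the integer primary keys in this report but not necessarily for every table.

```python
from collections import namedtuple

ChunkDiff = namedtuple(
    "ChunkDiff",
    "table chunk cnt_diff crc_diff chunk_index lower_boundary upper_boundary",
)


def parse_diff_lines(lines):
    """Parse whitespace-separated diff rows, skipping headers and other lines."""
    diffs = []
    for line in lines:
        fields = line.split()
        # Only keep lines with exactly seven columns that are not the header.
        if len(fields) != 7 or fields[0] == "TABLE":
            continue
        table, chunk, cnt, crc, index, lower, upper = fields
        diffs.append(
            ChunkDiff(table, int(chunk), int(cnt), int(crc), index, lower, upper)
        )
    return diffs


sample = """\
Differences on db1030
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
ruwiki.archive 51330 0 1 PRIMARY 6612569 6612675
ruwiki.archive 54562 0 1 PRIMARY 7105225 7105398
""".splitlines()

for diff in parse_diff_lines(sample):
    print(diff.table, diff.chunk, diff.lower_boundary, diff.upper_boundary)
```

The boundaries are kept as strings on purpose, since they are only integers when the chunk index is a numeric key.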

I think it finally works:

$ python3 compare.py db2046.codfw.wmnet dbstore1002 ruwiki geo_tags gt_id --step=10000
Rows are different WHERE gt_id BETWEEN 81260001 AND 81270000
Rows are different WHERE gt_id BETWEEN 81330001 AND 81340000
Rows are different WHERE gt_id BETWEEN 81350001 AND 81360000
Rows are different WHERE gt_id BETWEEN 81360001 AND 81370000
Rows are different WHERE gt_id BETWEEN 81370001 AND 81380000
Rows are different WHERE gt_id BETWEEN 81650001 AND 81660000
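
For readers who want the general idea behind a range-by-range comparison, here is a minimal self-contained sketch. It is not the compare.py used above: it walks an id range in fixed steps, hashes the rows of each step on both sides, and reports the ranges whose hashes differ. The two "hosts" are plain in-memory dictionaries, and the names `chunk_digest`, `differing_ranges` and `make_fetcher` are hypothetical.

```python
import hashlib


def chunk_digest(rows):
    """Hash an ordered sequence of row tuples into one hex digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def differing_ranges(fetch_a, fetch_b, min_id, max_id, step):
    """Yield (low, high) id ranges whose rows differ between two sources.

    fetch_a and fetch_b are callables taking (low, high) and returning the
    rows with low <= id <= high, ordered by id.
    """
    low = min_id
    while low <= max_id:
        high = min(low + step - 1, max_id)
        if chunk_digest(fetch_a(low, high)) != chunk_digest(fetch_b(low, high)):
            yield low, high
        low = high + 1


def make_fetcher(table):
    """Build a fetcher over a dict {id: value}, standing in for a database host."""
    def fetch(low, high):
        return [(key, table[key]) for key in sorted(table) if low <= key <= high]
    return fetch


host_a = {i: f"row-{i}" for i in range(1, 31)}
host_b = dict(host_a)
host_b[12] = "changed"  # a modified row
del host_b[25]          # a missing row

for low, high in differing_ranges(make_fetcher(host_a), make_fetcher(host_b), 1, 30, 10):
    print(f"Rows are different WHERE id BETWEEN {low} AND {high}")
```

A smaller step narrows down where the differing rows are at the cost of more round trips, so a second pass over just the reported ranges with a smaller step is a natural follow-up.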

Change 345188 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] [WIP]Quick & dirty script to check data differences between tables

https://gerrit.wikimedia.org/r/345188

This is awesome, great job!!!

I am claiming this just for coordination purposes; that does not mean I fail to recognize that you (Manuel) have already done most of the work.

jcrespo triaged this task as High priority. Apr 11 2017, 3:05 PM

No need to even mention it! And I believe your archaeology work is a lot harder than what I have done anyway!

Core servers have all been checked and fixed. The only ones missing are:

dbstore1001
+--------+----------+------------+--------+
| db     | tbl      | total_rows | chunks |
+--------+----------+------------+--------+
| ruwiki | geo_tags |        100 |      1 |
+--------+----------+------------+--------+

dbstore1002
+--------+------------------+------------+--------+
| db     | tbl              | total_rows | chunks |
+--------+------------------+------------+--------+
| frwiki | archive          |        100 |      1 |
| frwiki | page_props       |         99 |      1 |
| frwiki | wbc_entity_usage |        300 |      3 |
+--------+------------------+------------+--------+
+--------+------------------+------------+--------+
| db     | tbl              | total_rows | chunks |
+--------+------------------+------------+--------+
| ruwiki | flaggedimages    |      19019 |    224 |
| ruwiki | flaggedrevs      |      28406 |    287 |
| ruwiki | flaggedtemplates |      27274 |    405 |
| ruwiki | page_props       |      15606 |    159 |
+--------+------------------+------------+--------+

dbstore2001
+--------+--------------+------------+--------+
| db     | tbl          | total_rows | chunks |
+--------+--------------+------------+--------+
| frwiki | bv2009_edits |          0 |   6430 |
| frwiki | cu_changes   |         99 |      1 |
| frwiki | logging      |         99 |      1 |
| frwiki | page         |        100 |      1 |
| frwiki | revision     |         99 |      1 |
| frwiki | user         |        100 |      1 |
+--------+--------------+------------+--------+
+--------+--------------+------------+--------+
| db     | tbl          | total_rows | chunks |
+--------+--------------+------------+--------+
| jawiki | bv2009_edits |          0 |   3293 |
+--------+--------------+------------+--------+
+--------+--------------------+------------+--------+
| db     | tbl                | total_rows | chunks |
+--------+--------------------+------------+--------+
| ruwiki | bv2009_edits       |          0 |   3099 |
| ruwiki | cu_changes         |         99 |      1 |
| ruwiki | externallinks      |        600 |      6 |
| ruwiki | flaggedimages      |         96 |      1 |
| ruwiki | flaggedrevs        |         99 |      1 |
| ruwiki | flaggedrevs_stats  |          0 |      1 |
| ruwiki | flaggedrevs_stats2 |          0 |      1 |
| ruwiki | flaggedtemplates   |         95 |      2 |
| ruwiki | geo_tags           |        115 |      1 |
| ruwiki | logging            |         98 |      1 |
| ruwiki | page_props         |         99 |      1 |
| ruwiki | revision           |         99 |      1 |
| ruwiki | user               |        100 |      1 |
+--------+--------------------+------------+--------+

Plus the tables to check manually.
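
The per-table summaries above have the same shape as the summary query suggested in the pt-table-checksum documentation, which sums `this_cnt` and counts chunks for every table where the replica's count or CRC differs from the master's. Assuming that is roughly how they were produced, here is a minimal sketch of that query. It uses Python's sqlite3 as a stand-in for the real database, a simplified schema, made-up rows, and a table simply called `checksums` (on this task the table is __wmf_checksums).

```python
import sqlite3

# Summarise differing chunks per table, in the spirit of the query from
# the pt-table-checksum documentation (NULL check written portably).
SUMMARY_QUERY = """
SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
FROM checksums
WHERE master_cnt <> this_cnt
   OR master_crc <> this_crc
   OR (master_crc IS NULL) <> (this_crc IS NULL)
GROUP BY db, tbl
ORDER BY db, tbl
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE checksums ("
    "db TEXT, tbl TEXT, chunk INTEGER, "
    "this_crc TEXT, this_cnt INTEGER, master_crc TEXT, master_cnt INTEGER)"
)
conn.executemany(
    "INSERT INTO checksums VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("ruwiki", "geo_tags", 1, "aaaa", 100, "bbbb", 100),  # CRC mismatch
        ("ruwiki", "page", 1, "cccc", 50, "cccc", 50),        # identical
        ("frwiki", "archive", 1, "dddd", 99, "eeee", 100),    # count mismatch
        ("frwiki", "archive", 2, "ffff", 100, "ffff", 100),   # identical
    ],
)

summary = conn.execute(SUMMARY_QUERY).fetchall()
for row in summary:
    print(row)
```

Note that total_rows is the number of rows in the differing chunks, not the number of differing rows, so a value such as 100 with one chunk may hide a single bad row.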

dbstore1001 and dbstore1002 fixed, checksums for dbstore2001 pending. dbstore2001 seems to have smaller errors, while dbstore1002 required almost full table reimports.

dbstore2001 fixed (although seeing the pattern, that one probably should be recreated).

Now checking manually on selected hosts:

text
user_properties
watchlist
oldimage

text is almost finished, only missing codfw master and analytics/dbstores.

text is ok, checking user_properties now.

Mentioned in SAL (#wikimedia-operations) [2017-04-18T16:15:00Z] <jynus> reimporting some rows to dbstore1002 on jawiki and ruwiki T160509

user_properties checked and fixed, although not on the delayed slaves as it is not an append-only table. Checking now watchlist.

watchlist looks right everywhere I looked, except dbstore1002, which seems to have errors or undeleted rows.

Only oldimage pending from this shard then?

dbstore1002:watchlist and oldimage everywhere, for the hosts I could or wanted to do (normally that means non-lagged dbstores, all old dbs, old master eqiad, new master eqiad and master codfw). I do not waste time with new dbs because those are clones of old dbs or current masters.

Makes total sense to me, just asking to have the whole picture. Thank you!

dbstore1002:watchlist is ok, the errors I found were false positives; now checking more reliably due to the primary keys.

Checking now oldimage.

oldimage checked.

To the best of my ability, no more changes are left on core tables, noting that I have not checked every server, especially those that have filters, lag or row-based replication.

I think this means we can decommission safely at least db1050 and db1023.

I would leave db1050 for now (to at least have one old master just in case).
db1023 can go.

Different rows detected on at least frwiki.archive on db2039.