run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038)
Closed, ResolvedPublic

Description

Updated title task on 24th March 2017: I have changed the title to reflect that it contains the data for s2 as it was the initial shard that was checksummed for testing, so we used this ticket for it. The rest of the shards have their own ticket.

In s3 the following servers will be decommissioned (T134476)

db1038 - old master
db1015 - rc
db1035
db1044 - temporary master for db1095 (sanitarium2)

Before shutting them down for good, we need to make sure we are not losing any data, so we'd need to run pt-table-checksum on them.
Some modifications might be needed in order to run it safely.

There are a very large number of changes, so older changes are hidden. Show Older Changes

I am running the following command now for s2 for the first iteration of testing with real data and real tables (I tested it first with just one table abuse_filter_action and it went fine):

./pt-table-checksum --no-check-binlog-format --replicate=nlwiki.__wmf_checksums --check-slave-lag db2017.codfw.wmnet,db2035.codfw.wmnet,db2041.codfw.wmnet,db2049.codfw.wmnet,db1024.eqiad.wmnet,db1021.eqiad.wmnet,db1036.eqiad.wmnet,db1054.eqiad.wmnet,db1060.eqiad.wmnet --max-lag 1 --chunk-size 100  -F /home/marostegui/.my.cnf  --databases nlwiki --tables abuse_filter_action,abuse_filter_history,abuse_filter,abuse_filter_log,archive,babel,betafeatures_user_counts,blob_orphans,blob_tracking,bv2009_edits,bv2011_edits,bv2015_edits,category,cu_changes,cu_log,ep_articles,ep_cas,ep_courses,ep_events,ep_instructors,ep_oas,ep_orgs,ep_students,externallinks,filearchive,filejournal,geo_tags,global_block_whitelist,image,ipblocks,job,moodbar_feedback,moodbar_feedback_response,ores_classification,ores_model,page,page_props,page_restrictions,pif_edits,povwatch_log,povwatch_subscribers,pr_index,protected_titles,recentchanges,redirect,sites,spoofuser,titlekey,trackbacks,transcode,updatelog,updates,uploadstash,user_groups,user,valid_tag,wbc_entity_usage h=db1018.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_s2

revision and logging tables are excluded, and will be done on individual batches for now.
The biggest table of this batch is a 5GB one externalinks

The first iteration of the above run finished:

  • No lag generated
  • 1 error with db1047
02-22T10:20:40 Skipping table nlwiki.protected_titles because on the master it would be checksummed in one chunk but on these replicas it has too many rows:
  236 rows on db1047
  • Differences found on db1036 and db1047 only:
root@neodymium:/home/marostegui/percona-toolkit-2.2.20/bin# ./pt-table-checksum --no-check-binlog-format --replicate-check-only --replicate=nlwiki.__wmf_checksums --check-slave-lag db2017.codfw.wmnet,db2035.codfw.wmnet,db2041.codfw.wmnet,db2049.codfw.wmnet,db1024.eqiad.wmnet,db1021.eqiad.wmnet,db1036.eqiad.wmnet,db1054.eqiad.wmnet,db1060.eqiad.wmnet --max-lag 1 --chunk-size 100  -F /home/marostegui/.my.cnf  --databases nlwiki --tables abuse_filter_action,abuse_filter_history,abuse_filter,abuse_filter_log,archive,babel,betafeatures_user_counts,blob_orphans,blob_tracking,bv2009_edits,bv2011_edits,bv2015_edits,category,cu_changes,cu_log,ep_articles,ep_cas,ep_courses,ep_events,ep_instructors,ep_oas,ep_orgs,ep_students,externallinks,filearchive,filejournal,geo_tags,global_block_whitelist,image,ipblocks,job,moodbar_feedback,moodbar_feedback_response,ores_classification,ores_model,page,page_props,page_restrictions,pif_edits,povwatch_log,povwatch_subscribers,pr_index,protected_titles,recentchanges,redirect,sites,spoofuser,titlekey,trackbacks,transcode,updatelog,updates,uploadstash,user_groups,user,valid_tag,wbc_entity_usage h=db1018.eqiad.wmnet,u=root --recursion-method=dsn=h=db1011.eqiad.wmnet,D=dsns,t=dsns_s2
Differences on db1036
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nlwiki.archive 24161 0 1 PRIMARY 4033673 4033804
nlwiki.archive 24388 0 1 PRIMARY 4070658 4070797
nlwiki.archive 24787 0 1 PRIMARY 4135363 4135480
nlwiki.archive 24788 0 1 PRIMARY 4135481 4135580
nlwiki.archive 26405 0 1 PRIMARY 4375953 4376257
nlwiki.archive 27464 0 1 PRIMARY 4529887 4530040
nlwiki.archive 27654 0 1 PRIMARY 4557604 4557744
nlwiki.archive 28007 0 1 PRIMARY 4606233 4606337

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nlwiki.archive 11253 0 1 PRIMARY 1881326 1881425
nlwiki.archive 23777 0 1 PRIMARY 3974940 3975051

Significant table times (in seconds) and size:

375.203 nlwiki.archive - 1.3G
821.313 nlwiki.blob_tracking - 1.1G
765.786 nlwiki.externallinks - 5G
499.988 nlwiki.page_props - 672M
316.525 nlwiki.recentchanges - 1.7G
358.347 nlwiki.titlekey - 496M
486.890 nlwiki.updates - 604M
388.817 nlwiki.wbc_entity_usage - 800M

Given that no lag has been generated at all, chunk-size=100 is probably the way to go. We should try to go now for logging and revision tables on different batches too.

For the record:
Replication got broken on db1069 (sanitarium) with:

Last_Error: Error 'Table 'nlwiki.pr_index' doesn't exist' on query. Default database: 'nlwiki'. Query: 'REPLACE INTO `nlwiki`.`__wmf_checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT 'nlwiki', 'pr_index', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `pr_page_id`, `pr_count`, `pr_q0`, `pr_q1`, `pr_q2`, `pr_q3`, `pr_q4`)) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `nlwiki`.`pr_index` /*checksum table*/'

That happened for all the tables that do exist on production but not in sanitarium/labs

Maybe we should ignore all edits on '%.__wmf_checksums' ? After all, if writes happen on the master, this method will not catch differences due to labs's master using row format...

Maybe we should ignore all edits on '%.__wmf_checksums' ? After all, if writes happen on the master, this method will not catch differences due to labs's master using row format...

Yes, it doesn't make much sense to have it on labs/sanitarium anyways as data itself will mostly be different due to the sanitization

Change 339182 had a related patch set uploaded (by Marostegui):
realm.pp: Add __wmf_checksum table to be ignored

https://gerrit.wikimedia.org/r/339182

Change 339182 merged by Marostegui:
realm.pp: Add __wmf_checksums table to be ignored

https://gerrit.wikimedia.org/r/339182

Mentioned in SAL (#wikimedia-operations) [2017-02-22T15:11:57Z] <marostegui> Restart MySQL on db1069 to apply new replication filters - https://phabricator.wikimedia.org/T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-22T15:21:46Z] <marostegui> Restart MySQL on db1095 to apply new replication filters - https://phabricator.wikimedia.org/T154485

I have restarted db1069 and db1095 and they have the new filter applied:

%.__wmf_checksums

Mentioned in SAL (#wikimedia-operations) [2017-02-23T07:59:14Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on logging table - T154485

I have started to run pt-table-checksum only for the logging table (17GB) and the expected time reported by pt-table-checksum is around 40 minutes. I am closely monitoring lag on the shard.
After that if all goes fine I will go for revision (23G)

It finished successfully in 44 minutes:

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:45:43      0      0 26747597  267478       0 2668.835 nlwiki.logging

I am going to go for revision now.

Mentioned in SAL (#wikimedia-operations) [2017-02-23T08:51:04Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on revision table - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-23T08:54:38Z] <marostegui> Stop pt-table-checksum on nlwiki.revision - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-23T09:09:27Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-23T09:23:36Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1036 - T154485 (duration: 00m 40s)

Some seconds after starting the pt-table-checksum for the revision table I had to kill it as both recentchanges slaves in eqiad and codfw (with the revision table partitioned of course) started to lag behind, to the point where I have had to decommission db1036 as it is lagging behind (1000 seconds now).
logging table is also partitioned and there was no lag there in any of those two hosts., so maybe for revision tables we have to go for a lower chunk (it is 100 now). The ETA for revision with chunk-size was 1:22h.
The crtl+c showed:

Checksumming nlwiki.revision:   0% 01:22:15 remain
^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:54:18      0      0   504100    5041       0  56.792 nlwiki.revision
TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-23T08:54:18      0      0   504100    5041

Which indeed shows chunks of 100 rows.

Once the server have caught up we can try to decrease the chunks to 10, keep db1036 depooled (db1063 is serving recentchanges now) and see what happens.

db1036 and db2035 were still lagging behind more than 30k:

˜/jynus 19:26> it is now doing a full table scan for every chunk

So we decided to add a replication filter on those two hosts to ignore that table and let the servers catch up:

root@PRODUCTION s2[(none)]> stop slave;
Query OK, 0 rows affected (26.81 sec)

root@PRODUCTION s2[(none)]> set global Replicate_Wild_Ignore_Table = "nlwiki.__wmf_checksums";
Query OK, 0 rows affected (0.00 sec)

root@PRODUCTION s2[(none)]> start slave;
Query OK, 0 rows affected (0.72 sec)

db1036 and db2035 were still lagging behind more than 30k:

˜/jynus 19:26> it is now doing a full table scan for every chunk

So we decided to add a replication filter on those two hosts to ignore that table and let the servers catch up:

root@PRODUCTION s2[(none)]> stop slave;
Query OK, 0 rows affected (26.81 sec)

root@PRODUCTION s2[(none)]> set global Replicate_Wild_Ignore_Table = "nlwiki.__wmf_checksums";
Query OK, 0 rows affected (0.00 sec)

root@PRODUCTION s2[(none)]> start slave;
Query OK, 0 rows affected (0.72 sec)

I am tempted to leave those replication filters there and try to run the pt-table-checksum again on Monday to see, if at least, it is a viable option to get the whole schema tested without any lag on the revision table (obviously those two slaves wouldn't get checked for consistency).

I wouldn't be against it, but I would ask to start puppetizing that- Something like a replication filter, and add those through hiera with a new production.my option. If the servers restart, we will create issues (I think replication options are lost).

I wouldn't be against it, but I would ask to start puppetizing that- Something like a replication filter, and add those through hiera with a new production.my option. If the servers restart, we will create issues (I think replication options are lost).

I am still not completely sure I want this for good, as it would mean making those servers (or all the recentchanges ones) even more special. We can start thinking about it though, so far let's see if the test goes fine on Monday morning and we can discuss how to proceed further.
Ideally we would be able to do
if role == recentchanges then replication filter for __wmf_checksum
Ideal world! :-)

Change 339609 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1036

https://gerrit.wikimedia.org/r/339609

Change 339609 merged by jenkins-bot:
db-eqiad.php: Repool db1036

https://gerrit.wikimedia.org/r/339609

Mentioned in SAL (#wikimedia-operations) [2017-02-24T09:48:14Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1036 - T154485 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-27T07:59:41Z] <marostegui> Run pt-table-checksum on s2 (nlwiki) on revision table - T154485

I am running pt-table-checksum again on s2 (keeping the replication filters for the rc slaves - db1036 and db2035.) I have added --no-check-replication-filters in order to bypass the rc slaves.
ETA for revision table to be checked 1:20h

The table has been checksummed already and there was no lag generated, it took around 1:30h in the end. Now it is checking for the differences on the slaves, and that is going to take a huge amount of time which is probably just too much for a 23G table:

Waiting to check replicas for differences:   0% 2+02:53:47 remain

It was increasing all the time.
I have ctrl+C and will probably run with PTDEBUG=1 to see what is actually doing and see if it is getting stuck somewhere, because it seems too much, and if it is true, there is no way we can run this on enwiki.

@Marostegui Are you checking the dbs with filters? It may be stuck trying to read the rows from those hosts, which will never arrive: http://kedar.nitty-witty.com/blog/pt-table-checksum-waiting-to-check-replicas-for-differences-0-0000-remain

@Marostegui Are you checking the dbs with filters? It may be stuck trying to read the rows from those hosts, which will never arrive: http://kedar.nitty-witty.com/blog/pt-table-checksum-waiting-to-check-replicas-for-differences-0-0000-remain

Yes, I was just checking the dsns table on tendril because I thought I had remove those two slaves when we added the filters on Thursday, but apparently I didn't so it was definitely checking them.

You can keep them on the replication list, so they cannot lag, but run it with --no-replicate-check, we can do the checks later, independently.

Once removed those two slaves from the dsns_s2 table, the check was completed successfully:

02-27T11:32:17      0      0 45398597  453988       0 5878.306 nlwiki.revision

Checking for differences reveals that all slaves have the same data in the revision table (obviously the rc slaves are excluded for this check, but given that they didn't have any differences for the rest of the tables, it is likely that for this table they are probably consistent as well).

I've got a list of tables per wiki from s2 that do not have a PK (unfortunately they are not the same across all the wikis, although they are pretty similar). And I have built a pt-table-checksum command per database and with the tables to ignore per database as well. So I would like to run it (one by one) for all s2 shard and see what we get.

Will start everyday in the morning to avoid going in the evening. This run will no longer split between tables, and will include revision and logging tables in the same batch. We haven't seen any delays, so it should be okay to run in, again, one database at the time.

Mentioned in SAL (#wikimedia-operations) [2017-02-28T07:12:09Z] <marostegui> run pt-table-checksum on bgwiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T08:24:52Z] <marostegui> run pt-table-checksum on bgwiktionary (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T08:59:33Z] <marostegui> run pt-table-checksum on cswiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T10:23:29Z] <marostegui> run pt-table-checksum on enwikiquote (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T10:53:32Z] <marostegui> run pt-table-checksum on enwiktionary (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-02-28T14:32:29Z] <marostegui> run pt-table-checksum on eowiki (s2) - T154485

The following wikis were checksummed in s2 during the day with no issues:

bgwiki - no differences
bgwiktionary - no differences
cswiki - no differences
enwikiquote - no differences
enwiktionary - db1047 has different data in the archive table
eowiki - stopped because it was getting to late and I didn't want it to run if I was not present (will --resume tomorrow, it was already on the revision table, which ETA was 40 minutes)

I have resumed the run on eowiki:

Resuming from eowiki.revision chunk 932, timestamp 2017-02-28 17:13:33
Checksumming eowiki.revision:   3% 13:40 remain

Mentioned in SAL (#wikimedia-operations) [2017-03-01T13:16:33Z] <marostegui> run pt-table-checksum on idwiki - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-01T15:45:52Z] <marostegui> Resume pt-table-checksum on idwiki (s2) - T154485

Wikis tested today:
eowiki - no differences
fiwiki - no differences
idwiki - stopped on the page table as it is pretty much the end of the day and I didn't want to leave it running, although at this point I am fairly confident that it wouldn't cause any issue, it has been running in the background and unattended the whole day.

Mentioned in SAL (#wikimedia-operations) [2017-03-02T07:09:51Z] <marostegui> Resume pt-table-checksum on idwiki (s2) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-02T08:27:11Z] <marostegui> Start pt-table-checksum on itwiki (s2)  - T154485

Today's checksums
idwiki - after resuming it today, db1047 has differences on one table:
idwiki.geo_tags 1 0 1 PRIMARY 988797 1916637

itwiki - dbstore1002 has differences on:

itwiki.wbc_entity_usage 26352 1 1 PRIMARY 4982283 4982418
itwiki.wbc_entity_usage 41520 -1 1 PRIMARY 6987621 6987766

Mentioned in SAL (#wikimedia-operations) [2017-03-03T08:22:25Z] <marostegui> Run pt-table-checksum on s2 (nowiki) - T154485

Mentioned in SAL (#wikimedia-operations) [2017-03-03T10:53:38Z] <marostegui> Start pt-table-checksum on plwiki (s2) - T154485

Today checksums:
nowiki - dbstore1002 and db1047 have differences:

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068

Differences on dbstore1002
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
nowiki.archive 5811 0 1 PRIMARY 631936 632068
nowiki.archive 6170 0 1 PRIMARY 681436 681558
nowiki.archive 6171 0 1 PRIMARY 681559 681716
nowiki.page_props 9346 -1 1 PRIMARY 1275171,1275171,wikibase_item 1275273,1275273,displaytitle
nowiki.page_props 9347 -1 1 PRIMARY 1275273,1275273,wikibase_item 1275466,1275466,defaultsort
nowiki.page_props 9507 -7 1 PRIMARY 1296037,1296037,wikibase_item 1296130,1296130,wikibase_item
nowiki.page_props 9509 -1 1 PRIMARY 1296213,1296213,defaultsort 1296312,1296312,wikibase_item
nowiki.page_props 9510 -5 1 PRIMARY 1296313,1296313,page_image_free 1296422,1296422,defaultsort
nowiki.page_props 9736 -1 1 PRIMARY 1319889,1319889,displaytitle 1320027,1320027,wikibase_item
nowiki.page_props 9737 -2 1 PRIMARY 1320028,1320028,page_image_free 1320129,1320129,wikibase_item
nowiki.page_props 9738 -14 1 PRIMARY 1320130,1320130,wikibase_item 1320231,1320231,wikibase_item
nowiki.page_props 9739 -2 1 PRIMARY 1320232,1320232,wikibase_item 1320303,1320303,defaultsort
nowiki.page_props 9740 -1 1 PRIMARY 1320303,1320303,page_image_free 1320400,1320400,wikibase_item
nowiki.page_props 9743 -3 1 PRIMARY 1320554,1320554,page_top_level_section_count 1320634,1320634,wikibase_item
nowiki.page_props 9746 -2 1 PRIMARY 1320728,1320728,defaultsort 1320820,1320820,wikibase_item
nowiki.page_props 9750 -1 1 PRIMARY 1321070,1321070,wikibase_item 1321125,1321125,page_image_free
nowiki.page_props 9802 -3 1 PRIMARY 1328742,1328742,wikibase_item 1328896,1328896,wikibase_item
nowiki.page_props 9803 -1 1 PRIMARY 1328897,1328897,defaultsort 1329046,1329046,wikibase_item
nowiki.page_props 9804 -8 1 PRIMARY 1329047,1329047,wikibase_item 1329190,1329190,wikibase_item
nowiki.page_props 9806 -6 1 PRIMARY 1329300,1329300,wikibase_item 1329429,1329429,displaytitle
nowiki.page_props 9808 -1 1 PRIMARY 1329599,1329599,wikibase_item 1329748,1329748,wikibase_item
nowiki.page_props 9811 -1 1 PRIMARY 1330082,1330082,wikibase_item 1330344,1330344,wikibase_item
nowiki.page_props 9812 -5 1 PRIMARY 1330345,1330345,wikibase_item 1330463,1330463,wikibase_item
nowiki.page_props 9818 -8 1 PRIMARY 1331053,1331053,wikibase_item 1331152,1331152,wikibase_item
nowiki.page_props 9819 -1 1 PRIMARY 1331153,1331153,wikibase_item 1331263,1331263,wikibase_item
nowiki.page_props 9820 -1 1 PRIMARY 1331265,1331265,page_top_level_section_count 1331392,1331392,wikibase_item
nowiki.page_props 9821 -3 1 PRIMARY 1331393,1331393,wikibase_item 1331510,1331510,wikibase_item
nowiki.page_props 9824 -1 1 PRIMARY 1331798,1331798,wikibase_item 1331907,1331907,wikibase_item
nowiki.page_props 9833 -1 1 PRIMARY 1332907,1332907,wikibase_item 1333008,1333008,page_image_free

plwiki has been stopped and will be resumed on Monday. Starting on Monday I will do all the wikis non-stop, as during this week there has been 0 problems while running them.

Mentioned in SAL (#wikimedia-operations) [2017-03-06T07:22:04Z] <marostegui> Resume pt-table-checksum on plwiki (s2) - T154485

I have started the checksum on plwiki again after all the crashes with db1060

Mentioned in SAL (#wikimedia-operations) [2017-03-08T07:27:12Z] <marostegui> Start pt-table-checksum on plwiki (s2) - T154485

It took a long run to finish plwiki, but here are the results.
differences in db1047 and quite lots of them in dbstore1002 too.

Marostegui moved this task from Backlog to In progress on the DBA board.Mar 13 2017, 9:52 AM

ptwiki:
differences on db1047 and dbstore1002.

The checksum on svwiki stopped due to db1054 going down for maintenance. I will resume once it is back up again.

I am sorry :-(.

No worries whatsoever! This is now a background task that doesn't require much babysitting, so no problem at all!

svwiki has no differences
thwiki - differences on db1047 and dbstore1002
trwiki - in progress now

db1047:

db      tbl     total_rows      chunks
enwiktionary    archive 100     1
idwiki  geo_tags        100     1
nlwiki  archive 200     2
nowiki  archive 100     1
plwiki  archive 500     5
ptwiki  geo_tags        100     1
thwiki  geo_tags        100     1
zhwiki  archive 100     1

Good and bad news:
{P5061}
There are no differences in data, but the primary key values differ, probably because those were inserted with the unsafe INSERT...SELECT and different locking patterns happened. We should make sure that is no longer happening.

I intend to run:

(echo "SET SESSION sql_log_bin=0; " && mysqldump --single-transaction -h db1018.eqiad.wmnet enwiktionary archive --where "ar_id >= 1068351 and ar_id <= 1068357" --no-create-info --skip-add-locks --replace | grep REPLACE ) | mysql -h db1047.eqiad.wmnet enwiktionary

to fix db1047:enwiktionary

I intend to run:

(echo "SET SESSION sql_log_bin=0; " && mysqldump --single-transaction -h db1018.eqiad.wmnet enwiktionary archive --where "ar_id >= 1068351 and ar_id <= 1068357" --no-create-info --skip-add-locks --replace | grep REPLACE ) | mysql -h db1047.eqiad.wmnet enwiktionary

to fix db1047:enwiktionary

Looks good to me - thanks!! :)

I have done the above, there is no longer a diff for db1047 on enwiktionary.archive. I will repeat the steps for all detected changes. Some important tables are skipped (text)- I may have to check those manually stopping a slave and exporting its contents.

The last database for s2 was checksummed and these are the results:

Differences on db1047
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
zhwiki.archive 19830 0 1 PRIMARY 2339129 2339333

There are also differences on dbstore1002.

So we can say that s2 has been checksummed entirely (for all the tables with PKs)

I have actually fixed all db1047 found differences. I am working now on dbstore1002.

dbstore1002:

$ mysql -h $slave nlwiki -e "SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl"
+--------------+------------------+------------+--------+
| db           | tbl              | total_rows | chunks |
+--------------+------------------+------------+--------+
| bgwiki       | page_props       |       8741 |     91 |
| cswiki       | archive          |        100 |      1 |
| cswiki       | page_props       |       3791 |     39 |
| enwiktionary | page_props       |      26740 |    338 |
| eowiki       | flaggedimages    |        757 |     10 |
| eowiki       | flaggedrevs      |       2665 |     27 |
| eowiki       | flaggedtemplates |       1578 |     22 |
| eowiki       | page_props       |       6171 |     63 |
| fiwiki       | flaggedimages    |        573 |      7 |
| fiwiki       | flaggedrevs      |      21657 |    219 |
| fiwiki       | flaggedtemplates |       1975 |     23 |
| fiwiki       | page_props       |       3385 |     35 |
| idwiki       | flaggedimages    |        193 |      2 |
| idwiki       | flaggedrevs      |      11470 |    116 |
| idwiki       | flaggedtemplates |        230 |      3 |
| idwiki       | geo_tags         |        100 |      1 |
| itwiki       | page_props       |       1978 |     20 |
| itwiki       | wbc_entity_usage |        200 |      2 |
| nlwiki       | archive          |        100 |      1 |
| nlwiki       | page_props       |        297 |      3 |
| nowiki       | archive          |        300 |      3 |
| nowiki       | page_props       |       2519 |     26 |
| plwiki       | flaggedimages    |        390 |      4 |
| plwiki       | flaggedrevs      |      15332 |    155 |
| plwiki       | flaggedtemplates |        702 |      8 |
| plwiki       | logging          |         99 |      1 |
| plwiki       | page_props       |       5873 |     61 |
| plwiki       | spoofuser        |         99 |      1 |
| plwiki       | user             |         99 |      1 |
| ptwiki       | archive          |        100 |      1 |
| ptwiki       | geo_tags         |        100 |      1 |
| ptwiki       | page_props       |        688 |      7 |
| thwiki       | geo_tags         |        100 |      1 |
| thwiki       | page_props       |        692 |      7 |
| trwiki       | flaggedimages    |       1930 |     23 |
| trwiki       | flaggedrevs      |      11182 |    113 |
| trwiki       | flaggedtemplates |       2335 |     33 |
| trwiki       | page_props       |        983 |     10 |
| zhwiki       | page_props       |        875 |      9 |
+--------------+------------------+------------+--------+

dbstore1001 - it is 24 hours behind, so it has to be checked tomorrow again:

root@dbstore1001[nlwiki]> SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
+--------+----------+------------+--------+
| db     | tbl      | total_rows | chunks |
+--------+----------+------------+--------+
| cswiki | geo_tags |          2 |      1 |
| idwiki | geo_tags |        100 |      1 |
| nlwiki | page     |        100 |      1 |
| nlwiki | revision |         99 |      1 |
| nlwiki | user     |        100 |      1 |
| ptwiki | geo_tags |        100 |      1 |
| thwiki | geo_tags |        100 |      1 |
+--------+----------+------------+--------+
7 rows in set (6 min 9.99 sec)

db1021 and db1024- no changes.

db1036.eqiad.wmnet:

$ mysql -h $slave nlwiki -e "SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM __wmf_checksums WHERE ( master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl"
+--------+---------+------------+--------+
| db     | tbl     | total_rows | chunks |
+--------+---------+------------+--------+
| nlwiki | archive |        800 |      8 |
+--------+---------+------------+--------+

I will focus on these so they can be decommissioned ASAP.

I finished dbstore1002 too (again, only for the tables with PKs) for s2, I will fix dbstore1001 now.

jcrespo claimed this task.Mar 16 2017, 2:26 PM

dbstore1001 is fixed for s2, again with the exception of tables without PKs and zhwiki, which is finishing at this point. Going with db1036.

db1036 is now ok. dbstore2001.codfw.wmnet was all ok, but it is missing several rows from bv20**_edits tables (they are empty there).

I have checked all s2 slaves, including codfw and those not part of the specific lists. All either didn't have differences or have been corrected. The following things are to be taken into account: db1036 not fully checked because it is a "special slave". Labs ones (including db1069) are in a bad state, even if having into account filter differences, but they are going to be decommissioned soon. codfw slaves are all ok, but MIXED replication on the codfw master could kick in (master is good, however). New labs uses row, so they cannot be checked reliably.

I will now try to check some extra important tables on s2, at least:

oldimage
text
user_props

Most of the rest are cache/summary tables/derived data from parsing, etc.

Marostegui renamed this task from run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038 to run pt-table-checksum on s2 (WAS: run pt-table-checksum before decommissioning db1015, db1035,db1044,db1038).Mar 24 2017, 9:03 AM
Marostegui updated the task description. (Show Details)

Checking text now, at least on the master, dbstore1002 and the hosts to be decommissioned:

$ while read db; do for host in db1018 db1024 db1021 db1036; do echo $host:$db; python3 compare.py dbstore1002 $host $db text old_id --step=10000; done; done < s2.dblist
db1018:bgwiki
No differences found.
db1024:bgwiki
No differences found.
db1021:bgwiki
No differences found.
db1036:bgwiki
No differences found.
db1018:bgwiktionary
...

I have to do a second pass for : idwiki itwiki nlwiki nowiki plwiki ptwiki svwiki trwiki zhwiki

All others had no differences- MySQL complained on those of comparing too many rows at a time, so I had to reduce the batch to 1000 rows (which will be slower).

Only diff:

db1036:svwiki
Rows are different WHERE old_id BETWEEN 27558001 AND 27559000
$ python3 compare.py dbstore1002 db1036 svwiki text old_id --from-value=27558001 --to-value=27559000 --step=1
Rows are different WHERE old_id BETWEEN 27558799 AND 27558799

They are not toys! They are useful:

$ ./sql.py -h dbstore1002.eqiad.wmnet svwiki -e "SELECT * FROM text WHERE old_id BETWEEN 27558799 AND 27558799" --no-dry-run

Results for dbstore1002.eqiad.wmnet:3306/svwiki:
+----------+-----------------+-------------+------------------------+---------------+------------+-----------------+-----------------+------------------+---------------------+---------------------+
|   old_id |   old_namespace | old_title   | old_text               | old_comment   |   old_user | old_user_text   | old_timestamp   |   old_minor_edit | old_flags           | inverse_timestamp   |
|----------+-----------------+-------------+------------------------+---------------+------------+-----------------+-----------------+------------------+---------------------+---------------------|
| 27558799 |               0 |             | DB://cluster25/XXXXXXX |               |          0 |                 |                 |                0 | utf-8,gzip,external |                     |
+----------+-----------------+-------------+------------------------+---------------+------------+-----------------+-----------------+------------------+---------------------+---------------------+
1 row in set (0.00 sec)

$ ./sql.py -h db1036.eqiad.wmnet svwiki -e "SELECT * FROM text WHERE old_id BETWEEN 27558799 AND 27558799" --no-dry-run

Results for db1036.eqiad.wmnet:3306/svwiki:
0 rows in set (0.00 sec)

text is now ok after reimport, oldimage, too (with some false positives due to some deleted revisions out of order).

I am going over user_props now, and I have detected something, but lost the output despite running on a screen- it could be a false positive or an actual problem. I will restart the check soon again.

jcrespo added a comment.EditedMar 31 2017, 11:15 AM

There are a lot of differences on user_properties between the master and dbstore1002 on:

itwiki
plwiki
svwiki

One (maybe) on:

zhwiki

This needs further research.

Differences on dbstore1002:

itwiki: --where "up_user BETWEEN 0 AND 1009999"
nowiki: --where "up_user BETWEEN 70000 AND 79999"
        --where "up_user BETWEEN 100000 AND 109999"
plwiki: --where "up_user BETWEEN 0 AND 59999"
svwiki: --where "up_user BETWEEN 0 AND 9999"
trwiki: --where "up_user BETWEEN 950000 AND 957402"

differences on db1047:

itwiki: --where "up_user BETWEEN 0 AND 1009999"
nowiki: --where "up_user BETWEEN 100000 AND 109999"
plwiki: --where "up_user BETWEEN 0 AND 59999"
svwiki: --where "up_user BETWEEN 0 AND 9999"

Mentioned in SAL (#wikimedia-operations) [2017-04-07T07:21:45Z] <jynus> reimporting several damaged db tables on s2 T154485

Everything seems ok or has been corrected (there were some false positives above). I am going to give a quick look at watchlist and then close this ticket.

Everything seems ok or has been corrected (there were some false positives above). I am going to give a quick look at watchlist and then close this ticket.

Thanks a lot for all the archeology work you've done. It is amazing.

db1018 and db1021 have the exact same watchlist table, we can now use the slave to compare it to the other servers (next week).

That is great, one step further to be able to decommission those hosts finally :-)

jcrespo closed this task as Resolved.EditedApr 11 2017, 3:02 PM

The modified task is for me complete- run pt-table-checksum on s2. I have not checked every table and every server- but I am mostly confident with the results we got and the corrections made to decommission the older servers.

Aside from pt-table-checksums on those servers where it is possible (not labs or non-row), I have checked non derivative data on all old servers, the master, db1054, db1047, dbstore1002 and codfw master.

Yaaaaaaaay!!! :-)
Thanks for the hard archeology work!