
Run pt-table-checksum on s3
Closed, Resolved · Public

Description

Hosts compared against db1103:

  • db1015
  • db1038
  • db1044 (sanitarium2 master)
  • db1077
  • db1078

Event Timeline

jcrespo added a parent task: Restricted Task. May 4 2017, 1:53 PM

s3 is going to be interesting to run pt-table-checksum on, as not all the tables exist on all the wikis, per my last checks some months ago.
I would start by making a list of the most important tables we want to check across all the wikis and start with those.
What about:

revision
pagelinks
templatelinks
text
imagelinks
categorylinks
logging
archive
page
recentchanges

They all have a PK now \o/

Does that list sound good or are you missing something there?

PS: I have removed the Operations tag to avoid spamming them. I guess it was inherited from the parent task.

Marostegui moved this task from Triage to Next on the DBA board. May 5 2017, 7:35 AM
Marostegui moved this task from Next to Backlog on the DBA board. Jul 12 2017, 7:37 AM
Marostegui moved this task from Backlog to Next on the DBA board. Jul 14 2017, 7:26 AM
Marostegui moved this task from Next to In progress on the DBA board.

I am going to start to get ready to run pt-table-checksum on the following tables: revision, pagelinks, templatelinks, text
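For context, a per-table pt-table-checksum run typically looks like the sketch below. The master host name, database, and flags are assumptions for illustration (the exact invocation used here is not recorded in this task); the commands are echoed rather than executed against any server.

```shell
# Hedged sketch: one pt-table-checksum run per table. Host and
# database names (s3-master.example, abwiki) are hypothetical.
cmds=$(for table in revision pagelinks templatelinks text; do
  echo pt-table-checksum \
    --replicate=percona.checksums \
    --databases=abwiki \
    --tables="$table" \
    --no-check-binlog-format \
    h=s3-master.example
done)
echo "$cmds"
```

pt-table-checksum writes per-chunk checksums into the `--replicate` table on the master; replicas recompute them as the statements replicate, so data drifts show up as checksum mismatches on each replica.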

Mentioned in SAL (#wikimedia-operations) [2017-08-16T06:40:37Z] <marostegui> Run pt-table-checksum on s3 for revision table - T164488

jcrespo added a comment (Edited). Aug 16 2017, 9:40 AM

I don't think it is worth it to run pt-table-checksum on s3 right now: https://logstash.wikimedia.org/goto/e7f445600a027bae3e6633ca00a71172
and
T173365

We should just stop db1015, db1028 and db1031, plus one of the new hosts, in sync and run mydumper, then run some parallel diff.

I don't think it is worth it to run pt-table-checksum on s3 right now: https://logstash.wikimedia.org/goto/e7f445600a027bae3e6633ca00a71172

None of those are s3; why would you think running it on s3 could affect s6 or s5?

and
T173365

I saw the raid issue, but I didn't think it would affect it.

We should just stop db1015, db1028 and db1031, plus one of the new hosts, in sync and run mydumper, then run some parallel diff.

ok
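"Stopping replication in sync" here means halting both replicas at exactly the same master binlog position, so their datasets are directly comparable. A minimal sketch of the idea follows; the host names, binlog file name, and position are made-up placeholders, and the SQL is printed rather than run against any server.

```shell
# Print the SQL for stopping two replicas at the same position.
# The binlog file/position values below are hypothetical.
sql=$(cat <<'SQL'
-- On the first replica (e.g. db1015): stop and note where it is
STOP SLAVE;
SHOW SLAVE STATUS\G  -- read Relay_Master_Log_File / Exec_Master_Log_Pos

-- On the second replica (e.g. db1078): catch up to that exact
-- position, then stop there automatically
STOP SLAVE;
START SLAVE UNTIL MASTER_LOG_FILE='master-bin.001234', MASTER_LOG_POS=56789;
SQL
)
echo "$sql"
```

With both replicas stopped at the same position, any difference found between them is a genuine data drift rather than replication lag.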

Mentioned in SAL (#wikimedia-operations) [2017-08-16T10:46:44Z] <marostegui> Stop replication in sync on db1015 and db1078 - T164488

I have run the first diff for all the revision tables between db1078 and db1015 and there are no differences.
Next table to be checked: pagelinks
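The dump-and-diff workflow above can be sketched as follows. The mydumper step needs live servers, so it is shown only as a comment; the per-table comparison loop is demonstrated on two tiny local directories standing in for the two hosts' dump output (all paths and file contents are hypothetical).

```shell
# mydumper step (requires live servers; shown for shape only):
#   mydumper -h db1015.eqiad.wmnet -o /tmp/ck/db1015 --threads 4
#   mydumper -h db1078.eqiad.wmnet -o /tmp/ck/db1078 --threads 4

# Stand-in dump directories so the comparison loop is runnable:
mkdir -p /tmp/ck/db1015 /tmp/ck/db1078
printf '("Ancientpages","20170816050008")\n' > /tmp/ck/db1015/abwiki.querycache_info.sql
printf '("Ancientpages","20170816050008")\n' > /tmp/ck/db1078/abwiki.querycache_info.sql

# Compare each table dump; collect only files whose contents differ.
drifts=""
for f in /tmp/ck/db1015/*.sql; do
  diff -q "$f" "/tmp/ck/db1078/$(basename "$f")" >/dev/null \
    || drifts="$drifts $(basename "$f")"
done
echo "drifting tables:$drifts"   # empty list when the dumps match
```

The loops can be run in parallel (one per database) since each table dump is an independent file pair.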

Marostegui updated the task description. Aug 16 2017, 1:05 PM

The diff between db1015 and db1078 for the pagelinks table revealed no differences.

Marostegui updated the task description. Aug 16 2017, 6:23 PM

Mentioned in SAL (#wikimedia-operations) [2017-08-17T05:48:03Z] <marostegui> Stop replication in sync on db1078 and db1015 - T164488

Marostegui updated the task description. Aug 17 2017, 5:51 AM
Marostegui updated the task description. Aug 17 2017, 7:46 AM
Marostegui updated the task description.
Marostegui updated the task description. Aug 17 2017, 12:57 PM

db1015 has some differences on the following tables with db1078:

abwiki.querycache_info
azbwiki.archive
cswikiversity.archive
enwikinews.thread_history
enwikisource.filearchive
enwikivoyage.archive
fawikivoyage.archive
gomwiki.archive
hewikisource.filearchive
hewikivoyage.archive
iswikisource.filearchive
lrcwiki.archive
ruwikivoyage.archive

The master and the other hosts, however, look like db1078 and not like db1015. Some differences are found on dbstore1001 and dbstore1002, which, in some of the above cases, look like db1015 and not like the rest.
In some isolated cases db1077 is the only host that looks like db1015 (but not in all cases), so that host might have been cloned from db1015 at some point, and then db1015 had issues and ended up with different data? Who knows.

If these issues are just that isolated, I would do a full dump (which is already done, thanks Jaime!), store it safely, and then proceed to get rid of db1015 without any more delay.
This host has been depooled for months, so that data has not been read in months. Some of the involved timestamps are even from 2013.

How confident are you about the other tables: did you check all the others, or only some of them? All dbs, or only the ones on the dblists? Can you tell me more about that?

I have diffed all the tables across all the databases and the ones I mentioned above were the ones with differences across the board.
From those tables that have differences, so far I have manually checked the differences for the following databases:

lrcwiki
abwiki
azbwiki
gomwiki

This is interesting, because I see on abwiki.querycache_info:

5c5
< (test"Ancientpages","20170816050008"),
---
> ("Ancientpages","20170816050008"),

Such change does not exist on the live databases:

neodymium:~$ mysql -BN -h db1015.eqiad.wmnet abwiki -e "SELECT * FROM querycache_info ORDER BY qci_type" > db1015.abwiki.querycache_info.txt
neodymium:~$ mysql -BN -h db1078.eqiad.wmnet abwiki -e "SELECT * FROM querycache_info ORDER BY qci_type" > db1078.abwiki.querycache_info.txt
neodymium:~$ diff db1015.abwiki.querycache_info.txt db1078.abwiki.querycache_info.txt

And understandably so, because that would be a syntax error. Did you manually modify abwiki.querycache_info to test that the diff was working and then forget to undo it? I do not remember doing it, but I would get worried if it wasn't you, because it would mean some kind of SQL injection in our dumps :-/


Sorry, that was an "easter egg" I left, to make sure the diff would actually send some output :)

Sorry, that was an "easter egg" I left, to make sure the diff would actually send some output :)

Thanks, that is good news :-) it means mydumper works and we do not have a backdoor in our code breaking backups. Please fix it so we can archive db1015 properly, as you propose, which I agree with. Maybe we should tar it after the manual checks to avoid consuming too many inodes.


I have fixed it:

root@dbstore1001:/srv/tmp/db1015# zdiff abwiki.querycache_info.sql.gz ../db1078/abwiki.querycache_info.sql.gz
root@dbstore1001:/srv/tmp/db1015#

And I have tar'ed it:
/srv/tmp/db1015.tar.gz

Marostegui updated the task description. Sep 25 2017, 12:24 PM
Marostegui moved this task from In progress to Next on the DBA board. Oct 5 2017, 11:13 AM

Mentioned in SAL (#wikimedia-operations) [2017-10-11T08:37:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1103 - T164488 (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-10-11T08:37:54Z] <marostegui> Stop replication in sync on db1103 and db1038 to checksum their data - T164488

I am checksumming db1103 (which has db1035's data) with db1038's data (which is also on db1072 as db1072 was cloned from db1038)

Mentioned in SAL (#wikimedia-operations) [2017-10-13T08:57:30Z] <marostegui> Stop db1103 and db1038 in sync for more checksumming - T164488

Marostegui moved this task from Next to In progress on the DBA board. Oct 13 2017, 11:41 AM
Marostegui added a comment (Edited). Oct 13 2017, 12:06 PM

Currently fixing inconsistencies on db1072 and db1038 (even though it will be decommissioned, it takes just a few commands to get that one fixed too) and checking all the rest of the hosts for the values that are drifting.
Given the number of wikis to check/fix, this will take some time ;-)

The majority of the issues are on tag_summary and change_tag tables. There are also issues with archive on some wikis.

Change 384448 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072 to fix data drifts

https://gerrit.wikimedia.org/r/384448

Change 384448 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072 to fix data drifts

https://gerrit.wikimedia.org/r/384448

Mentioned in SAL (#wikimedia-operations) [2017-10-16T07:22:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1072 - T164488 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-10-16T07:30:34Z] <marostegui> Stop db1103 and db1072 at the same position to fix data drifts - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-16T07:56:30Z] <marostegui> Stop db1103 and db1038 at the same position to fix data drifts - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-17T06:07:33Z] <marostegui> Stop replication in sync on db1072 and db1103 for data drift fixing - T164488

Marostegui updated the task description. Oct 17 2017, 2:25 PM

I am fairly confident that all the drifts have been corrected on s3 (I will do another final check tomorrow).
I would love to do a final check on db1044 (sanitarium master), but that would imply stopping replication for a couple of hours to run mydumper there and generating 2-3 hours of lag on s3... so I am not sure about it.

Actually, we can do it: we can use one of the currently depooled hosts as the new sanitarium master. Or we do not need to stop replication on db1044; some lag could happen on labsdbs, but of seconds, not hours.

s/we do not need to stop replication on db1044/only stop it to sync it to another host/. I can do either for you, if you want.

Actually, we can do it: we can use one of the currently depooled hosts as the new sanitarium master. Or we do not need to stop replication on db1044; some lag could happen on labsdbs, but of seconds, not hours.

Actually, I know what I will do:
I will rebuild db1103 with db1044's data coming from mydumper (no need to stop replication) and then checksum it there with no rush :-)

Once that is verified we can repoint sanitarium to db1072 and we are done.
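Rebuilding one host from another's mydumper output is roughly a dump followed by a myloader restore. A hedged sketch, echoed rather than executed; the dump directory path and thread count are assumptions, not the exact commands used here.

```shell
# Hypothetical rebuild of db1103 from db1044's data; printed only.
rebuild=$(
  echo mydumper -h db1044.eqiad.wmnet -o /srv/tmp/db1044-dump --threads 4
  echo myloader -h db1103.eqiad.wmnet -d /srv/tmp/db1044-dump --threads 4 --overwrite-tables
)
echo "$rebuild"
```

Because mydumper takes a consistent snapshot and records the binlog position in its metadata file, the source does not need to stop replicating while the dump is taken.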

Mentioned in SAL (#wikimedia-operations) [2017-10-18T05:48:05Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 49s)

Change 385149 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072

https://gerrit.wikimedia.org/r/385149

Change 385149 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072

https://gerrit.wikimedia.org/r/385149

Mentioned in SAL (#wikimedia-operations) [2017-10-19T08:13:02Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1072 - T164488 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2017-10-19T08:13:52Z] <marostegui> Stop replication in sync on db1103 and db1072 - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-19T15:14:33Z] <marostegui> Stop replication in sync on db1103 and db1072 - https://phabricator.wikimedia.org/T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-20T05:52:33Z] <marostegui> Stop replication in sync on db1103 and db1072 - T164488

Marostegui updated the task description. Oct 20 2017, 6:02 AM

Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:06:18Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 46s)

db1044 and db1072 have been checksummed and we can now move db1095 (Sanitarium) under db1072 whenever we like, so we can decom db1044.

Change 385319 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078 to fix data drifts

https://gerrit.wikimedia.org/r/385319

Change 385319 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078 to fix data drifts

https://gerrit.wikimedia.org/r/385319

Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:15:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:17:42Z] <marostegui> Stop replication in sync on db1103 and db1078 - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:26:25Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-10-23T06:30:05Z] <marostegui> Stop replication in sync on db1103 and db2018 to fix data drifts - T164488

Change 385942 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078

https://gerrit.wikimedia.org/r/385942

Change 385942 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078

https://gerrit.wikimedia.org/r/385942

Mentioned in SAL (#wikimedia-operations) [2017-10-23T06:47:48Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-10-23T06:51:27Z] <marostegui> Stop replication in sync on db1078 and db1103 to checksum data - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-23T14:31:19Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)

Change 386128 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078

https://gerrit.wikimedia.org/r/386128

Change 386128 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078

https://gerrit.wikimedia.org/r/386128

Mentioned in SAL (#wikimedia-operations) [2017-10-24T06:22:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2017-10-24T07:28:56Z] <marostegui> Stop replication in sync on db1078 and db1103 to fix data drifts - https://phabricator.wikimedia.org/T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-24T07:40:32Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)

Marostegui updated the task description. Oct 24 2017, 7:41 AM

Mentioned in SAL (#wikimedia-operations) [2017-10-24T07:46:37Z] <marostegui> Stop replication in sync on db1103 and db2018 to fix data drifts - T164488

Change 386337 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/386337

Change 386337 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/386337

Mentioned in SAL (#wikimedia-operations) [2017-10-25T05:42:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1077 - T164488 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2017-10-25T05:44:40Z] <marostegui> Stop replication in sync on db1103 and db1077 to checksum data - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-25T10:33:15Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1077 - T164488 (duration: 00m 50s)

db1077 has been checksummed - going to start fixing a few data drifts and we should be good to close this \o/

Change 386578 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/386578

Change 386578 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077

https://gerrit.wikimedia.org/r/386578

Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:12:56Z] <marostegui> Stop replication in sync on db1103 and db1077 to fix data drifts - T164488

Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:13:27Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1077 - T164488 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:28:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1077 - T164488 (duration: 00m 50s)

Marostegui closed this task as Resolved. Oct 26 2017, 10:24 AM
Marostegui updated the task description.

Everything has been checksummed.
I believe s3 is in good shape now.