Hosts compared against db1103:
- db1015
- db1038
- db1044 (sanitarium2 master)
- db1077
- db1078
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T134476 Decommission old coredb machines (<=db1050) |
| Resolved | | Marostegui | T148078 Decommission db1015, db1035, db1044 and db1038 |
| Restricted Task | | | |
| Resolved | | Marostegui | T164488 Run pt-table-checksum on s3 |
| Resolved | | Cmjohnson | T177911 Decommission db1038 |
s3 is going to be interesting to run pt-table-checksum on, as not all the tables exist on all the wikis, per my last checks a few months ago.
I would start by doing a list of the most important tables we want to check across all the wikis and start with those.
What about:
revision, pagelinks, templatelinks, text, imagelinks, categorylinks, logging, archive, page, recentchanges
They all have a PK now \o/
Does that list sound good or are you missing something there?
PS: I have removed the Operations tag to avoid spamming them. I guess it was inherited from the parent task
I am going to start to get ready to run pt-table-checksum on the following tables: revision, pagelinks, templatelinks, text
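For reference, the per-table invocations might look roughly like the sketch below. This is illustrative only (the host DSN and flag values are assumptions, not the exact production command), so it prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch only: print a hypothetical pt-table-checksum command per table.
# The host DSN (s3-master.eqiad.wmnet) and flag values are placeholders.
: > /tmp/ptcc_cmds.txt
for t in revision pagelinks templatelinks text; do
  echo "pt-table-checksum h=s3-master.eqiad.wmnet --tables=$t --no-check-binlog-format --chunk-time=0.5" \
    | tee -a /tmp/ptcc_cmds.txt
done
```

In a real run, `--chunk-time` bounds how long each chunk query may take, which matters on a shard as loaded as s3.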
Mentioned in SAL (#wikimedia-operations) [2017-08-16T06:40:37Z] <marostegui> Run pt-table-checksum on s3 for revision table - T164488
I don't think it is worth it to run pt-table-checksum on s3 right now: https://logstash.wikimedia.org/goto/e7f445600a027bae3e6633ca00a71172
and
T173365
We should just stop db1015, db1028 and db1031, plus one of the new hosts, in sync and run mydumper, then run some parallel diff.
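The mydumper-plus-diff idea can be sketched as below. The per-table dump files here are fabricated stand-ins for real mydumper output (in practice each host's dump would go to its own directory and the corresponding files get diffed, possibly in parallel):

```shell
#!/bin/sh
# Rough sketch of the per-table diff step over two mydumper output
# directories. Paths and file contents are fabricated for illustration.
set -e
a=/tmp/dump_a; b=/tmp/dump_b
rm -rf "$a" "$b"; mkdir -p "$a" "$b"
# Stand-ins for mydumper's per-table .sql files:
printf 'INSERT INTO revision VALUES (1);\n' > "$a/enwiki.revision.sql"
printf 'INSERT INTO revision VALUES (1);\n' > "$b/enwiki.revision.sql"
printf 'INSERT INTO archive VALUES (1);\n'  > "$a/abwiki.archive.sql"
printf 'INSERT INTO archive VALUES (2);\n'  > "$b/abwiki.archive.sql"
# Diff every table dump and record the ones that differ:
: > /tmp/drifts.txt
for f in "$a"/*.sql; do
  t=$(basename "$f")
  if ! diff -q "$f" "$b/$t" > /dev/null 2>&1; then
    echo "$t" >> /tmp/drifts.txt
  fi
done
cat /tmp/drifts.txt
```

Only tables whose dumps differ end up in the drift list, which is what then gets inspected by hand.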
None of those are s3, why would you think running it on s3 could affect s6 or s5?
> and
> T173365
I saw the raid issue, but I didn't think it would affect it.
> We should just stop db1015, db1028 and db1031, plus one of the new hosts, in sync and run mydumper, then run some parallel diff.
ok
Mentioned in SAL (#wikimedia-operations) [2017-08-16T10:46:44Z] <marostegui> Stop replication in sync on db1015 and db1078 - T164488
I have run the first diff for all the revision tables between db1078 and db1015 and there are no differences.
Next table to be checked: pagelinks
Mentioned in SAL (#wikimedia-operations) [2017-08-17T05:48:03Z] <marostegui> Stop replication in sync on db1078 and db1015 - T164488
db1015 has differences with db1078 on the following tables:
abwiki.querycache_info azbwiki.archive cswikiversity.archive enwikinews.thread_history enwikisource.filearchive enwikivoyage.archive fawikivoyage.archive gomwiki.archive hewikisource.filearchive hewikivoyage.archive iswikisource.filearchive lrcwiki.archive ruwikivoyage.archive
The master and the other hosts, however, look like db1078 and not like db1015. Some differences were found on dbstore1001 and dbstore1002, which, in some of the above cases, look like db1015 rather than like the rest.
In a few isolated cases db1077 is the only host that matches db1015 (but not in all of them), so that host might have been cloned from db1015 at some point, and db1015 might then have had issues and ended up with different data? Who knows.
If these issues really are this isolated, I would do a full dump (which is already done, thanks Jaime!), store it safely, and then proceed to get rid of db1015 without any more delay.
This host has been depooled for months, so that data has not been read in months. Some of the involved timestamps are from 2013 even.
How confident are you about the other tables: did you check all of them, or only some? All databases, or only the ones on the dblists? Can you tell me more about that?
I have diffed all the tables across all the databases and the ones I mentioned above were the ones with differences across the board.
Of the tables with differences, so far I have manually checked the following databases:
lrcwiki
abwiki
azbwiki
gomwiki
This is interesting, because I see on abwiki.querycache_info:
5c5
< (test"Ancientpages","20170816050008"),
---
> ("Ancientpages","20170816050008"),

Such a change does not exist on the live databases:

neodymium:~$ mysql -BN -h db1015.eqiad.wmnet abwiki -e "SELECT * FROM querycache_info ORDER BY qci_type" > db1015.abwiki.querycache_info.txt
neodymium:~$ mysql -BN -h db1078.eqiad.wmnet abwiki -e "SELECT * FROM querycache_info ORDER BY qci_type" > db1078.abwiki.querycache_info.txt
neodymium:~$ diff db1015.abwiki.querycache_info.txt db1078.abwiki.querycache_info.txt
And understandably so, because that would be a syntax error. Did you manually modify abwiki.querycache_info to test that the diff was working and then forget to undo it? I do not remember doing it, but I would be worried if it wasn't you, because it would mean some kind of SQL injection in our dumps :-/
Sorry, that was an "easter egg" I left to make sure the diff would actually produce some output :)
Thanks, that is good news :-) it means mydumper works and we do not have a backdoor in our code breaking backups. Please fix it so we can archive db1015 properly, as you propose; I agree. Maybe we should tar it after the manual checks to avoid consuming too many inodes.
I have fixed it:
root@dbstore1001:/srv/tmp/db1015# zdiff abwiki.querycache_info.sql.gz ../db1078/abwiki.querycache_info.sql.gz
root@dbstore1001:/srv/tmp/db1015#
And I have tar'ed it:
/srv/tmp/db1015.tar.gz
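The check-then-archive step can be sketched as follows. Paths and file contents are stand-ins (not the real /srv/tmp dumps), and it uses gunzip plus diff rather than zdiff so the sketch has no extra dependencies:

```shell
#!/bin/sh
# Minimal sketch: verify two gzipped table dumps match, then tar the
# directory to save inodes. All paths and contents are placeholders.
set -e
rm -rf /tmp/db1015_demo /tmp/db1078_demo
mkdir -p /tmp/db1015_demo /tmp/db1078_demo
row='("Ancientpages","20170816050008"),'
printf '%s\n' "$row" | gzip > /tmp/db1015_demo/abwiki.querycache_info.sql.gz
printf '%s\n' "$row" | gzip > /tmp/db1078_demo/abwiki.querycache_info.sql.gz
# Compare the uncompressed contents (what zdiff does); diff exits 0 on match
gunzip -c /tmp/db1015_demo/abwiki.querycache_info.sql.gz > /tmp/a.txt
gunzip -c /tmp/db1078_demo/abwiki.querycache_info.sql.gz > /tmp/b.txt
diff /tmp/a.txt /tmp/b.txt
# Archive the verified dump directory as a single tarball
tar -czf /tmp/db1015_demo.tar.gz -C /tmp db1015_demo
```

With `set -e`, any mismatch makes diff exit non-zero and aborts before the tarball is created, so only a verified dump gets archived.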
Mentioned in SAL (#wikimedia-operations) [2017-10-11T08:37:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1103 - T164488 (duration: 00m 47s)
Mentioned in SAL (#wikimedia-operations) [2017-10-11T08:37:54Z] <marostegui> Stop replication in sync on db1103 and db1038 to checksum their data - T164488
I am checksumming db1103 (which has db1035's data) against db1038's data (which is also on db1072, as db1072 was cloned from db1038).
Mentioned in SAL (#wikimedia-operations) [2017-10-13T08:57:30Z] <marostegui> Stop db1103 and db1038 in sync for more checksumming - T164488
Currently fixing inconsistencies on db1072 and db1038 (even though the latter will be decommissioned, it only takes a few commands to fix it too), and checking all the remaining hosts for the drifting values.
Given the number of wikis to check/fix, this will take some time ;-)
The majority of the issues are on tag_summary and change_tag tables. There are also issues with archive on some wikis.
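One way to prepare the fixes for those tables is a pt-table-sync dry run. This is illustrative only: the host and wiki names are placeholders, and with `--print` the tool would only show the REPLACE/DELETE statements rather than execute them, so the sketch just prints the commands:

```shell
#!/bin/sh
# Illustrative only: print pt-table-sync dry-run commands for the drifting
# tables. Host (db1072) and wiki (abwiki) are placeholders here.
: > /tmp/ptsync_cmds.txt
for t in tag_summary change_tag archive; do
  echo "pt-table-sync --print --sync-to-master h=db1072.eqiad.wmnet,D=abwiki,t=$t" \
    | tee -a /tmp/ptsync_cmds.txt
done
```

Reviewing the printed statements first is what makes this safe on a depooled replica before anything is actually written.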
Change 384448 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072 to fix data drifts
Change 384448 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072 to fix data drifts
Mentioned in SAL (#wikimedia-operations) [2017-10-16T07:22:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1072 - T164488 (duration: 00m 46s)
Mentioned in SAL (#wikimedia-operations) [2017-10-16T07:30:34Z] <marostegui> Stop db1103 and db1072 at the same position to fix data drifts - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-16T07:56:30Z] <marostegui> Stop db1103 and db1038 at the same position to fix data drifts - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-17T06:07:33Z] <marostegui> Stop replication in sync on db1072 and db1103 for data drift fixing - T164488
I am fairly confident that all the drifts have been corrected on s3 (I will do another, final check tomorrow).
But I would love to do a final check on db1044 (the sanitarium master) too; that would imply stopping replication for a couple of hours to run mydumper there and generating 2-3 hours of lag on s3, so I am not sure about it.
Actually, we can do it: we can use one of the currently depooled hosts as the new sanitarium master. Or we do not need to stop replication on db1044; some lag could happen on the labsdbs, but seconds of it, not hours.
s/we do not need to stop replication on db1044/only stop it to sync it to other host/. I can do either for you, if you want.
Actually, I know what I will do:
I will rebuild db1103 with db1044's data coming from mydumper (no need to stop replication) and then checksum it there with no rush :-)
Once that is verified we can repoint sanitarium to db1072 and we are done.
Mentioned in SAL (#wikimedia-operations) [2017-10-18T05:48:05Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 49s)
Change 385149 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072
Change 385149 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1072
Mentioned in SAL (#wikimedia-operations) [2017-10-19T08:13:02Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1072 - T164488 (duration: 00m 50s)
Mentioned in SAL (#wikimedia-operations) [2017-10-19T08:13:52Z] <marostegui> Stop replication in sync on db1103 and db1072 - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-19T15:14:33Z] <marostegui> Stop replication in sync on db1103 and db1072 - https://phabricator.wikimedia.org/T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-20T05:52:33Z] <marostegui> Stop replication in sync on db1103 and db1072 - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:06:18Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1072 - T164488 (duration: 00m 46s)
db1044 and db1072 have been checksummed and we can now move db1095 (Sanitarium) under db1072 whenever we like, so we can decom db1044.
Change 385319 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078 to fix data drifts
Change 385319 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078 to fix data drifts
Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:15:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 45s)
Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:17:42Z] <marostegui> Stop replication in sync on db1103 and db1078 - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-20T06:26:25Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)
Mentioned in SAL (#wikimedia-operations) [2017-10-23T06:30:05Z] <marostegui> Stop replication in sync on db1103 and db2018 to fix data drifts - T164488
Change 385942 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078
Change 385942 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078
Mentioned in SAL (#wikimedia-operations) [2017-10-23T06:47:48Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 47s)
Mentioned in SAL (#wikimedia-operations) [2017-10-23T06:51:27Z] <marostegui> Stop replication in sync on db1078 and db1103 to checksum data - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-23T14:31:19Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)
Change 386128 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078
Change 386128 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078
Mentioned in SAL (#wikimedia-operations) [2017-10-24T06:22:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T164488 (duration: 00m 45s)
Mentioned in SAL (#wikimedia-operations) [2017-10-24T07:28:56Z] <marostegui> Stop replication in sync on db1078 and db1103 to fix data drifts - https://phabricator.wikimedia.org/T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-24T07:40:32Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 - T164488 (duration: 00m 46s)
Mentioned in SAL (#wikimedia-operations) [2017-10-24T07:46:37Z] <marostegui> Stop replication in sync on db1103 and db2018 to fix data drifts - T164488
Change 386337 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077
Change 386337 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077
Mentioned in SAL (#wikimedia-operations) [2017-10-25T05:42:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1077 - T164488 (duration: 01m 00s)
Mentioned in SAL (#wikimedia-operations) [2017-10-25T05:44:40Z] <marostegui> Stop replication in sync on db1103 and db1077 to checksum data - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-25T10:33:15Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1077 - T164488 (duration: 00m 50s)
db1077 has been checksummed - going to start fixing a few data drifts and we should be good to close this \o/
Change 386578 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077
Change 386578 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1077
Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:12:56Z] <marostegui> Stop replication in sync on db1103 and db1077 to fix data drifts - T164488
Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:13:27Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1077 - T164488 (duration: 00m 50s)
Mentioned in SAL (#wikimedia-operations) [2017-10-26T07:28:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1077 - T164488 (duration: 00m 50s)