
Decommission db1009
Closed, Resolved · Public

Description

db1009 was the old m5 master, and it was failed over to db1073.
Wait a few days and then proceed to decommission it.

Decommission Checklist

  • - all system services confirmed offline from production use - should be done by DBA team https://gerrit.wikimedia.org/r/#/c/420293/
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration - should be done by DBA team
  • - any service group puppet/hiera/dsh config removed - should be done by DBA team
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.) - should be done by DBA team: https://gerrit.wikimedia.org/r/#/c/420295/
  • - ping @chasemp to get the ACL for 10.64.0.13 cleaned up T189216#4060598 (and IRC)

START NON-INTERRUPTIBLE STEPS - please assign to @RobH for the non-interruptible steps

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw-a-eqiad:ge-2/0/8
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
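
As an illustration of the "puppet node clean, puppet node deactivate" step in the non-interruptible block above, here is a minimal Python sketch that only builds the two command lines for review and does not execute anything. It assumes the step maps directly onto the `puppet node clean` and `puppet node deactivate` subcommands, as the checklist wording suggests. The helper name `puppet_cleanup_commands` and the FQDN are hypothetical; the task refers to the host only as db1009.

```python
import shlex
from typing import List


def puppet_cleanup_commands(fqdn: str) -> List[List[str]]:
    """Build (but do not run) the two Puppet cleanup commands for a host."""
    if not fqdn or " " in fqdn:
        raise ValueError(f"not a usable host name: {fqdn!r}")
    return [
        ["puppet", "node", "clean", fqdn],       # remove the node's cert and cached data
        ["puppet", "node", "deactivate", fqdn],  # mark the node inactive in PuppetDB
    ]


if __name__ == "__main__":
    # Hypothetical FQDN, used purely for illustration.
    for argv in puppet_cleanup_commands("db1009.example.org"):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try anywhere; whoever owns the non-interruptible steps would run them on the puppet master.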

Event Timeline

Marostegui triaged this task as Normal priority. Mar 8 2018, 4:11 PM
Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Change 417300 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] site.pp: db1009 is not a master anymore

https://gerrit.wikimedia.org/r/417300

Change 417300 merged by Marostegui:
[operations/puppet@production] site.pp: db1009 is not a master anymore

https://gerrit.wikimedia.org/r/417300

Marostegui added a comment. Edited Mar 8 2018, 5:02 PM

A backup of this host is placed at: es2001:/srv/backups/older/m5/db1009_binary_copy/db1009.tar.gz

Due to T189403, I have started mysql on db1009 and connected it to db1073 temporarily. It is only listening on 127.0.0.1 so nothing can actually connect to it, but I wanted to have it replicating from db1073 until db1073 gets the disk replaced and the RAID rebuilt. Better be safe :-)

I don't see it replicating on tendril (I didn't touch it there), is that expected?

@jcrespo is it ok to proceed with this or you're still checking it?

If you don't mind leaving it like that for some more time, so I can run pt-table-checksum on all misc sections?

Sure - that's perfectly ok! :-)

Restricted Application added a project: Operations. Mar 14 2018, 5:04 PM
jcrespo claimed this task. Mar 14 2018, 5:06 PM

wait, robh, I will take this for now- not yet ready for decom.

RobH added a comment. Mar 14 2018, 5:07 PM

I wasn't taking it, was merely tagging in all decom requests with #hw-requests. I left it assigned to @Marostegui ;]

Yeah I never use the other tags till we have it ready from the DBA side, to avoid all the noise for the DC Ops :)

Oh, if they want the noise, they will get it here :-P

RobH added a comment. Mar 14 2018, 6:59 PM

Yeah I never use the other tags till we have it ready from the DBA side, to avoid all the noise for the DC Ops :)

no worries, I wasn't sure so I added it. If it is something where you guys don't want us aware until it's ready for us to work on, the project can be added later. (I just didn't want it forgotten ;)

Marostegui added a comment. Edited Mar 17 2018, 7:17 AM

So, the checks finished and there were differences on the testreduce_0715.results table (173GB), between the following rows:

40590911 40650121

I have confirmed that this is not a pt-table-checksum false positive by dumping those rows with mysqldump and doing a diff.

There are 22 rows that differ (out of around 64M - so not too bad). It is probably not worth fixing those at all, so what I am going to do is:

  • Back up a binary copy of db1009
  • Back up a logical copy of the testreduce_0715.results table
  • Back up a diff of the affected rows

After that I will go ahead and decommission this host.
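
The false-positive check described above (dump the suspect rows from both hosts, then diff them) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure actually run on the hosts: it assumes each row is available as a tuple whose first element is an integer key, and the function name `differing_keys` is hypothetical. The key range would correspond to the boundaries reported above (40590911 to 40650121).

```python
from typing import Iterable, List, Tuple

Row = Tuple  # one table row; by assumption, element 0 is the integer key


def differing_keys(rows_a: Iterable[Row], rows_b: Iterable[Row],
                   lo: int, hi: int) -> List[int]:
    """Return the keys in [lo, hi] whose rows differ or exist on one side only."""
    a = {r[0]: r for r in rows_a if lo <= r[0] <= hi}
    b = {r[0]: r for r in rows_b if lo <= r[0] <= hi}
    # A key missing on one side yields None from .get(), which never equals a row.
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    # Toy data standing in for the two dumps; not real table contents.
    host_a = [(1, "a"), (2, "b"), (3, "c"), (10, "z")]
    host_b = [(1, "a"), (2, "B"), (4, "d"), (10, "y")]
    print(differing_keys(host_a, host_b, 1, 5))
```

A non-empty result for the flagged range means the checksum mismatch reflects real row differences rather than a tooling artifact.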

Backup a binary copy of db1009: es2001:/srv/backups/older/m5/db1009_binary_copy/db1009.tar.gz
Backup a logical copy of testreduce_0715.results table: es2001:/srv/backups/older/m5/db1009_results_table/db1009_testreduce_0715_results.sql
Backup a diff of the affected differences: es2001:/srv/backups/older/m5/diff_db1009_db1073.sql

Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:45:48Z] <marostegui> Stop MySQL on db1009 - T189216

Change 420293 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db1009 from config

https://gerrit.wikimedia.org/r/420293

Change 420293 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Remove db1009 from config

https://gerrit.wikimedia.org/r/420293

Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:54:32Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Remove db1009 from config - T189216 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:55:38Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Remove db1009 from config - T189216 (duration: 00m 57s)

Marostegui updated the task description. Mar 19 2018, 9:56 AM

Change 420295 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Set db1009 to spare

https://gerrit.wikimedia.org/r/420295

Change 420296 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] m5.hosts: Remove db1009

https://gerrit.wikimedia.org/r/420296

Change 420296 merged by jenkins-bot:
[operations/software@master] m5.hosts: Remove db1009

https://gerrit.wikimedia.org/r/420296

Mentioned in SAL (#wikimedia-operations) [2018-03-19T10:25:09Z] <marostegui> Remove db1009 from tendril - T189216

Change 420295 merged by Marostegui:
[operations/puppet@production] mariadb: Set db1009 to spare

https://gerrit.wikimedia.org/r/420295

Marostegui updated the task description. Mar 19 2018, 10:27 AM

@chasemp can you please proceed and remove the ACL for db1009 now?

Marostegui reassigned this task from jcrespo to RobH. Mar 19 2018, 10:28 AM
Marostegui updated the task description.
Marostegui moved this task from In progress to Done on the DBA board.

This host is now ready for DC Ops decommissioning, so assigning it to @RobH

So, the checks finished and there were differences on the testreduce_0715.results table (173GB), between the following rows:

40590911 40650121

I have confirmed that this is not a pt-table-checksum false positive by dumping those rows with mysqldump and doing a diff.
There are 22 rows that differ (out of around 64M - so not too bad). It is probably not worth fixing those at all, so what I am going to do is:

  • Back up a binary copy of db1009
  • Back up a logical copy of the testreduce_0715.results table
  • Back up a diff of the affected rows

After that I will go ahead and decommission this host.

I have talked to @ssastry and it has been confirmed that small drift isn't an issue.

RobH updated the task description. Mar 20 2018, 7:12 PM

Change 420821 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom db1009

https://gerrit.wikimedia.org/r/420821

Change 420821 merged by RobH:
[operations/puppet@production] decom db1009

https://gerrit.wikimedia.org/r/420821

Change 420822 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom db1009 prod dns entries

https://gerrit.wikimedia.org/r/420822

Change 420822 merged by RobH:
[operations/dns@master] decom db1009 prod dns entries

https://gerrit.wikimedia.org/r/420822

RobH reassigned this task from RobH to Cmjohnson. Mar 20 2018, 7:23 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to Decommission on the ops-eqiad board.

ready for onsite disk wipe and completion of steps

Cmjohnson updated the task description. Mar 22 2018, 4:10 PM

Change 421567 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns for db1009

https://gerrit.wikimedia.org/r/421567

Change 421567 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns for db1009

https://gerrit.wikimedia.org/r/421567

Cmjohnson closed this task as Resolved. Mar 23 2018, 4:38 PM
Cmjohnson updated the task description.