Decommission db1053
Closed, Resolved, Public

Description

db1053 will be replaced by db1072; after that it can be fully decommissioned.

We still need to:

  • Add pending grants/data to db1072
  • Change m3-slave CNAME to db1072
  • Move backups to db1072
  • Change failover candidate on proxies
  • Failover m3 master to db1072
  • Also decommission db1059 (T196606)
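
For the CNAME step, a minimal sketch of how the change could be verified once the DNS patch is deployed. The helper name `cname_points_to` and the `eqiad.wmnet` domain in the usage comment are assumptions for illustration; the task does not give the full record names.

```shell
# Hypothetical helper: succeeds when the CNAME target returned by a resolver
# matches the expected host. Both names are lowercased and a trailing dot is
# stripped before comparing.
cname_points_to() {
    actual=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    expected=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
    [ "${actual%.}" = "${expected%.}" ]
}

# Usage (needs network access; the record and host names are assumed):
# cname_points_to "$(dig +short CNAME m3-slave.eqiad.wmnet)" db1072.eqiad.wmnet && echo "CNAME ok"
```

Keeping the comparison separate from the `dig` call means the check can be exercised without touching DNS.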

Decommission Checklist

  • - all system services confirmed offline from production use - should be done by DBA team set as spare https://gerrit.wikimedia.org/r/440140
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (disabled alerts)
  • - remove system from all lvs/pybal active configuration - should be done by DBA team not in dblists
  • - any service group puppet/hiera/dsh config removed - should be done by DBA team not in hiera
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.) - should be done by DBA team: https://gerrit.wikimedia.org/r/440140

START NON-INTERRUPTIBLE STEPS - please assign to @RobH for the non-interruptible steps

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw-a-eqiad:ge-2/0/9
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key

END NON-INTERRUPTIBLE STEPS
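
The host-side part of the block above could be sketched as a dry run that only prints commands. This is a rough sketch, not the procedure actually used on this task: the function name `print_decom_steps`, the FQDN `db1053.eqiad.wmnet`, the disable message and the use of `poweroff` are assumptions, and the switch port and DNS steps are not covered. The debmonitor command is the one from the checklist.

```shell
# Dry-run sketch: prints the commands for the host-side steps and runs nothing.
# $1 = host FQDN (assumed form), $2 = task id used in the puppet disable message.
print_decom_steps() {
    host_fqdn=$1
    task=$2
    cat <<EOF
# on ${host_fqdn}:
sudo puppet agent --disable "decommission ${task}"
sudo poweroff
# on the puppet master:
sudo puppet node clean ${host_fqdn}
sudo puppet node deactivate ${host_fqdn}
# on neodymium/sarin (command taken from the checklist):
sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${host_fqdn} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key
EOF
}

# Usage: print_decom_steps db1053.eqiad.wmnet T194634
```

Printing rather than executing keeps the order reviewable before anything irreversible happens.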

  • - mark disk #3 as non usable - must be degaussed for erasure - it has smart errors
  • - mark disk #8 as non usable - must be degaussed for erasure - it has smart errors
  • - mark disk #10 as non usable - must be degaussed for erasure - it has smart errors
  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

jcrespo created this task.

Change 433141 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Move m3 backups from db1053 to db1072

https://gerrit.wikimedia.org/r/433141

Change 433141 merged by Jcrespo:
[operations/puppet@production] mariadb: Move m3 backups from db1053 to db1072

https://gerrit.wikimedia.org/r/433141

Grants seem fixed. I am not moving the following ones, as they may be unused:

  • fabmigrate
  • bzmigrate
  • rtmigrate

Change 433175 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/dns@master] mariadb: Move m3-slave from db1053 to db1072

https://gerrit.wikimedia.org/r/433175

Change 433175 merged by Jcrespo:
[operations/dns@master] mariadb: Move m3-slave from db1053 to db1072

https://gerrit.wikimedia.org/r/433175

Change 433180 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Make db1072, and not db1053, the passive m3 failover

https://gerrit.wikimedia.org/r/433180

Change 433180 merged by Jcrespo:
[operations/puppet@production] mariadb: Make db1072, and not db1053, the passive m3 failover

https://gerrit.wikimedia.org/r/433180

jcrespo added a subscriber: mmodell.

@mmodell heads up: a failover of the Phabricator database is coming up (no action needed from you).

Let's make sure we label this disk, somehow, as broken when we decommission this host - so it is not reused in the future to replace other disks:

Enclosure Device ID: 32
Slot Number: 10

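
To find such disks before the host is unracked, a small filter along these lines might help. It assumes the two lines above come from a MegaCli-style physical drive listing (for example `megacli -PDList -aALL`; the task does not say which tool was used) and that the listing includes the usual `Media Error Count`, `Other Error Count` and `Predictive Failure Count` fields. The function name `list_suspect_slots` is made up for this sketch.

```shell
# Reads a MegaCli-style physical drive listing on stdin and prints the slot
# number of every drive with a non-zero media, other or predictive-failure
# counter. Each slot is printed at most once.
list_suspect_slots() {
    awk -F': *' '
        /^[ \t]*Slot Number/ {
            slot = $2
            gsub(/[ \t\r]+$/, "", slot)   # drop trailing whitespace
            flagged = 0
        }
        /^[ \t]*(Media Error Count|Other Error Count|Predictive Failure Count)/ {
            if ($2 + 0 > 0 && !flagged) { print slot; flagged = 1 }
        }
    '
}

# Usage (binary name and flags are an assumption):
# sudo megacli -PDList -aALL | list_suspect_slots
```

The slots it prints are candidates to label as broken so they are not reused as replacement disks.
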
jcrespo updated the task description.
jcrespo added a subscriber: RobH.

Change 438004 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] dblists: Remove db1059 and db1053 for decommission

https://gerrit.wikimedia.org/r/438004

Change 438004 merged by Jcrespo:
[operations/software@master] dblists: Remove db1059 and db1053 for decommission

https://gerrit.wikimedia.org/r/438004

Change 440140 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Remove refereces to db1053 and db1059 and set them as spare

https://gerrit.wikimedia.org/r/440140

Mentioned in SAL (#wikimedia-operations) [2018-06-13T15:25:27Z] <jynus> stopping db1053 and db1059 in preparation for decomm T194634 T196606

Change 440140 merged by Jcrespo:
[operations/puppet@production] mariadb: Remove references to db1053 and db1059 and set them as spare

https://gerrit.wikimedia.org/r/440140

jcrespo updated the task description.
jcrespo moved this task from In progress to Done on the DBA board.
jcrespo edited projects, added decommission-hardware; removed Patch-For-Review.
Vvjjkkii renamed this task from Decommission db1053 to t0caaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii removed RobH as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from t0caaaaaaa to Decommission db1053. Jul 1 2018, 6:44 PM
Marostegui assigned this task to Cmjohnson.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.

Change 446903 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom prod dns for db1053

https://gerrit.wikimedia.org/r/446903

Change 446903 merged by RobH:
[operations/dns@master] decom prod dns for db1053

https://gerrit.wikimedia.org/r/446903

Change 446904 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom db1053

https://gerrit.wikimedia.org/r/446904

Change 446904 merged by RobH:
[operations/puppet@production] decom db1053

https://gerrit.wikimedia.org/r/446904

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (eqiad) on the decommission-hardware board.
RobH added a project: ops-eqiad.
Cmjohnson updated the task description.