Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet
Closed, ResolvedPublic

Description

The following two hosts are ready to be decommissioned

dbstore2001

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as it will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
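
The DebMonitor removal step above is normally handled by wmf-decommission-host. If it ever has to be issued by hand, the command line could be assembled as in the sketch below. The function name `debmonitor_delete_argv` is hypothetical, the URL and certificate paths are copied from the step itself, and the command is only built and printed here, never executed.

```python
import shlex
from typing import List

# Values copied from the checklist step; treat them as assumptions
# about the environment the command would run in.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/{fqdn}"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_argv(host_fqdn: str) -> List[str]:
    """Build (but do not run) the curl command that removes a host from DebMonitor."""
    # Reject values that would produce a malformed URL.
    if not host_fqdn or "/" in host_fqdn or any(c.isspace() for c in host_fqdn):
        raise ValueError(f"not a plausible FQDN: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL.format(fqdn=host_fqdn),
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for review before running it manually.
    print(shlex.join(debmonitor_delete_argv("dbstore2001.codfw.wmnet")))
```

Building the argument list instead of interpolating into a shell string keeps the host name from being interpreted by the shell.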

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

dbstore2002

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps cannot be interrupted, as it will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - Label the BBU as broken so it doesn't get re-used T214264
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

jcrespo created this task.Apr 3 2019, 3:19 PM
Restricted Application added a subscriber: Aklapper.Apr 3 2019, 3:19 PM
Marostegui added a comment.EditedApr 3 2019, 3:21 PM

When creating the final task to decommission dbstore2002, please make sure to add a step to label the BBU as broken to the DC-Ops onsite steps.
Thanks!

jcrespo updated the task description. (Show Details)Apr 3 2019, 3:25 PM
jcrespo renamed this task from Decomission dbstore1001, dbstore2001, dbstore2002 and es2001-4 hosts* to Decommission dbstore1001, dbstore2001, dbstore2002 and es2001-4 hosts*.Apr 3 2019, 3:44 PM
jcrespo claimed this task.
jcrespo triaged this task as Medium priority.
jcrespo moved this task from Triage to In progress on the DBA board.

Change 507944 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backups: Decommission dbstore1001, dbstore2001 and dbstore2002

https://gerrit.wikimedia.org/r/507944

Would you mind leaving dbstore1001 as the final host?
I have just emailed Chase and John to sync up about some temporary data stored on dbstore1001, to check whether it needs to be moved somewhere else or can be deleted for good. Waiting for their reply.

Thanks!

Does T220002#5158901 conflict with setting it as spare? I wanted to set it as spare soon-ish, decom later.

No, no conflict. Spare is good!

jcrespo renamed this task from Decommission dbstore1001, dbstore2001, dbstore2002 and es2001-4 hosts* to Decommission dbstore1001, dbstore2001, dbstore2002.May 6 2019, 9:33 AM
jcrespo updated the task description. (Show Details)

Change 507944 merged by Jcrespo:
[operations/puppet@production] backups: Decommission dbstore1001, dbstore2001 and dbstore2002

https://gerrit.wikimedia.org/r/507944

Change 508565 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbstores: Set the right role for dbstores to be spare

https://gerrit.wikimedia.org/r/508565

Change 508565 merged by Jcrespo:
[operations/puppet@production] dbstores: Set the right role for dbstores to be spare

https://gerrit.wikimedia.org/r/508565

Mentioned in SAL (#wikimedia-operations) [2019-05-07T12:45:26Z] <jynus> remove dbstore1001, dbstore2001, dbstore2002 from tendril and zarcillo T220002

MySQL and Prometheus have been stopped on the above hosts. This is almost ready; the only thing pending is to wait some time and see if there is anything we would like to keep from these old hosts.

I got the green light from Chase via email to decom these hosts.

@Marostegui Ok if I send them to DC Ops?

jcrespo reassigned this task from jcrespo to Cmjohnson.Jun 5 2019, 2:21 PM
jcrespo moved this task from In progress to Blocked external/Not db team on the DBA board.
jcrespo added a project: DC-Ops.
jcrespo added subscribers: Cmjohnson, RobH.

@RobH @Cmjohnson These 3 hosts are ready to be decommissioned. Alerts have been disabled, roles deleted (the hosts still have the spare role), and services stopped. Data should be securely deleted, as these hosts used to handle sensitive data (database backups). Some of the disks are newer than the hosts, so feel free to save those if you think they could be useful.

We can help with decom tasks if asked, but this is mostly in your hands now.

jcrespo reassigned this task from Cmjohnson to RobH.Jun 5 2019, 2:26 PM
jcrespo updated the task description. (Show Details)
jcrespo updated the task description. (Show Details)
jcrespo updated the task description. (Show Details)Jun 5 2019, 2:28 PM
jcrespo updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)Jun 5 2019, 2:30 PM

I've marked the hosts as "decommissioning". The template and the wiki seem to be outdated, and it is unclear what to do.

I've moved them to active as per volans' advice.

Assigning this to myself to indicate that I am using dbstore1001 for a few days to temporarily store the contents of db1112 (test cluster data). Once I have finished, I will reassign it back to Rob.

Please do not use dbstores, use dbprov instead.

It is temporary and won't last more than 2 days, but OK.

Marostegui reassigned this task from Marostegui to RobH.Jun 18 2019, 6:20 AM

cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: dbstore2001.codfw.wmnet

  • dbstore2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 543035 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Remove dbstore2001 references

https://gerrit.wikimedia.org/r/543035

Change 543036 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Remove production DNS entries for dbstore2001

https://gerrit.wikimedia.org/r/543036

Change 543035 merged by Marostegui:
[operations/puppet@production] mariadb: Remove dbstore2001 references

https://gerrit.wikimedia.org/r/543035

Change 543036 merged by Marostegui:
[operations/dns@master] wmnet: Remove production DNS entries for dbstore2001

https://gerrit.wikimedia.org/r/543036

Marostegui updated the task description. (Show Details)Oct 15 2019, 7:45 AM

Change 545429 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Remove puppet references for dbstore2002

https://gerrit.wikimedia.org/r/545429

Change 545430 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Remove production DNS entries for dbstore2002

https://gerrit.wikimedia.org/r/545430

cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: dbstore2002.codfw.wmnet

  • dbstore2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 545429 merged by Marostegui:
[operations/puppet@production] mariadb: Remove puppet references for dbstore2002

https://gerrit.wikimedia.org/r/545429

Change 545430 merged by Marostegui:
[operations/dns@master] wmnet: Remove production DNS entries for dbstore2002

https://gerrit.wikimedia.org/r/545430

Marostegui renamed this task from Decommission dbstore1001, dbstore2001, dbstore2002 to Decommission dbstore200.codfw.wmnet and dbstore2002.codfw.wmnet.Oct 23 2019, 6:50 AM
Marostegui reassigned this task from RobH to Papaul.
Marostegui edited projects, added ops-codfw; removed Patch-For-Review, Goal, DBA.
Marostegui updated the task description. (Show Details)
Restricted Application added a project: Operations.Oct 23 2019, 6:50 AM

These two hosts are ready for switch disablement and on-site steps

Marostegui renamed this task from Decommission dbstore200.codfw.wmnet and dbstore2002.codfw.wmnet to Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet.Oct 23 2019, 6:52 AM

papaul@asw-a-codfw# show | compare
[edit interfaces interface-range disabled]
     member ge-5/0/16 { ... }
+    member ge-6/0/16;
[edit interfaces]
-   ge-6/0/16 {
-       description dbstore2001;
-       enable;
-   }
papaul@asw-c-codfw# show | compare 
[edit interfaces interface-range vlan-private1-c-codfw]
-    member ge-6/0/11;
[edit interfaces interface-range disabled]
     member ge-6/0/19 { ... }
+    member ge-6/0/11;
[edit interfaces]
-   ge-6/0/11 {
-       description dbstore2002;
-       enable;
-   }
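
Both diffs above follow the same pattern: the per-port stanza (description and enable) is deleted and the port is added to the `disabled` interface-range. A small script can confirm that every port whose stanza was removed also ended up in that range. This is a sketch only: the function name `summarize_compare` is hypothetical, and the parsing handles just the line shapes seen in these two diffs rather than `show | compare` output in general.

```python
import re
from typing import Dict, Set, Tuple

# Diff copied from the asw-a-codfw output on this task.
SAMPLE = """\
[edit interfaces interface-range disabled]
     member ge-5/0/16 { ... }
+    member ge-6/0/16;
[edit interfaces]
-   ge-6/0/16 {
-       description dbstore2001;
-       enable;
-   }
"""

MEMBER_ADDED = re.compile(r"^\+\s+member\s+(\S+);$")
STANZA_REMOVED = re.compile(r"^-\s+(\S+)\s+\{$")
DESC_REMOVED = re.compile(r"^-\s+description\s+(.+);$")


def summarize_compare(diff_text: str) -> Tuple[Set[str], Dict[str, str]]:
    """Return (ports added to 'disabled', {removed port: its old description})."""
    disabled: Set[str] = set()
    removed: Dict[str, str] = {}
    context = ""
    current_port = None
    for raw in diff_text.splitlines():
        line = raw.strip()
        if line.startswith("[edit") and line.endswith("]"):
            context = line[1:-1]  # e.g. "edit interfaces interface-range disabled"
            current_port = None
            continue
        match = MEMBER_ADDED.match(line)
        if match and context == "edit interfaces interface-range disabled":
            disabled.add(match.group(1))
            continue
        match = STANZA_REMOVED.match(line)
        if match and context == "edit interfaces":
            current_port = match.group(1)
            removed[current_port] = ""
            continue
        match = DESC_REMOVED.match(line)
        if match and current_port is not None:
            removed[current_port] = match.group(1)
    return disabled, removed


if __name__ == "__main__":
    disabled_ports, removed_ports = summarize_compare(SAMPLE)
    for port, description in removed_ports.items():
        state = "disabled" if port in disabled_ports else "NOT disabled"
        print(f"{port} ({description}): stanza removed, {state}")
```
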
Papaul updated the task description. (Show Details)Oct 28 2019, 2:10 PM
Papaul moved this task from Backlog to Decommission on the ops-codfw board.Oct 31 2019, 12:21 AM
Papaul updated the task description. (Show Details)Nov 5 2019, 4:45 PM
Papaul updated the task description. (Show Details)Nov 5 2019, 8:37 PM

Change 548882 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for dbstore200[1-2]

https://gerrit.wikimedia.org/r/548882

Change 548882 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt DNS for dbstore200[1-2]

https://gerrit.wikimedia.org/r/548882

Papaul updated the task description. (Show Details)Nov 5 2019, 8:59 PM

complete

Papaul closed this task as Resolved.Nov 5 2019, 9:00 PM