
Decommission labservices1001 & labservices1002
Closed, Resolved (Public)

Description

This task will track the decommission of servers labservices1001 and labservices1002.

The "Steps for service owner" below should be completed by the service owner who is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

These servers are 5+ years old; decommission them and remove them from the rack due to age.

labservices1001:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
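
As an illustration of the DebMonitor removal step above, the following sketch assembles the same curl invocation for a given host. It is not the actual tooling: the helper name `debmonitor_delete_argv` is hypothetical, the URL and certificate paths are copied from the checklist item, and the sketch only builds the command without running it (the checklist notes that wmf-decommission-host handles this step).

```python
import shlex
from typing import List

# Values copied from the checklist item; treat them as site-specific.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_argv(host_fqdn: str) -> List[str]:
    """Build (but do not run) the curl command that deletes a host entry.

    Hypothetical helper for illustration only.
    """
    # Reject values that would produce a malformed URL path.
    if not host_fqdn or "/" in host_fqdn or any(c.isspace() for c in host_fqdn):
        raise ValueError(f"not a usable FQDN: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        f"{DEBMONITOR_HOSTS_URL}/{host_fqdn}",
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print the shell-quoted command for review before anyone runs it.
    print(shlex.join(debmonitor_delete_argv("labservices1001.wikimedia.org")))
```

Returning an argument list rather than a single string keeps the FQDN from being reinterpreted by a shell if the command is later passed to a process runner.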

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with unracking and state of offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

labservices1002:

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with unracking and state of offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
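
The checklists above branch on whether a host is decommissioned or kept as a spare. A minimal sketch of that branching, assuming only what the checklist states (netbox status "Inventory" for decom, "Planned" for spare, and the "IF DECOM" items applying only to decommissioned hosts), might look like the following. The names `Disposition`, `netbox_status`, and `remaining_steps` are hypothetical and not part of any existing tool.

```python
from enum import Enum
from typing import List


class Disposition(Enum):
    DECOM = "decom"
    SPARE = "spare"


def netbox_status(disposition: Disposition) -> str:
    """Netbox status to set during the non-interrupt steps, per the checklist."""
    return "Inventory" if disposition is Disposition.DECOM else "Planned"


# Step wording taken from the checklist items after "End non-interrupt steps".
COMMON_STEPS = ["system disks wiped (by onsite)"]
DECOM_ONLY_STEPS = [
    "system unracked and decommissioned (by onsite), netbox updated to offline",
    "switch port configuration removed from switch once system is unracked",
    "add system to decommission tracking google sheet",
    "mgmt dns entries removed",
]


def remaining_steps(disposition: Disposition) -> List[str]:
    """Steps left after the non-interrupt block for the given disposition."""
    steps = list(COMMON_STEPS)
    if disposition is Disposition.DECOM:
        steps.extend(DECOM_ONLY_STEPS)
    return steps


if __name__ == "__main__":
    for step in remaining_steps(Disposition.DECOM):
        print("-", step)
```

For the two hosts in this task, both 5+ years old, the `DECOM` branch applies.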

network port info:
labservices1001 : asw2-d-eqiad:ge-3/0/9
labservices1002 : asw2-a-eqiad:ge-4/0/12
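
Because the switch port assignment is recorded so the port configuration can be removed later, it can help to read these lines programmatically. The sketch below parses the `host : switch:interface` layout used above; the `SwitchPort` type and `parse_port_line` function are hypothetical names, and the layout is assumed from these two lines only.

```python
from typing import NamedTuple


class SwitchPort(NamedTuple):
    host: str
    switch: str
    interface: str


def parse_port_line(line: str) -> SwitchPort:
    """Parse a line such as 'labservices1001 : asw2-d-eqiad:ge-3/0/9'."""
    # Split on the first two colons: host, switch, then the interface name.
    parts = [part.strip() for part in line.split(":", 2)]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected port line: {line!r}")
    return SwitchPort(*parts)


if __name__ == "__main__":
    for raw in (
        "labservices1001 : asw2-d-eqiad:ge-3/0/9",
        "labservices1002 : asw2-a-eqiad:ge-4/0/12",
    ):
        print(parse_port_line(raw))
```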

Event Timeline

Andrew created this task. Apr 25 2019, 2:13 PM

Change 506428 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Move labservices1001/1002 to role::spare and clean up

https://gerrit.wikimedia.org/r/506428

Change 506428 merged by Andrew Bogott:
[operations/puppet@production] Move labservices1001/1002 to role::spare and clean up

https://gerrit.wikimedia.org/r/506428

Andrew updated the task description. Apr 26 2019, 5:17 PM
Andrew reassigned this task from Andrew to RobH. Apr 26 2019, 5:39 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-30T07:24:08Z] <marostegui> Remove labservices1001 and labservices1002 from tendril T221857

RobH renamed this task from Decommission labservices1001, 1002 to Decommission labservices1001 & labservices1002. Apr 30 2019, 3:54 PM
RobH triaged this task as Normal priority.
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH updated the task description.
RobH added a comment. Apr 30 2019, 3:57 PM

network info:

labservices1001 : asw2-d-eqiad:ge-3/0/9

labservices1002 : asw2-a-eqiad:ge-4/0/12

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labservices1001.wikimedia.org

  • labservices1001.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labservices1002.wikimedia.org

  • labservices1002.wikimedia.org
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

Change 507354 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom labservices100[12] references

https://gerrit.wikimedia.org/r/507354

Change 507354 merged by RobH:
[operations/puppet@production] decom labservices100[12] references

https://gerrit.wikimedia.org/r/507354

Change 507356 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom labservices100[12] prod dns

https://gerrit.wikimedia.org/r/507356

Change 507356 merged by RobH:
[operations/dns@master] decom labservices100[12] prod dns

https://gerrit.wikimedia.org/r/507356

RobH reassigned this task from RobH to Cmjohnson. Apr 30 2019, 4:09 PM
RobH updated the task description.
RobH edited projects, added ops-eqiad; removed Patch-For-Review.
RobH moved this task from Backlog to Decommission on the ops-eqiad board.
RobH added a subscriber: RobH.
Jclark-ctr updated the task description. Jul 31 2019, 4:49 PM
Jclark-ctr updated the task description. Jul 31 2019, 4:52 PM
Jclark-ctr updated the task description. Aug 21 2019, 6:00 PM

Change 538106 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt ip for decom labservices100[1-2]

https://gerrit.wikimedia.org/r/538106

Change 538106 merged by Cmjohnson:
[operations/dns@master] Removing mgmt ip for decom labservices100[1-2]

https://gerrit.wikimedia.org/r/538106

Cmjohnson closed this task as Resolved. Sep 19 2019, 8:54 PM
Cmjohnson updated the task description.