decom tungsten
Closed, Resolved · Public

Description

This task is to fully remove tungsten from production once the parent task is resolved.

This task will track the decommission-hardware of server tungsten.eqiad.wmnet.

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

tungsten.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry; replacing it with role(spare::system) is recommended to ensure services are offline, but is not strictly required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign task from service owner to DC ops team member depending on site of server.
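
The cookbook invocation in the checklist above can be sketched as a small helper that assembles the command line before it is run on a cumin host. This is only an illustrative sketch: `build_decom_command` is a hypothetical name, not part of the cookbook tooling, the validation rules are assumptions, and `T000000` is a placeholder because this page does not show the task number.

```python
import re
import shlex
from typing import List

# Assumption: Phabricator task IDs look like "T" followed by digits.
_TASK_RE = re.compile(r"^T\d+$")


def build_decom_command(host_fqdn: str, phab_task: str) -> List[str]:
    """Assemble `cookbook sre.hosts.decommission <host fqdn> -t <phab task>`.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    if "." not in host_fqdn:
        raise ValueError(f"expected a fully qualified name, got {host_fqdn!r}")
    if not _TASK_RE.match(phab_task):
        raise ValueError(f"expected a task ID like 'T12345', got {phab_task!r}")
    return ["cookbook", "sre.hosts.decommission", host_fqdn, "-t", phab_task]


if __name__ == "__main__":
    # "T000000" is a placeholder task ID, not the real one for this task.
    argv = build_decom_command("tungsten.eqiad.wmnet", "T000000")
    print(shlex.join(argv))
```

Rejecting a short hostname up front is a design choice for the sketch, since the checklist asks for the host FQDN.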

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks wiped (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag
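
The age rule in the DC-Ops steps above (under 5 years reclaimed, over 5 years decommissioned) could be expressed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the age is assumed to be measured from a purchase date, and a system that is exactly 5 years old is treated as a decommission, which the checklist does not specify.

```python
from datetime import date

# Threshold taken from the checklist: under 5 years means reclaim.
RECLAIM_MAX_AGE_YEARS = 5


def age_in_years(purchase_date: date, today: date) -> int:
    """Whole years elapsed between purchase_date and today."""
    years = today.year - purchase_date.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (purchase_date.month, purchase_date.day):
        years -= 1
    return years


def disposition(purchase_date: date, today: date) -> str:
    """Return 'reclaim' or 'decommission' for a powered-down system.

    Assumption: exactly 5 years old counts as 'decommission'.
    """
    if age_in_years(purchase_date, today) < RECLAIM_MAX_AGE_YEARS:
        return "reclaim"
    return "decommission"


if __name__ == "__main__":
    # Hypothetical purchase dates, for illustration only.
    print(disposition(date(2018, 1, 1), date(2020, 9, 2)))
    print(disposition(date(2014, 6, 1), date(2020, 9, 2)))
```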

Event Timeline

Dzahn created this task. Aug 13 2020, 10:28 PM
Restricted Application removed a project: Patch-For-Review. Aug 13 2020, 10:28 PM

Change 620128 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf: remove the xhgui_old_host parameter

https://gerrit.wikimedia.org/r/620128

Change 620129 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] remove tungsten from site, DHCP and partman

https://gerrit.wikimedia.org/r/620129

Change 620130 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] delete role::xhgui::app

https://gerrit.wikimedia.org/r/620130

Change 620131 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base: remove tungsten from check-microcode.py

https://gerrit.wikimedia.org/r/620131

Dzahn triaged this task as Medium priority. Aug 13 2020, 10:40 PM
dpifke moved this task from Inbox to Doing on the Performance-Team board. Aug 13 2020, 11:47 PM

Change 620128 merged by Dzahn:
[operations/puppet@production] webperf: remove the xhgui_old_host parameter

https://gerrit.wikimedia.org/r/620128

Dzahn updated the task description. Aug 20 2020, 10:02 PM

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: tungsten.eqiad.wmnet

  • tungsten.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 620129 merged by Dzahn:
[operations/puppet@production] remove tungsten from site, DHCP and partman

https://gerrit.wikimedia.org/r/620129

Change 620130 merged by Dzahn:
[operations/puppet@production] delete role::xhgui::app

https://gerrit.wikimedia.org/r/620130

Change 621606 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cumin: update xhgui alias to apply to new role name

https://gerrit.wikimedia.org/r/621606

Change 621606 merged by Dzahn:
[operations/puppet@production] cumin: update xhgui alias to apply to new role name

https://gerrit.wikimedia.org/r/621606

Change 620131 merged by Dzahn:
[operations/puppet@production] base: remove tungsten from check-microcode.py

https://gerrit.wikimedia.org/r/620131

Change 621608 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera: remove hosts/tungsten.yaml

https://gerrit.wikimedia.org/r/621608

Change 621608 merged by Dzahn:
[operations/puppet@production] hiera: remove hosts/tungsten.yaml

https://gerrit.wikimedia.org/r/621608

Change 621609 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] decom tungsten.eqiad.wmnet

https://gerrit.wikimedia.org/r/621609

Change 621609 merged by Dzahn:
[operations/dns@master] decom tungsten.eqiad.wmnet

https://gerrit.wikimedia.org/r/621609

Dzahn updated the task description.
Dzahn moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.
Dzahn added projects: DC-Ops, ops-eqiad.
Restricted Application added a project: Operations. Aug 20 2020, 11:13 PM
Dzahn reassigned this task from Dzahn to Cmjohnson. Aug 20 2020, 11:13 PM
Dzahn removed Cmjohnson as the assignee of this task. Aug 20 2020, 11:16 PM
Dzahn moved this task from Doing 😎 to Externally Blocked 🚧 on the serviceops board.
Dzahn moved this task from Doing to Radar on the Performance-Team board.
Dzahn edited projects, added Performance-Team (Radar); removed Performance-Team.
Dzahn added a subscriber: Cmjohnson.

I am not sure if I am supposed to directly assign to people now or keep just using the ops-<dc> tag as before. Looks like the words on the decom template got changed recently.

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Aug 24 2020, 12:41 PM
Cmjohnson closed this task as Resolved. Wed, Sep 2, 7:25 PM
Cmjohnson updated the task description.

removed from rack, switch port and script update