
decommission lvs400[1-4].ulsfo.wmnet
Closed, Resolved (Public)

Description

This task will track the decommissioning of lvs400[1-4].ulsfo.wmnet. All four of these hosts are well out of warranty, and lvs400[567] have been purchased to replace them.

lvs400[567]'s setup is tracked via T178436, and they are ready to be placed into service.
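For anyone scripting against these hosts, the bracket shorthand used above (`lvs400[1-4]` for a numeric range, `lvs400[567]` for a set of single digits) can be expanded into individual hostnames. The following is a minimal sketch only: the helper name `expand_hosts` is hypothetical and is not part of any tooling referenced on this task, and it assumes at most one bracket group per pattern.

```python
import re
from typing import List

# Matches a single bracket group such as "[1-4]" or "[567]".
_BRACKET = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> List[str]:
    """Expand one bracket group into a list of hostnames.

    Hypothetical helper for illustration; handles "a-b" numeric ranges
    and plain digit sets, and assumes only one bracket group is present.
    """
    match = _BRACKET.search(pattern)
    if match is None:
        # No shorthand present: the pattern is already a single hostname.
        return [pattern]

    body = match.group(1)
    if "-" in body:
        # Numeric range, inclusive at both ends, e.g. "1-4" -> 1, 2, 3, 4.
        start, end = body.split("-", 1)
        suffixes = [str(n) for n in range(int(start), int(end) + 1)]
    else:
        # Set of single characters, e.g. "567" -> 5, 6, 7.
        suffixes = list(body)

    head, tail = pattern[:match.start()], pattern[match.end():]
    return [f"{head}{suffix}{tail}" for suffix in suffixes]


if __name__ == "__main__":
    print(expand_hosts("lvs400[1-4].ulsfo.wmnet"))
    print(expand_hosts("lvs400[567]"))
```

Under these assumptions, the first call yields the four hosts being decommissioned here and the second yields the three replacement hosts.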

lvs4001:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host (hosts were powered down and unracked before this step)
  • - remove all remaining puppet references (including role::spare) https://gerrit.wikimedia.org/r/366037
  • - power down host (host is not cabled up, so it cannot power up)
  • - disable switch port (port was never set back up in new racks, so it's disabled)
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - swapped places with new systems as needed, and now resides in rack with no cabling.
  • - mgmt dns entries removed.
  • - system unracked and decommissioned (by onsite), update racktables with result

lvs4002:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host (hosts were powered down and unracked before this step)
  • - remove all remaining puppet references (including role::spare) https://gerrit.wikimedia.org/r/366037
  • - power down host (host is not cabled up, so it cannot power up)
  • - disable switch port (port was never set back up in new racks, so it's disabled)
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - swapped places with new systems as needed, and now resides in rack with no cabling.
  • - mgmt dns entries removed.

lvs4003:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host (hosts were powered down and unracked before this step)
  • - remove all remaining puppet references (including role::spare) https://gerrit.wikimedia.org/r/366037
  • - power down host (host is not cabled up, so it cannot power up)
  • - disable switch port (port was never set back up in new racks, so it's disabled)
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - swapped places with new systems as needed, and now resides in rack with no cabling.
  • - mgmt dns entries removed.

lvs4004:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host (hosts were powered down and unracked before this step)
  • - remove all remaining puppet references (including role::spare) https://gerrit.wikimedia.org/r/366037
  • - power down host (host is not cabled up, so it cannot power up)
  • - disable switch port (port was never set back up in new racks, so it's disabled)
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - swapped places with new systems as needed, and now resides in rack with no cabling.
  • - mgmt dns entries removed.

Event Timeline

These are now non-primary, but still active as backups for now. Will switch to spare role and remove from router configs post-Thanksgiving and then real decom can start.

Change 393610 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] lvs400[1-4] to spare::system

https://gerrit.wikimedia.org/r/393610

Change 393610 merged by BBlack:
[operations/puppet@production] lvs400[1-4] to spare::system

https://gerrit.wikimedia.org/r/393610

These have been spared-out for a while now and we're fine on the new ones, please kill.

wmf-decommission-host was executed by robh for lvs4001.ulsfo.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs4002.ulsfo.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs4003.ulsfo.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for lvs4004.ulsfo.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

RobH updated the task description.

So, these are all racked in the two new racks, but without any power or network. As such, I'll just continue with the remainder of the steps (puppet was never disabled, but they'll never have network again.)

Change 464005 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom lvs400[1-4].ulsfo.wmnet

https://gerrit.wikimedia.org/r/464005

Change 464005 merged by RobH:
[operations/puppet@production] decom lvs400[1-4].ulsfo.wmnet

https://gerrit.wikimedia.org/r/464005

Change 464006 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom lvs400[1-4] dns entries

https://gerrit.wikimedia.org/r/464006

Change 464006 merged by RobH:
[operations/dns@master] decom lvs400[1-4] dns entries

https://gerrit.wikimedia.org/r/464006

RobH updated the task description.
RobH removed subscribers: gerritbot, ops-monitoring-bot.

I went ahead and plugged power into one of the power supplies on each of these four systems, then USB-booted Linux and ran one wipe instance per shell for each disk (sda and sdb). Estimated time to completion for each is 4 days.

Wipe complete on all 4 systems.

RobH mentioned this in Unknown Object (Task). Jun 25 2019, 4:20 PM