Decommission centrallog1001
Closed, Resolved (Public)

Description

This task will track the decommission-hardware of server centrallog1001.

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

Steps for service owner:

  • All system services confirmed offline from production use.
  • Set all Icinga checks to maintenance mode/disabled while the reclaim/decommission takes place (likely done by script).
  • Remove the system from all active LVS/PyBal configuration.
  • Remove any service group puppet/hiera/dsh config.
  • Remove the host's role in site.pp. Replacing it with role(spare::system) is recommended to ensure services are offline, but it is not strictly required as long as the decom cookbook below is run IMMEDIATELY afterwards.
  • Log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This performs: bootloader wipe, host power down, Netbox update to decommissioning status, puppet node clean, puppet node deactivate, DebMonitor removal, and a Homer run.
  • Remove all remaining puppet references and all host entries in the puppet repo.
  • Reassign the task from the service owner to a DC Ops team member and the site project (ops-sitename), depending on the site of the server.
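
As a rough illustration of the cookbook step above, the following Python sketch assembles the command line from a host FQDN and a Phabricator task ID. The helper name build_decom_command and the task ID T000000 are hypothetical placeholders, not taken from this task; only the cookbook name and the -t flag come from the checklist.

```python
from __future__ import annotations

import shlex


def build_decom_command(host_fqdn: str, phab_task: str) -> list[str]:
    """Build argv for the decom cookbook as written in the checklist.

    Hypothetical helper for illustration; it only formats the command
    and does not run anything.
    """
    if "." not in host_fqdn:
        raise ValueError("expected a fully qualified host name")
    if not (phab_task.startswith("T") and phab_task[1:].isdigit()):
        raise ValueError("expected a Phabricator task ID such as T123456")
    return ["cookbook", "sre.hosts.decommission", host_fqdn, "-t", phab_task]


if __name__ == "__main__":
    # T000000 is a placeholder task ID, not the real ID of this task.
    argv = build_decom_command("centrallog1001.eqiad.wmnet", "T000000")
    print(shlex.join(argv))
```

Validating the arguments before printing the command is a small guard against passing a short host name where the cookbook expects the FQDN.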

End service owner steps / Begin DC-Ops team steps:

  • System disks removed (by onsite).
  • Determine system age: systems under 5 years old are reclaimed to spare; systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite); update Netbox with the result and set state to offline.
  • IF DECOM: mgmt DNS entries removed.
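
The age rule in the DC-Ops steps above can be sketched as a small Python function. This is only an illustration under assumptions: the names age_in_years and disposition are hypothetical, the purchase date is assumed to be known from inventory records, and a system that is exactly 5 years old is treated as "decommission", which the checklist itself does not specify.

```python
from datetime import date

# Threshold taken from the checklist: under 5 years is reclaimed to spare.
RECLAIM_AGE_LIMIT_YEARS = 5


def age_in_years(purchase_date: date, today: date) -> int:
    """Return the number of whole years between purchase_date and today."""
    years = today.year - purchase_date.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (purchase_date.month, purchase_date.day):
        years -= 1
    return years


def disposition(purchase_date: date, today: date) -> str:
    """Return 'reclaim to spare' or 'decommission' for a system.

    Assumption: exactly 5 years old counts as 'decommission'.
    """
    if age_in_years(purchase_date, today) < RECLAIM_AGE_LIMIT_YEARS:
        return "reclaim to spare"
    return "decommission"


if __name__ == "__main__":
    # Example dates are made up for illustration.
    print(disposition(date(2020, 1, 1), date(2023, 3, 1)))
    print(disposition(date(2015, 6, 1), date(2023, 3, 1)))
```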

Event Timeline

Change 890884 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] rsyslog: Remove centrallog1001 as TLS rsyslog destination

https://gerrit.wikimedia.org/r/890884

Change 895898 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Remove centrallog1002 from the kafka-jumbo allow list

https://gerrit.wikimedia.org/r/895898

Change 895902 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Add centrallog1002 as the kafkatee active host

https://gerrit.wikimedia.org/r/895902

Change 895902 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Add centrallog1002 as the kafkatee active host

https://gerrit.wikimedia.org/r/895902

Change 890884 merged by Andrea Denisse:

[operations/puppet@production] rsyslog: Remove centrallog1001 as TLS rsyslog destination

https://gerrit.wikimedia.org/r/890884

Change 895898 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Remove centrallog1001 from the kafka-jumbo allow list

https://gerrit.wikimedia.org/r/895898

cookbooks.sre.hosts.decommission executed by denisse@cumin1001 for hosts: centrallog1001

  • centrallog1001 (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
andrea.denisse updated the task description.
andrea.denisse added a project: ops-eqiad.
Jclark-ctr claimed this task.
Jclark-ctr updated the task description.
Jclark-ctr subscribed.

Removed from rack; offline script ran.

@andrea.denisse just a heads up: we got an alarm on our core routers in eqiad for a BFD/BGP session down.

Seems this server was configured to BGP peer with the CRs?

set protocols bgp group Anycast4 neighbor 10.64.48.113 description centrallog1001

I'll remove that now, no big deal.

Change 900422 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove BGP peering to centrallog1001 in eqiad

https://gerrit.wikimedia.org/r/900422

Change 900422 merged by jenkins-bot:

[operations/homer/public@master] Remove BGP peering to centrallog1001 in eqiad

https://gerrit.wikimedia.org/r/900422

Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: centrallog1001.eqiad.wmnet