
Decommission contint2001.wikimedia.org
Closed, Resolved | Public | Request

Description

This task will track the decommission-hardware of server contint2001.wikimedia.org

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

contint2001.wikimedia.org

Steps for service owner:

  • all system services confirmed offline from production use
  • set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script)
  • remove system from all lvs/pybal active configuration
  • remove any service group puppet/hiera/dsh config
  • remove the host from site.pp; replacing it with role(spare::system) is recommended to ensure services are offline, but is not strictly required as long as the decom script below is run IMMEDIATELY
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to DC ops team member and site project (ops-sitename) depending on site of server
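
The cookbook step in the list above can be sketched as a small wrapper that checks its arguments and prints the command for review instead of running it. The helper name `decom_cmd` and the task ID `T000000` are hypothetical placeholders; only the `cookbook sre.hosts.decommission <host fqdn> -t <phab task>` form comes from the checklist.

```shell
#!/usr/bin/env bash
# Sketch only: builds the decommission command from the checklist and prints it
# instead of executing it. decom_cmd is a hypothetical helper, not part of the
# cookbook tooling.
decom_cmd() {
    local fqdn="$1" task="$2"
    # Refuse to build a command with missing arguments.
    if [ -z "$fqdn" ] || [ -z "$task" ]; then
        echo "usage: decom_cmd <host fqdn> <phab task>" >&2
        return 1
    fi
    # A short hostname would not match the <host fqdn> form in the checklist.
    case "$fqdn" in
        *.*) ;;
        *) echo "expected a fully qualified host name, got: $fqdn" >&2; return 1 ;;
    esac
    printf 'cookbook sre.hosts.decommission %s -t %s\n' "$fqdn" "$task"
}

# Example with a placeholder task ID; review the printed line before running it
# on a cumin host.
decom_cmd contint2001.wikimedia.org T000000
```

Printing the command first is a deliberate choice here: the cookbook wipes the bootloader and powers the host down, so a mistyped host name is expensive.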

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed
  • IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag
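
The age rule in the DC-Ops steps above can be sketched as a short function. This is only an illustration: `decom_or_reclaim` is a hypothetical helper, comparing whole years is a simplification, and where the purchase year comes from (for example an asset record) is an assumption. The checklist does not say what happens at exactly 5 years, so the sketch reports that case separately rather than guessing.

```shell
# Sketch only: apply the "under 5 years reclaim, over 5 years decommission" rule.
# decom_or_reclaim is a hypothetical helper; the source of the purchase year is
# an assumption.
decom_or_reclaim() {
    # Second argument defaults to the current calendar year.
    local purchase_year="$1" current_year="${2:-$(date +%Y)}"
    local age=$(( current_year - purchase_year ))
    if [ "$age" -lt 5 ]; then
        echo "reclaim"
    elif [ "$age" -gt 5 ]; then
        echo "decommission"
    else
        # The checklist does not cover a system that is exactly 5 years old.
        echo "undetermined"
    fi
}

# Example with made-up years: a system bought in 2016, evaluated in 2023.
decom_or_reclaim 2016 2023
```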

Event Timeline

LSobanski added a subscriber: hashar.

@hashar: could you confirm whether we're OK to go ahead with decommissioning contint1001.wikimedia.org?

Potentially the home directories, since we might have some scripts there. For the other server (contint2001) I asked to delay the decommissioning so I could look at the home directory, but eventually Daniel proposed to create a tarball and put it on the new contint2002 (though I can't find it right now).

I no longer have access to contint1001.wikimedia.org. If we could get a tarball of /home uploaded to contint2002.wikimedia.org (I guess under /srv), that would be great. I can then dig into the home dir, retrieve any scripts we might have had there, and upstream them to a git repo (integration/config or puppet).

Hey @hashar, I was looking into this but I'd like some clarification: the files you require are on contint1001, right?

Arnoldokoth renamed this task from decommission contint1001.wikimedia.org to Decommission contint2001.wikimedia.org. Jul 31 2023, 9:52 PM
Arnoldokoth updated the task description.

@hashar Slight confusion initially. contint1001 was already decom'd so I have edited the ticket to refer to contint2001.

@hashar Kindly confirm if all the files you wanted are in there.

aokoth@contint2002:~$ sudo ls -l /home/hashar/hashar.tar.gz 
-rwxr--r-- 1 hashar wikidev 4258053 Jul 31 22:16 /home/hashar/hashar.tar.gz

@Arnoldokoth can you make a tarball of the whole /home from contint2001 and put it in /root on contint2002? There are potential scripts to be salvaged from Kunal's and Tyler's home dirs as well. I will dig into the homes when I am back from vacation :)

Sure thing @hashar! Enjoy your vacation. :)

root@contint2002:~# ls -lah /root/home.tar.gz 
-rw-r--r-- 1 root root 293M Aug  4 11:05 /root/home.tar.gz
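
The ticket shows the resulting tarball but not the commands that produced it. A minimal sketch of one way to archive a directory and check that the archive is readable, assuming GNU tar; `archive_dir` is a hypothetical helper, and how the file was copied from contint2001 to contint2002 is not recorded here.

```shell
# Sketch only: archive a directory into a gzip-compressed tarball and verify it
# is readable. archive_dir is a hypothetical helper; the commands actually used
# on this task are not recorded in the ticket.
archive_dir() {
    local src="$1" dest="$2"
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    # -C changes into the parent so the archive holds relative paths (home/...).
    tar -czf "$dest" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Listing the archive is a cheap readability check before the source is wiped.
    tar -tzf "$dest" > /dev/null
}
```

On the source host this would be called as `archive_dir /home /root/home.tar.gz`, writing the archive outside the directory being archived.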

cookbooks.sre.hosts.decommission executed by aokoth@cumin1001 for hosts: contint2001.wikimedia.org

  • contint2001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 949040 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] contint2001: puppet cleanup post decom

https://gerrit.wikimedia.org/r/949040

Change 949040 merged by AOkoth:

[operations/puppet@production] contint2001: puppet cleanup post decom

https://gerrit.wikimedia.org/r/949040

Jelto edited projects, added ops-codfw; removed DC-Ops.
Jelto added subscribers: Arnoldokoth, Jelto.

Adjusting tags for DC-Ops (they need ops-codfw tag instead of team tag to proceed).

Papaul moved this task from Non-Urgent to Decommission on the ops-codfw board.
Jhancock.wm updated the task description.

Change 1007427 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] hieradata: delete hosts/contint2001

https://gerrit.wikimedia.org/r/1007427

Change 1007427 merged by Dzahn:

[operations/puppet@production] hieradata: delete hosts/contint2001

https://gerrit.wikimedia.org/r/1007427