
Decommission an-launcher1002
Closed, Resolved, Public

Description

This task will track the decommission-hardware of server an-launcher1002.eqiad.wmnet.
With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

an-launcher1002.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all Icinga checks to maintenance mode/disabled while reclaim/decommission takes place (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This performs: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.
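
The cookbook invocation in the steps above can be assembled programmatically before being run on a cumin host. The following is a minimal sketch, not part of the cookbook tooling: the helper name `build_decommission_command` is hypothetical, and `T000000` is a placeholder task ID rather than the ID of this task. Only the command shape (`cookbook sre.hosts.decommission <host fqdn> -t <phab task>`) comes from the description.

```python
import shlex


def build_decommission_command(host_fqdn: str, task_id: str) -> list[str]:
    """Return the argv for the decom cookbook, in the form given in the task steps.

    Hypothetical helper for illustration; it only builds the command and
    does not run anything.
    """
    # The steps ask for a host FQDN, so reject bare hostnames early.
    if "." not in host_fqdn:
        raise ValueError(f"expected a fully qualified domain name, got {host_fqdn!r}")
    return ["cookbook", "sre.hosts.decommission", host_fqdn, "-t", task_id]


if __name__ == "__main__":
    # T000000 is a placeholder, not a real task ID.
    argv = build_decommission_command("an-launcher1002.eqiad.wmnet", "T000000")
    print(shlex.join(argv))
```

Building the command as a list and joining it with `shlex.join` keeps the host name and task ID quoted safely if the string is pasted into a shell.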

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare; systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
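
The age rule in the DC-Ops steps above can be sketched as a small function. This is an illustration under stated assumptions: the function names and the dates in the usage note are hypothetical, and the task does not say how a system that is exactly 5 years old is treated, so the sketch assumes it is decommissioned.

```python
from datetime import date

RECLAIM_AGE_LIMIT_YEARS = 5  # threshold taken from the DC-Ops steps


def age_in_years(in_service: date, today: date) -> int:
    """Whole years elapsed between the in-service date and today."""
    years = today.year - in_service.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (in_service.month, in_service.day):
        years -= 1
    return years


def disposition(in_service: date, today: date) -> str:
    """Return 'reclaim to spare' or 'decommission' based on system age.

    Assumption: a system that is exactly 5 years old is decommissioned;
    the task text leaves that boundary unspecified.
    """
    if age_in_years(in_service, today) < RECLAIM_AGE_LIMIT_YEARS:
        return "reclaim to spare"
    return "decommission"
```

For example, with made-up dates, `disposition(date(2022, 1, 1), date(2025, 11, 1))` falls under the threshold, while `disposition(date(2018, 6, 1), date(2025, 11, 1))` does not.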

There is no direct replacement for this machine, but it is EOL. We have discussed migrating analytics airflow to a dedicated VM and migrating systemd jobs to another launcher VM instead.

Event Timeline

Gehel triaged this task as High priority. Dec 20 2023, 10:45 AM
Gehel moved this task from Incoming to Hardware refresh on the Data-Platform-SRE board.

Hi, checking on this one on behalf of dcops. Is there still a plan to refresh this server?

BTullis subscribed.

We're now making good progress on this. We have created an-launcher1003 and configured it with the same role as an-launcher1002, but at present all of the scheduled jobs from the original host are disabled on the new host.
The plan is to migrate these jobs a few at a time, until no workload remains on an-launcher1002. Then we'll be able to decommission it.

BTullis renamed this task from Plan to decom an-launcher1002 to Decommission an-launcher1002. Sep 30 2025, 8:46 AM

Moving to in-progress, since all active workload has been migrated to an-launcher1003.

Change #1201559 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/hdfs-tools/deploy@master] Update scap targets for hdfs-tools

https://gerrit.wikimedia.org/r/1201559

Change #1201559 merged by Btullis:

[analytics/hdfs-tools/deploy@master] Update scap targets for hdfs-tools

https://gerrit.wikimedia.org/r/1201559

Change #1201564 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch an-launcher1002 to the insetup role prior to decommission

https://gerrit.wikimedia.org/r/1201564

Change #1201581 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery/scap@master] Replace an-launcher1002 with an-launcher1003

https://gerrit.wikimedia.org/r/1201581

Change #1201564 merged by Btullis:

[operations/puppet@production] Switch an-launcher1002 to the insetup role prior to decommission

https://gerrit.wikimedia.org/r/1201564

Icinga downtime and Alertmanager silence (ID=03c1db57-1493-4458-91d0-7715af2bbce0) set by brouberol@cumin1003 for 14 days, 0:00:00 on 1 host(s) and their services with reason: host is being decommissioned

an-launcher1002.eqiad.wmnet

Change #1201581 merged by Btullis:

[analytics/refinery/scap@master] Replace an-launcher1002 with an-launcher1003

https://gerrit.wikimedia.org/r/1201581

BTullis updated the task description.
BTullis updated the task description.

cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: an-launcher1002.eqiad.wmnet

  • an-launcher1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
BTullis added a project: ops-eqiad.
BTullis updated the task description.

Change #1202090 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove last reference to an-launcher1002

https://gerrit.wikimedia.org/r/1202090

Change #1202090 merged by Btullis:

[operations/puppet@production] Remove last reference to an-launcher1002

https://gerrit.wikimedia.org/r/1202090

Jclark-ctr claimed this task.
Jclark-ctr updated the task description.