
Decommission analytics1032
Closed, Resolved (Public)

Description

The analytics1032.eqiad.wmnet host is an old Hadoop node that we have used for the Hadoop testing cluster. The SRE team allowed this as a special case, since testing the Kerberos authentication scheme in cloud/labs was not enough. The host is causing more issues than benefits, and since it is out of warranty (OOW), let's decommission it.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps,

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host (host stuck in boot, skipped)

  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
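
As an illustration of the debmonitor step in the list above, here is a minimal sketch that assembles the same curl invocation for a given host and prints it for review. The URL and certificate paths are copied from the checklist; the helper name `debmonitor_delete_command` and the basic host-name check are assumptions made for this example, not part of wmf-decommission-host.

```python
import shlex

# Endpoint and certificate paths are copied from the checklist item above.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_command(host_fqdn: str) -> list[str]:
    """Build the argv for the curl DELETE call (hypothetical helper)."""
    # Assumed sanity check: reject values that would change the URL path.
    if not host_fqdn or "/" in host_fqdn or " " in host_fqdn:
        raise ValueError(f"unexpected host name: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        f"{DEBMONITOR_HOSTS_URL}/{host_fqdn}",
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(debmonitor_delete_command("analytics1032.eqiad.wmnet")))
```

Building the command as a list rather than a single string avoids shell quoting issues if it is later passed to a process runner.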

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

Event Timeline

elukey created this task. Sep 17 2019, 8:04 AM

Change 537321 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare analytics1032 for decommission

https://gerrit.wikimedia.org/r/537321

Mentioned in SAL (#wikimedia-analytics) [2019-09-17T08:19:40Z] <elukey> manually decommed analytics1032 for hdfs/yarn on the Hadoop testing cluster - T233080

Change 537321 merged by Elukey:
[operations/puppet@production] Prepare analytics1032 for decommission

https://gerrit.wikimedia.org/r/537321

elukey assigned this task to RobH. Sep 17 2019, 9:56 AM
elukey triaged this task as Medium priority.
elukey updated the task description.

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1032.eqiad.wmnet

  • analytics1032.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for analytics1032.mgmt.eqiad.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
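
The bolding that the error message refers to does not survive in plain text. As a rough sketch, the failed steps in the summary above can be picked out by their wording. The prefixes "Unable to" and "Failed to" are taken only from this one log, and the helper `failed_steps` is hypothetical; the cookbook itself may mark failures differently.

```python
# Assumption: failure lines in this log start with one of these phrases.
FAILURE_PREFIXES = ("Unable to", "Failed to")


def failed_steps(summary: str) -> list[str]:
    """Return the step lines that look like failures (hypothetical helper)."""
    failed = []
    for raw in summary.splitlines():
        # Drop indentation and the leading bullet character, if any.
        step = raw.strip().lstrip("•").strip()
        if step.startswith(FAILURE_PREFIXES):
            failed.append(step)
    return failed


if __name__ == "__main__":
    log = """
    • Downtimed host on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for analytics1032.mgmt.eqiad.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning
    """
    for step in failed_steps(log):
        print(step)
```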

elukey added a comment. Oct 1 2019, 6:38 AM

The host is stuck while booting, so the above script failed. I manually powered it off, but the cleanup in puppet/debmonitor/etc. should have been done anyway.

elukey updated the task description. Oct 1 2019, 6:41 AM

Change 540017 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove analytics1032 from puppet

https://gerrit.wikimedia.org/r/540017

Change 540017 merged by Elukey:
[operations/puppet@production] Remove analytics1032 from puppet

https://gerrit.wikimedia.org/r/540017

elukey added a comment. Oct 1 2019, 6:48 AM
elukey@asw2-c-eqiad> show interfaces descriptions | match analytics1032
ge-3/0/12       up    down analytics1032 - no-bw-mon
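
To record the switch port on the task without reading the output by eye, the pasted output above could be parsed as sketched below. This assumes the four-column layout seen here (interface, admin state, link state, description) and a description that starts with the host name; `find_switch_ports` is a hypothetical helper.

```python
def find_switch_ports(output: str, host: str) -> list[tuple[str, str, str]]:
    """Return (interface, admin, link) for lines whose description starts with host."""
    ports = []
    for line in output.splitlines():
        # Split into at most four fields; the fourth keeps the whole description.
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        interface, admin, link, description = fields
        # Assumption: the description begins with the short host name.
        if description.split()[0] == host:
            ports.append((interface, admin, link))
    return ports


if __name__ == "__main__":
    pasted = (
        "elukey@asw2-c-eqiad> show interfaces descriptions | match analytics1032\n"
        "ge-3/0/12       up    down analytics1032 - no-bw-mon\n"
    )
    print(find_switch_ports(pasted, "analytics1032"))
```
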
elukey updated the task description. Oct 1 2019, 6:49 AM

Change 540019 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove analytics1032's prod DNS records

https://gerrit.wikimedia.org/r/540019

Change 540019 merged by Elukey:
[operations/dns@master] Remove analytics1032's prod DNS records

https://gerrit.wikimedia.org/r/540019

elukey updated the task description. Oct 1 2019, 7:07 AM
elukey added a comment. Oct 2 2019, 7:03 PM
elukey@asw2-c-eqiad# show | compare
[edit interfaces interface-range disabled]
     member ge-7/0/34 { ... }
+    member ge-3/0/12;
[edit interfaces]
-   ge-3/0/12 {
-       description "analytics1032 - no-bw-mon";
-   }
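
The diff above removes the per-port stanza and adds the port to the `disabled` interface-range. A minimal sketch of Junos configuration-mode commands that would produce an equivalent diff is given below. The commands actually typed are not recorded on this task, so this is an assumption, and `disable_port_commands` is a hypothetical helper.

```python
def disable_port_commands(interface: str, disabled_range: str = "disabled") -> list[str]:
    """Return Junos configuration-mode commands for retiring a switch port."""
    return [
        # Drop the per-port stanza, including its description.
        f"delete interfaces {interface}",
        # Add the port to the interface-range that holds disabled ports.
        f"set interfaces interface-range {disabled_range} member {interface}",
    ]


if __name__ == "__main__":
    for command in disable_port_commands("ge-3/0/12"):
        print(command)
```

Reviewing the result with `show | compare` before committing, as done above, remains the check that the change matches what was intended.
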
elukey updated the task description. Oct 2 2019, 7:05 PM
elukey reassigned this task from RobH to Cmjohnson. Oct 9 2019, 9:23 AM
RobH removed a subscriber: RobH. Mar 3 2020, 6:01 PM
RobH updated the task description.
RobH moved this task from Backlog to Decommission on the ops-eqiad board. Apr 1 2020, 5:51 PM

Change 597869 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt for the asset tag associated with analytics1032

https://gerrit.wikimedia.org/r/597869

Change 597869 merged by Cmjohnson:
[operations/dns@master] Removing mgmt for the asset tag associated with analytics1032

https://gerrit.wikimedia.org/r/597869

Cmjohnson closed this task as Resolved. May 21 2020, 9:48 PM
Cmjohnson updated the task description.

Removed from rack; mgmt DNS removed; switch ports were already removed. Updated Netbox.