
Decommission analytics1032
Closed, Resolved (Public)


The analytics1032.eqiad.wmnet host is an old Hadoop node that we have used for the Hadoop testing cluster. The SRE team allowed this as a special case, since testing the Kerberos authentication scheme in cloud/labs was not enough. The host is causing more issues than benefits, and since it is out of warranty (OOW), let's decom it.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry in site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps,

Steps for DC-Ops:

The following steps cannot be interrupted, as an interruption would leave the system in an unfinished state.

Start non-interrupt steps:

  • [x] - disable puppet on host (host stuck in boot, skipped)

  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
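
For reference, the DebMonitor removal step in the checklist above can be wrapped in a small shell helper. This is only a minimal sketch, not what wmf-decommission-host actually does: the function name `debmonitor_delete_cmd` is hypothetical, and it merely prints the checklist command with `${HOST_FQDN}` filled in so it can be reviewed before someone runs it by hand on the appropriate host.

```shell
#!/bin/sh
# Hypothetical helper (not part of wmf-decommission-host): print, but do not
# run, the DebMonitor removal command from the checklist for a given FQDN.
debmonitor_delete_cmd() {
    host_fqdn="$1"
    if [ -z "$host_fqdn" ]; then
        echo "usage: debmonitor_delete_cmd HOST_FQDN" >&2
        return 1
    fi
    # URL and certificate paths are copied verbatim from the checklist step.
    printf 'sudo curl -X DELETE %s --cert %s --key %s\n' \
        "https://debmonitor.discovery.wmnet/hosts/${host_fqdn}" \
        /etc/debmonitor/ssl/cert.pem \
        /etc/debmonitor/ssl/server.key
}

# Example: show the command that would be used for this task's host.
debmonitor_delete_cmd analytics1032.eqiad.wmnet
```

Printing instead of executing keeps the sketch safe to try anywhere; the real request only makes sense from a host that holds the DebMonitor client certificate.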

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

Event Timeline

Change 537321 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Prepare analytics1032 for decommission

Mentioned in SAL (#wikimedia-analytics) [2019-09-17T08:19:40Z] <elukey> manually decommed analytics1032 for hdfs/yarn on the Hadoop testing cluster - T233080

Change 537321 merged by Elukey:
[operations/puppet@production] Prepare analytics1032 for decommission

elukey triaged this task as Medium priority.
elukey updated the task description.

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1032.eqiad.wmnet

  • analytics1032.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for analytics1032.mgmt.eqiad.wmnet failed (exit=1): b''
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

The host is stuck while booting, so the above script failed. I manually powered it off; the cleanup in puppet, debmonitor, etc. should have been done anyway.

Change 540017 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove analytics1032 from puppet

Change 540017 merged by Elukey:
[operations/puppet@production] Remove analytics1032 from puppet

elukey@asw2-c-eqiad> show interfaces descriptions | match analytics1032
ge-3/0/12       up    down analytics1032 - no-bw-mon
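
The switch output above has four columns (interface, admin state, link state, description), so the port to note on the task can be pulled out with a short filter. A minimal sketch, assuming the output has been saved or piped as plain text; the function name `port_for_host` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "show interfaces descriptions" output on stdin and
# print the interface whose description starts with the given hostname.
port_for_host() {
    # Fields: $1 interface, $2 admin state, $3 link state, $4 first word of description
    awk -v host="$1" '$4 == host { print $1, "admin=" $2, "link=" $3 }'
}

# Sample line copied from the switch output above.
line='ge-3/0/12       up    down analytics1032 - no-bw-mon'
printf '%s\n' "$line" | port_for_host analytics1032
```

For the sample line this prints `ge-3/0/12 admin=up link=down`: the link is down because the host is powered off, while the port itself is still administratively enabled.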

Change 540019 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove analytics1032's prod DNS records

Change 540019 merged by Elukey:
[operations/dns@master] Remove analytics1032's prod DNS records

elukey@asw2-c-eqiad# show | compare
[edit interfaces interface-range disabled]
     member ge-7/0/34 { ... }
+    member ge-3/0/12;
[edit interfaces]
-   ge-3/0/12 {
-       description "analytics1032 - no-bw-mon";
-   }
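
A diff like the one above would typically come from two Junos configuration-mode commands: one adding the port to the `disabled` interface-range and one deleting the port's own stanza. The sketch below is reconstructed from the diff and is an assumption, not a transcript of what was typed on asw2-c-eqiad; the helper name `junos_disable_port_cmds` is hypothetical and it only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper: print Junos configuration-mode commands that would move
# a port into the "disabled" interface-range and remove its own stanza.
# Reconstructed from the diff above; not a record of the actual session.
junos_disable_port_cmds() {
    port="$1"
    if [ -z "$port" ]; then
        echo "usage: junos_disable_port_cmds PORT" >&2
        return 1
    fi
    cat <<EOF
set interfaces interface-range disabled member ${port}
delete interfaces ${port}
EOF
}

# Example: the port noted earlier on this task.
junos_disable_port_cmds ge-3/0/12
```

On the switch, the result would then be reviewed with `show | compare` (as in the comment above) before committing.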

Change 597869 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt for the asset tag associated with analytics1032

Change 597869 merged by Cmjohnson:
[operations/dns@master] Removing mgmt for the asset tag associated with analytics1032

Cmjohnson updated the task description.

Removed from rack, mgmt DNS removed, switch ports were already removed. Updated Netbox.