Decommission ms-be1027
Closed, ResolvedPublic

Description

ms-be1027 has faulty hardware and is ready to be decommissioned.

Decommission Checklist

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - update status in netbox (inventory for decom, planned for spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite); use hdparm for SSDs and wipe for HDDs
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

John checked on this first thing this morning. The power light was blinking green, but the server is not getting any power. I had him reseat and drain flea power; that did not work. He then took it down to minimum hardware (1 DIMM and 1 CPU), and the server still will not power on. We've had this issue before, and it is usually resolved with a mainboard swap, but this server is out of warranty.

Reporting from IRC: it doesn't look like the hardware is coming back at all. With this system being out of warranty (OOW) and a new batch of ms-be hosts coming in, I'll start removing this host from service for its premature decom.

Mentioned in SAL (#wikimedia-operations) [2019-09-20T07:14:23Z] <godog> eqiad-prod: start ms-be1027 decom - T233289

fgiunchedi added a project: User-fgiunchedi.

I'll take the task until the host is no longer active in swift

Mentioned in SAL (#wikimedia-operations) [2019-09-23T07:40:29Z] <godog> swift eqiad-prod: continue ms-be1027 decom - T233289

Mentioned in SAL (#wikimedia-operations) [2019-09-24T07:18:22Z] <godog> swift eqiad-prod: continue ms-be1027 decom T233289

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: ms-be1027.eqiad.wmnet

  • ms-be1027.eqiad.wmnet (FAIL)
    • Host steps raised exception: 'HostActions' object has no attribute 'error'

ERROR: some step on some host failed, check the bolded items above

Indeed, the decom script failed on this host, which is already powered down. The full trace is:

root@cumin1001:~# cookbook sre.hosts.decommission -t T233289 ms-be1027.eqiad.wmnet
START - Cookbook sre.hosts.decommission
ATTENTION: destructive action for 1 hosts: ms-be1027.eqiad.wmnet
Are you sure to proceed?
Type "done" to proceed
> done
Management Password: 
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['ms-be1027.eqiad.wmnet']
Downtimed host on Icinga
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['ms-be1027.mgmt.eqiad.wmnet']
Downtimed management interface on Icinga
Host steps raised exception
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 130, in _decommission_host
    remote_host.run_sync('true')
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 476, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 646, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 200, in run
    host_actions = _decommission_host(host, spicerack, reason)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 133, in _decommission_host
    host_actions.error(
AttributeError: 'HostActions' object has no attribute 'error'
Host steps raised exception: 'HostActions' object has no attribute 'error'
ERROR: some step failed, check the task updates.
Updated Phabricator task T233289
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=True)
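
The trace above shows a secondary failure hiding the primary one: the handler for the unreachable host calls a method that does not exist, so the run ends with an AttributeError instead of a clean report. Below is a minimal sketch of that failure mode. The `HostActions` and `RemoteExecutionError` classes and the helper functions here are hypothetical stand-ins written for illustration, not the actual spicerack or cookbook code.

```python
class RemoteExecutionError(Exception):
    """Hypothetical stand-in for the error raised when the host is unreachable."""


class HostActions:
    """Hypothetical stand-in: collects per-host results, but defines no `error` method."""

    def __init__(self):
        self.actions = []

    def success(self, message):
        self.actions.append(("success", message))

    def failure(self, message):
        self.actions.append(("failure", message))


def check_host_reachable():
    # Simulates the connectivity probe failing on a powered-down host.
    raise RemoteExecutionError("Cumin execution failed (exit_code=2)")


def decommission_host():
    host_actions = HostActions()
    try:
        check_host_reachable()
    except RemoteExecutionError as e:
        # Bug: `error` is not defined on HostActions, so this line raises
        # AttributeError while the RemoteExecutionError is being handled.
        host_actions.error("Unable to connect to the host: {e}".format(e=e))
    return host_actions


def run():
    """Return the exception that escapes decommission_host, or None."""
    try:
        decommission_host()
    except Exception as e:
        return e
    return None
```

In this sketch the exception returned by `run()` is the AttributeError, and the original connectivity error survives only as its `__context__`, which is what produces the "During handling of the above exception, another exception occurred" section in a Python traceback.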

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: ms-be1027.eqiad.wmnet

  • ms-be1027.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
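
The second report suggests a tolerant pattern: the step that needs a live host is recorded as failed, and the steps that act on external systems still run. The following is a rough sketch of that pattern only. The function names and the `HostUnreachable` exception are hypothetical, and the step messages are copied from the report above rather than produced by real tooling.

```python
class HostUnreachable(Exception):
    """Hypothetical error for a host that does not answer remote commands."""


def wipe_bootloaders(reachable):
    # Placeholder for a step that needs a live host; it only simulates the outcome.
    if not reachable:
        raise HostUnreachable("Cumin execution failed (exit_code=2)")
    return "Wiped bootloaders"


def decommission(reachable):
    """Return (status, messages) for one simulated host."""
    messages = ["Downtimed host on Icinga"]
    status = "PASS"
    try:
        messages.append(wipe_bootloaders(reachable))
    except HostUnreachable as e:
        # Record the problem and keep going: later steps do not need the host.
        messages.append(
            "Unable to connect to the host, wipe of bootloaders "
            "will not be performed: {}".format(e)
        )
        status = "FAIL"
    # These steps act on external systems, so they run either way.
    messages.extend(["Powered off", "Set Netbox status to Decommissioning"])
    return status, messages
```

Calling `decommission(False)` in this sketch yields a FAIL status with every later step still listed, which mirrors the shape of the report: the run is flagged for attention without leaving the host half-decommissioned.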

Change 539136 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Decom ms-be1027

https://gerrit.wikimedia.org/r/539136

Mentioned in SAL (#wikimedia-operations) [2019-09-26T07:49:54Z] <godog> swift eqiad-prod: continue ms-be1027 decom - T233289

Change 539136 merged by Filippo Giunchedi:
[operations/puppet@production] Decom ms-be1027

https://gerrit.wikimedia.org/r/539136

Mentioned in SAL (#wikimedia-operations) [2019-09-27T07:36:12Z] <godog> swift eqiad-prod: remove ms-be1027 - T233289

Change 539491 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Remove ms-be1027 production entries

https://gerrit.wikimedia.org/r/539491

Change 539491 merged by Filippo Giunchedi:
[operations/dns@master] Remove ms-be1027 production entries

https://gerrit.wikimedia.org/r/539491

fgiunchedi renamed this task from Unable to power on ms-be1027 to Decommission ms-be1027. Sep 27 2019, 7:44 AM
fgiunchedi removed fgiunchedi as the assignee of this task.
fgiunchedi assigned this task to Cmjohnson.
fgiunchedi edited projects, added decommission-hardware; removed User-fgiunchedi.
fgiunchedi updated the task description.

@Cmjohnson host is ready for decom! thanks

Removed all drives and degaussed them. Because of the hardware failure, the host will not boot to wipe the drives.

papaul@asw2-d-eqiad# show | compare 
[edit interfaces]
-   xe-7/0/13 {
-       description ms-be1027;
-       enable;
-   }
Papaul updated the task description.

Change 549912 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS for analytic1003, dbstore1002 abd ms-be1027

https://gerrit.wikimedia.org/r/549912

Change 549912 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt DNS for analytic1003, dbstore1002 abd ms-be1027

https://gerrit.wikimedia.org/r/549912