00:28 < icinga-wm> PROBLEM - Host db2064 is DOWN: PING CRITICAL - Packet loss = 100%
ILO logs:
/system1/log1/record9 Targets Properties number=9 severity=Critical date=05/21/2018 time=00:26 description=System Power Fault Detected (XR: 14 00 MID: FF 4D FC CE C0 FF FF 32 32 0C 0C 40 9C 00 00 01 0F 47 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
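For reference, the iLO record above can be split into its key=value fields when triaging several of these at once. This is only a minimal sketch: the variable names and the regular expression are illustrative assumptions, not part of any iLO tooling, and the date is read as MM/DD/YYYY.

```python
import re
from datetime import datetime

# The iLO event log record exactly as pasted in this task.
RECORD = (
    "/system1/log1/record9 Targets Properties number=9 severity=Critical "
    "date=05/21/2018 time=00:26 description=System Power Fault Detected "
    "(XR: 14 00 MID: FF 4D FC CE C0 FF FF 32 32 0C 0C 40 9C 00 00 01 0F 47 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00)"
)

# A value runs until the next " key=" or the end of the record, so the
# free-text description (which contains spaces) is captured whole.
FIELD_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")

fields = dict(FIELD_RE.findall(RECORD))

# Combine date and time into one timestamp (assumes MM/DD/YYYY and 24h time).
logged_at = datetime.strptime(
    f"{fields['date']} {fields['time']}", "%m/%d/%Y %H:%M"
)

print(fields["severity"], logged_at.isoformat(), fields["description"][:27])
```

The logged time (00:26) is two minutes before the icinga alert at 00:28, which is consistent with the power fault being the cause of the host going down.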
UPDATE
This server is not coming back and should be decommissioned.
Decommission Checklist
- - all system services confirmed offline from production use - should be done by DBA team: https://gerrit.wikimedia.org/r/#/c/434527/
- - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. https://gerrit.wikimedia.org/r/#/c/434297/
- - remove system from all lvs/pybal active configuration - should be done by DBA team
- - any service group puppet/hiera/dsh config removed - should be done by DBA team
- - remove from site.pp (system cannot be powered on, so remove it directly from site.pp - no need to add role spare.) - should be done by DBA team
A few of these steps cannot be done as the server is not booting up.
START NON-INTERRUPTIBLE STEPS - please assign to @RobH for the non-interrupt steps
- - disable puppet on host (cannot be done - system offline)
- - power down host (already done, the system cannot be brought back online)
- - disable switch port
- - switch port assignment noted on this task (for later removal): asw-d-codfw:ge-6/0/12
- - remove all remaining puppet references (including role::spare)
- - remove production dns entries
- - puppet node clean, puppet node deactivate
END NON-INTERRUPTIBLE STEPS
- - system disks wiped (by onsite)
- - system unracked and decommissioned (by onsite), update racktables with result
- - switch port configuration removed from switch once system is unracked.
- - add system to decommission tracking google sheet
- - mgmt dns entries removed.
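
For the "puppet node clean, puppet node deactivate" step above, a dry-run sketch that only prints the two commands for review before anyone runs them on the puppet master. The helper name and the FQDN used here are placeholders (assumptions); the host's real FQDN is not stated in this task and should be substituted.

```python
import shlex
from typing import List


def puppet_cleanup_commands(fqdn: str) -> List[List[str]]:
    """Return the two puppet commands from the checklist for one host.

    Nothing is executed here; the caller decides what to do with them.
    """
    return [
        ["puppet", "node", "clean", fqdn],       # drop the host's cert and cached data
        ["puppet", "node", "deactivate", fqdn],  # mark the node inactive in PuppetDB
    ]


# Placeholder FQDN: replace with the actual name of the decommissioned host.
HOST_FQDN = "db2064.example.invalid"

for cmd in puppet_cleanup_commands(HOST_FQDN):
    print(shlex.join(cmd))
```

Printing rather than executing keeps the step reviewable, which matters here because the host is offline and cannot confirm its own state.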