This task will track the decommission-hardware of server db2101.
With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.
db2101
Steps for service owner:
- - all system services confirmed offline from production use
- - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
- - remove system from all lvs/pybal active configuration
- - any service group puppet/hiera/dsh config removed
- - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. The cookbook performs: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
- - remove all remaining puppet references and all host entries in the puppet repo
- - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.
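As a minimal sketch of the cookbook step above, the following Python helper assembles the command line quoted in the checklist. The function name `decommission_command`, the input checks, and the placeholder host and task values are assumptions made for illustration only; they are not part of the cookbook or of this task.

```python
import shlex


def decommission_command(host_fqdn: str, phab_task: str) -> list[str]:
    """Build the argv for: cookbook sre.hosts.decommission <host fqdn> -t <phab task>.

    The validation below is an illustrative assumption, not cookbook behaviour.
    """
    if "." not in host_fqdn:
        raise ValueError(f"expected a fully qualified domain name, got {host_fqdn!r}")
    if not (phab_task.startswith("T") and phab_task[1:].isdigit()):
        raise ValueError(f"expected a task ID such as T123456, got {phab_task!r}")
    return ["cookbook", "sre.hosts.decommission", host_fqdn, "-t", phab_task]


if __name__ == "__main__":
    # Placeholder FQDN and task ID; substitute the real values for the host.
    print(shlex.join(decommission_command("db2101.example.org", "T123456")))
```

The helper only builds the command; it does not run it, so it can be used to double-check the arguments before pasting them on the cumin host.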
End service owner steps / Begin DC-Ops team steps:
- - system disks removed (by onsite)
- - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
- - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
- - IF DECOM: mgmt dns entries removed.
- - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag
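The reclaim/decom decision in the DC-Ops steps above can be sketched as a small Python function. The names `age_in_years` and `disposition` are hypothetical, the dates in the usage lines are invented, and treating a system that is exactly 5 years old as a decommission is an assumption, since the checklist only covers "under" and "over" 5 years.

```python
from datetime import date

RECLAIM_MAX_AGE_YEARS = 5  # threshold taken from the checklist above


def age_in_years(purchased: date, today: date) -> int:
    """Whole years elapsed between the purchase date and today."""
    years = today.year - purchased.year
    # Subtract one year if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (purchased.month, purchased.day):
        years -= 1
    return years


def disposition(purchased: date, today: date) -> tuple[str, str]:
    """Return (action, netbox_state) for a system of the given age.

    Assumption: exactly 5 years old counts as decommission.
    """
    if age_in_years(purchased, today) < RECLAIM_MAX_AGE_YEARS:
        return ("reclaim", "inventory")
    return ("decommission", "offline")


if __name__ == "__main__":
    # Hypothetical purchase dates, not the real age of any host.
    print(disposition(date(2021, 1, 1), date(2024, 4, 11)))
    print(disposition(date(2018, 1, 1), date(2024, 4, 11)))
```

For a reclaim, the hostname change to the asset tag is still a manual netbox step and is not modelled here.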
Old debugging information:
1349 Embedded Flash: Restarted 04/11/2024 07:04:32 1 Firmware
1348 Server power restored. 04/11/2024 07:04:30 1 Maintenance, Administration
1347 Server reset. 04/11/2024 07:04:30 1 Maintenance, Administration
1346 The server could not be powered on or a server critical error occurred 04/11/2024 07:04:01 1 Power
1345 The server could not be powered on or a server critical error occurred 04/11/2024 07:02:44 1 Power
1344 The server could not be powered on or a server critical error occurred 04/11/2024 04:35:59 1 Power
ID Severity Class Description Last Update Count Category
381 Network All links are down in adapter HPE Ethernet 1Gb 4-port 331i Adapter - NIC in slot 0 04/11/2024 07:12:45 1 Hardware
380 Network HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1 04/11/2024 07:12:59 1 Hardware
379 Network HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1 04/11/2024 07:12:28 1 Hardware
378 UEFI One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance. 04/11/2024 07:11:12 1 Configuration
377 UEFI Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 9). The DIMM is mapped out and is currently not available. 04/11/2024 07:04:58 2 Hardware
376 CPU Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000007, Status 0xBC000000'01010091, Address 0x0000004C'F09F92C0, Misc 0x200401C0'8B002086). 04/11/2024 07:04:00 1 Hardware
375 CPU Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000004, Bank 0x00000001, Status 0xBD800000'00100134, Address 0x0000004C'F09F92C0, Misc 0x00000000'00000086). 04/11/2024 07:04:00 1 Hardware
374 UEFI DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9) 04/11/2024 07:04:01 2 Hardware
373 CPU Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000007, Status 0xBC000000'01010091, Address 0x0000004C'F3D652C0, Misc 0x20040AC5'08202086). 04/11/2024 07:02:44 1 Hardware
372 UEFI DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9) 04/11/2024 04:35:59 1 Hardware
371 CPU Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000007, Status 0xBC000000'01010091, Address 0x0000004C'F5B752C0, Misc 0x20040AC5'0FE02086). 04/11/2024 04:35:59 1 Hardware
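The memory-related entries in the log above each name a processor and DIMM slot. As a rough sketch, assuming the descriptions have already been split out into strings, the Python below tallies how many log entries mention each (processor, DIMM) pair. The function name `dimm_error_counts` is hypothetical, the regular expression only matches the "(Processor N, DIMM M)" wording seen in this log, and the tally counts entries rather than summing the log's Count column.

```python
import re
from collections import Counter

# Matches the "(Processor 2, DIMM 9)" wording used in the UEFI entries above.
DIMM_RE = re.compile(r"\(Processor (\d+), DIMM (\d+)\)")


def dimm_error_counts(descriptions: list[str]) -> Counter:
    """Count log entries that mention each (processor, dimm) pair."""
    counts: Counter = Counter()
    for text in descriptions:
        for proc, dimm in DIMM_RE.findall(text):
            counts[(int(proc), int(dimm))] += 1
    return counts


if __name__ == "__main__":
    # Descriptions copied from entries 377, 374 and 372 in the log above.
    sample = [
        "Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 9). "
        "The DIMM is mapped out and is currently not available.",
        "DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9)",
        "DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9)",
    ]
    print(dimm_error_counts(sample))
```

The Machine Check Exception entries are deliberately not matched, because they report an APIC ID and bank rather than a DIMM slot.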