Decommission db2101 (was: db2101 crashed)
Closed, Resolved (Public)

Description

This task will track the decommission-hardware of server db2101.

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

db2101

Steps for service owner:

  • all system services confirmed offline from production use
  • set all icinga checks to maint mode/disabled while the reclaim/decommission takes place (likely done by script)
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This performs a bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to no owner and ensure the site project (ops-sitename, depending on the site of the server) is assigned.
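
As a minimal sketch of the cookbook step above, the helper below assembles the command line for a host and task before it is run by hand on the cumin host. Only the `cookbook sre.hosts.decommission <host fqdn> -t <phab task>` form comes from the task description. The function name `build_decom_command`, the two format checks, and the task ID `T000000` are illustrative assumptions, not part of any Wikimedia tooling.

```python
import re
import shlex


def build_decom_command(fqdn: str, task: str) -> list[str]:
    """Return the argv for the decommission cookbook (hypothetical helper)."""
    # Assumption: a host FQDN has at least three dot-separated labels,
    # e.g. db2101.codfw.wmnet. A bare hostname is rejected.
    if not re.fullmatch(r"[a-z0-9-]+(\.[a-z0-9-]+){2,}", fqdn):
        raise ValueError(f"expected a fully qualified host name, got {fqdn!r}")
    # Assumption: Phabricator task IDs look like "T" followed by digits.
    if not re.fullmatch(r"T\d+", task):
        raise ValueError(f"expected a task ID such as T000000, got {task!r}")
    return ["cookbook", "sre.hosts.decommission", fqdn, "-t", task]


# T000000 is a placeholder, not the ID of this task.
print(shlex.join(build_decom_command("db2101.codfw.wmnet", "T000000")))
```

Checking the arguments first is a cheap guard, since the cookbook wipes the bootloader and powers the host down.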

End service owner steps / Begin DC-Ops team steps:

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed as spares, systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.
  • IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

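The age rule in the DC-Ops steps can be sketched as a small function. This is an illustration only: the name `hardware_disposition`, the use of a purchase date as the start of the system's age, and the handling of a system that is exactly 5 years old (treated here as decommission, which the checklist does not specify) are all assumptions. The dates in the usage lines are made up and are not db2101's.

```python
from datetime import date


def hardware_disposition(purchase_date: date, today: date) -> str:
    """Return "reclaim" for systems under 5 years old, else "decommission"."""
    try:
        fifth_anniversary = purchase_date.replace(year=purchase_date.year + 5)
    except ValueError:
        # Purchase date was Feb 29; the year five years later is never a leap year.
        fifth_anniversary = purchase_date.replace(year=purchase_date.year + 5, day=28)
    # Assumption: a system that is exactly 5 years old counts as "over".
    return "reclaim" if today < fifth_anniversary else "decommission"


# Hypothetical purchase dates, evaluated on the day the task was renamed.
print(hardware_disposition(date(2020, 1, 1), date(2024, 4, 15)))
print(hardware_disposition(date(2018, 1, 1), date(2024, 4, 15)))
```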
Old debugging information:

ID	Description	Last Update	Count	Category
1349	Embedded Flash: Restarted	04/11/2024 07:04:32	1	Firmware
1348	Server power restored.	04/11/2024 07:04:30	1	Maintenance, Administration
1347	Server reset.	04/11/2024 07:04:30	1	Maintenance, Administration
1346	The server could not be powered on or a server critical error occurred	04/11/2024 07:04:01	1	Power
1345	The server could not be powered on or a server critical error occurred	04/11/2024 07:02:44	1	Power
1344	The server could not be powered on or a server critical error occurred	04/11/2024 04:35:59	1	Power

ID	Class	Description	Last Update	Count	Category
381	Network	All links are down in adapter HPE Ethernet 1Gb 4-port 331i Adapter - NIC in slot 0	04/11/2024 07:12:45	1	Hardware
380	Network	HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1	04/11/2024 07:12:59	1	Hardware
379	Network	HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1	04/11/2024 07:12:28	1	Hardware
378	UEFI	One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.	04/11/2024 07:11:12	1	Configuration
377	UEFI	Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 9). The DIMM is mapped out and is currently not available.	04/11/2024 07:04:58	2	Hardware
376	CPU	Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000007, Status 0xBC000000'01010091, Address 0x0000004C'F09F92C0, Misc 0x200401C0'8B002086).	04/11/2024 07:04:00	1	Hardware
375	CPU	Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000004, Bank 0x00000001, Status 0xBD800000'00100134, Address 0x0000004C'F09F92C0, Misc 0x00000000'00000086).	04/11/2024 07:04:00	1	Hardware
374	UEFI	DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9)	04/11/2024 07:04:01	2	Hardware
373	CPU	Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000007, Status 0xBC000000'01010091, Address 0x0000004C'F3D652C0, Misc 0x20040AC5'08202086).	04/11/2024 07:02:44	1	Hardware
372	UEFI	DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 9)	04/11/2024 04:35:59	1	Hardware
371	CPU	Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000020, Bank 0x00000007, Status 0xBC000000'01010091, Address 0x0000004C'F5B752C0, Misc 0x20040AC5'0FE02086).	04/11/2024 04:35:59	1	Hardware
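
The Status values in the Machine Check Exception entries above can be unpacked with a short sketch like the one below. It assumes the log prints the raw 64-bit machine-check status register and that the upper flag bits follow the architectural layout in Intel's machine-check documentation (VAL, OVER, UC, EN, MISCV, ADDRV, PCC, plus S and AR on processors that support software error recovery). The function name `decode_mce_status` is made up for illustration, and the result should be read as a hint, not a diagnosis.

```python
# Assumed architectural flag bits of the machine-check status register.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # Misc register holds additional information
    58: "ADDRV",  # Address register holds the error address
    57: "PCC",    # processor context may be corrupted
    56: "S",      # signaled (only with software error recovery support)
    55: "AR",     # recovery action required (same caveat as S)
}


def decode_mce_status(text: str) -> dict:
    """Decode a status string such as 0xBC000000'01010091 (hypothetical helper)."""
    # The log separates the two 32-bit halves with an apostrophe.
    value = int(text.replace("'", ""), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (value >> bit) & 1]
    # The low 16 bits carry the MCA error code.
    return {"flags": flags, "mca_error_code": value & 0xFFFF}


for status in ("0xBC000000'01010091", "0xBD800000'00100134"):
    decoded = decode_mce_status(status)
    print(status, decoded["flags"], hex(decoded["mca_error_code"]))
```

Under these assumptions both status values have VAL and UC set, which matches the "Uncorrectable" wording of the log entries.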

Event Timeline

jcrespo renamed this task from db2101 crashed to Decommission db2101 (was: db2101 crashed). Mon, Apr 15, 9:21 AM
jcrespo updated the task description.

Change #1019689 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Remove db2101 from service and make it a spare

https://gerrit.wikimedia.org/r/1019689

Change #1019689 merged by Jcrespo:

[operations/puppet@production] mariadb: Remove db2101 from services

https://gerrit.wikimedia.org/r/1019689

Change #1019849 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Fully remove db2101 from puppet

https://gerrit.wikimedia.org/r/1019849

cookbooks.sre.hosts.decommission executed by jynus@cumin2002 for hosts: db2101.codfw.wmnet

  • db2101.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change #1019849 merged by Jcrespo:

[operations/puppet@production] mariadb: Fully remove db2101 from puppet

https://gerrit.wikimedia.org/r/1019849

jcrespo moved this task from Triage to Done on the Data-Persistence-Backup board.
jcrespo added a subscriber: ABran-WMF.

CC @ABran-WMF in case I missed something.

Jhancock.wm claimed this task.
Jhancock.wm updated the task description.