I tried to reimage db1164 as part of T303171, but it never came back up. Connecting to the mgmt interface, I see nothing on the console, even after running racadm serveraction hardreset and racadm racreset.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T291916 Tracking task for Bullseye migrations in production
Resolved | | Marostegui | T298585 Upgrade WMF database-and-backup-related hosts to bullseye
Resolved | | Ladsgroup | T303171 Upgrade s1 to Bullseye
Resolved | | Cmjohnson | T307198 db1164 fails to POST/boot/etc
Event Timeline
This is the sign of a failed DIMM: during POST, the host fails in the memory-check phase. I attempted to reboot the system to trigger the "self-heal" process, but that failed. I will request a DIMM replacement from Dell. The SEL shows:
Record: 2
Date/Time: 01/22/2022 11:18:41
Source: system
Severity: Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_A8. Reboot system to initiate self-heal process.
racadm>>
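For anyone triaging similar records, here is a minimal sketch of pulling the affected slot out of a SEL entry like the one above. It assumes the record has already been saved as plain "Key: value" text; the helper names parse_sel_record and find_dimm_slots are hypothetical and not part of racadm or any WMF tooling, and the DIMM_<letter><number> slot pattern is an assumption based on this single record.

```python
import re

# Sample SEL record, copied from the output above.
SEL_TEXT = """\
Record: 2
Date/Time: 01/22/2022 11:18:41
Source: system
Severity: Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_A8. Reboot system to initiate self-heal process.
"""


def parse_sel_record(text):
    """Split 'Key: value' lines into a dict (hypothetical helper)."""
    record = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so timestamps keep their colons.
        key, sep, value = line.partition(": ")
        if sep:
            record[key.strip()] = value.strip()
    return record


def find_dimm_slots(description):
    """Return slot names such as DIMM_A8 (pattern is an assumption)."""
    return re.findall(r"\bDIMM_[A-Z]\d+\b", description)


record = parse_sel_record(SEL_TEXT)
slots = find_dimm_slots(record["Description"])
print(record["Severity"], slots)
```

The slot name is what the DIMM replacement request needs, so extracting it mechanically can help when a host logs several such records.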
I've set the host to 'failed' in netbox: https://netbox.wikimedia.org/dcim/devices/2999/
DIMM replaced and the host booted into the OS. I was able to update the firmware while it was offline.