
hw troubleshooting: memory error for elastic1097
Closed, Resolved | Public | Request

Description

Hi!

I see the following when rebooting elastic1097:

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.

@Cmjohnson could you please check when you have a moment?

  • - Provide FQDN of system.
  • - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • - Put system into a failed state in Netbox. - Was staged, so it's not online for use yet, and is offline in Icinga.
  • - Provide urgency of request, along with justification (redundancy, dependencies, etc) - 1 of 14 newly staged hosts, so it is likely not highly urgent, but requires sub-team feedback.
  • - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Event Timeline

RobH updated the task description.

This is a newly racked host, so it may just need the memory reseated, as DIMMs can come unseated during shipment. If reseating doesn't fix it, then we'll need to put in a self dispatch for a new DIMM.

Since the memory error shows on POST, we'll know right away if it clears up.

UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot A2 is
disabled because of initialization errors caused by uncorrectable memory
errors, invalid configuration, and others.
Check the System Event Log (SEL) or the Lifecycle Controller Log and replace
the identified DIMM.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.
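
For anyone triaging similar POST output, here is a minimal sketch of pulling the UEFI message IDs and the named DIMM slot out of pasted console text. It only does plain text matching on the messages quoted above; the function name `summarize_post_errors` and the regular expressions are illustrative assumptions, not part of any Dell or Wikimedia tooling, and the SEL remains the authoritative source for the failing DIMM.

```python
import re

# Patterns are assumptions based only on the two messages quoted in this task.
MESSAGE_ID = re.compile(r"^(UEFI\d{4}):", re.MULTILINE)
SLOT = re.compile(r"memory\s+slot\s+([A-Z]\d+)")


def summarize_post_errors(post_text: str) -> dict:
    """Return the UEFI message IDs seen and any DIMM slots they name."""
    ids = MESSAGE_ID.findall(post_text)
    # Collapse line wraps so a phrase split across lines still matches.
    flat = " ".join(post_text.split())
    slots = sorted(set(SLOT.findall(flat)))
    return {"message_ids": ids, "slots": slots}


sample = """\
UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot A2 is
disabled because of initialization errors caused by uncorrectable memory
errors, invalid configuration, and others.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
"""

print(summarize_post_errors(sample))
```

UEFI0058 on its own does not name a slot, so an empty `slots` list means the SEL still has to be checked by hand.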

Dell request for a new DIMM placed: "You have successfully submitted request SR1091181415."

Replaced the DIMM and cleared the log.

Mentioned in SAL (#wikimedia-operations) [2022-06-02T20:16:22Z] <ryankemper> T306449 Marked elastic1097 as Staged in Netbox (was previously failed, but fixed in https://phabricator.wikimedia.org/T306449#7888260)