
cp2029 crashed, hardware memory error
Closed, Resolved (Public)

Description

cp2029 went offline at ~03:30 UTC on Dec. 24th. About 40 minutes later I power-cycled it via mgmt, and once it came back up I depooled it.

The kernel log shows:

Dec 24 02:57:30 cp2029 kernel: [27361439.958255] Disabling lock debugging due to kernel taint
Dec 24 02:57:30 cp2029 kernel: [27361439.958450] mce: Uncorrected hardware memory error in user-access at 25960fb0c0
Dec 24 02:57:30 cp2029 kernel: [27361439.958472] mce: [Hardware Error]: Machine check events logged
Dec 24 02:57:31 cp2029 kernel: [27361440.005909] Memory failure: 0x25960fb: Killing purged:19885 due to hardware memory corruption
Dec 24 02:57:31 cp2029 kernel: [27361440.014704] Memory failure: 0x25960fb: recovery action for dirty LRU page: Recovered
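
For anyone cross-checking the two numbers in the log above: the `mce` line reports a physical address, while the `Memory failure` lines report a page frame number. A minimal sketch of the relationship, assuming 4 KiB pages (a shift of 12 bits, the x86-64 default); the helper names and `PAGE_SHIFT` constant are illustrative, not taken from any tool used on this host:

```python
import re

# Two lines copied verbatim from the kernel log above.
LOG = """\
Dec 24 02:57:30 cp2029 kernel: [27361439.958450] mce: Uncorrected hardware memory error in user-access at 25960fb0c0
Dec 24 02:57:31 cp2029 kernel: [27361440.005909] Memory failure: 0x25960fb: Killing purged:19885 due to hardware memory corruption
"""

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def mce_address(log):
    """Return the physical address from the mce line, or None if absent."""
    m = re.search(r"memory error in user-access at ([0-9a-f]+)", log)
    return int(m.group(1), 16) if m else None


def failed_pfn(log):
    """Return the page frame number from a 'Memory failure' line, or None."""
    m = re.search(r"Memory failure: (0x[0-9a-f]+):", log)
    return int(m.group(1), 16) if m else None


addr = mce_address(LOG)
pfn = failed_pfn(LOG)

# Dropping the low 12 bits of the address gives the page frame number.
print(hex(addr >> PAGE_SHIFT), hex(pfn))
# The low 12 bits are the byte offset inside that page.
print(hex(addr & ((1 << PAGE_SHIFT) - 1)))
```

Under that assumption both log lines refer to the same page, so there is a single failing location rather than two.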

Tagging DC-ops since it seems to be a hardware error. I've left the host depooled.

Event Timeline

System Event Log shows a failure on DIMM A1:

-------------------------------------------------------------------------------
Record:      49
Date/Time:   12/24/2021 03:32:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
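
If the SEL needs to be checked again after the swap, the slot name can be pulled out of a record like the one above with a few lines of Python. This is only a sketch: it assumes the `Key: value` layout and the `DIMM_<letter><number>` naming exactly as printed in this record, and the function names are made up for illustration.

```python
import re

# The SEL record above, without the separator lines.
RECORD = """\
Record:      49
Date/Time:   12/24/2021 03:32:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
"""


def parse_sel_record(text):
    """Split 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def dimm_locations(description):
    """Return every DIMM slot name mentioned in a description string."""
    return re.findall(r"DIMM_[A-Z]\d+", description)


fields = parse_sel_record(RECORD)
print(fields["Severity"], dimm_locations(fields["Description"]))
```

Splitting on the first colon only keeps the timestamp in `Date/Time` intact, since the time itself contains colons.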

@Vgutierrez Happy new year! Can I power this server off so I can swap DIMM A1 with DIMM B1?

@Papaul yes, go ahead please. Happy new year :)

Icinga downtime set by vgutierrez@cumin1001 for 0:30:00 1 host(s) and their services with reason: Swapping faulty DIMM with B1

cp2029.codfw.wmnet

I swapped DIMM A1 with DIMM B1 to see if the error shows up on B1. I am leaving the task open for now.

Papaul triaged this task as Medium priority. Jan 3 2022, 3:53 PM

Checked the server today: no errors so far on DIMM B1, so I am closing the task. If the problem comes back, we can re-open it.