cp1080 uncorrectable DIMM error slot A5
Closed, Resolved · Public

Description

cp1080 failed its initial install and reports that DIMM A5 has uncorrectable errors on bootup.

Event Timeline

BBlack created this task. · Aug 3 2018, 12:57 PM
Restricted Application added a project: Operations. · Aug 3 2018, 12:57 PM
ema moved this task from Triage to Hardware on the Traffic board. · Aug 3 2018, 1:40 PM

Description: A problem was detected in Memory Reference Code (MRC).

Record: 79
Date/Time: 08/02/2018 14:50:45
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.

Record: 80
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 81
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.

I swapped the DIMM in A5 with the DIMM in B5 to see whether the error follows the DIMM, then cleared the log.
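For anyone re-checking the log after a swap like this, here is a minimal sketch of pulling the affected DIMM locations out of the records pasted above. It assumes that each `Description:` line belongs to the `Record:` block before it, which is how the paste appears to be laid out. The function names `parse_sel` and `critical_dimms` are hypothetical helpers, not part of any Dell or Wikimedia tooling.

```python
import re

# Records 79-81 as pasted in this task (the truncated leading entry is omitted).
SEL_TEXT = """\
Record: 79
Date/Time: 08/02/2018 14:50:45
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.

Record: 80
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 81
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
"""


def parse_sel(text):
    """Split 'Key: value' lines into one dict per 'Record:' block."""
    records = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank or free-form line
        key, value = key.strip(), value.strip()
        if key == "Record":
            current = {"Record": value}
            records.append(current)
        elif current is not None:
            # Assumption: fields (including Description) follow their Record line.
            current[key] = value
    return records


def critical_dimms(records):
    """Return the sorted DIMM slots named in Critical records."""
    slots = set()
    for rec in records:
        if rec.get("Severity") == "Critical":
            slots.update(re.findall(r"DIMM_([A-Z]\d+)", rec.get("Description", "")))
    return sorted(slots)


if __name__ == "__main__":
    print(critical_dimms(parse_sel(SEL_TEXT)))
```

Running the same check against a freshly cleared log after the swap would show whether new Critical entries name A5, B5, or nothing at all.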

Change 450582 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1080: remove from conftool/hieradata lists

https://gerrit.wikimedia.org/r/450582

Change 450582 merged by BBlack:
[operations/puppet@production] cp1080: remove from conftool/hieradata lists

https://gerrit.wikimedia.org/r/450582

I checked the log today and the error has not returned.

BBlack added a comment. · Aug 6 2018, 4:35 PM

OK, I'll take a stab at another imaging attempt today and see how it goes, thanks!

BBlack added a comment. · Aug 6 2018, 5:03 PM

First attempt to reboot for PXE install stops now with:

UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot B5 is
disabled because of initialization errors caused by uncorrectable memory
errors, invalid configuration, and others.
Check the System Event Log (SEL) or the Lifecycle Controller Log and replace
the identified DIMM.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.
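The swap test reads as follows: the error was first logged against A5, the A5 and B5 modules were exchanged, and the firmware now disables B5. A minimal sketch of that inference is below. The helper names `slot_from_uefi` and `diagnose_swap` are hypothetical, and the regular expression only assumes the "memory slot B5" wording seen in the UEFI0339 message above.

```python
import re


def slot_from_uefi(message):
    """Extract the slot name from a UEFI0339-style message, or None."""
    # Collapse line wrapping so the phrase is matched even if it was split.
    flat = " ".join(message.split())
    match = re.search(r"memory slot ([A-Z]\d+)", flat)
    return match.group(1) if match else None


def diagnose_swap(slot_before, slot_after, swapped_with):
    """Classify a swap test.

    slot_before:  slot that reported errors before the swap
    slot_after:   slot that reports errors after the swap (None if no error yet)
    swapped_with: slot whose module was exchanged with slot_before
    """
    if slot_after is None:
        return "inconclusive"  # no recurrence yet, keep watching
    if slot_after == swapped_with:
        return "dimm"  # the error moved with the module
    if slot_after == slot_before:
        return "slot"  # the error stayed put: slot, channel or board
    return "inconclusive"


if __name__ == "__main__":
    msg = (
        "UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot B5 is\n"
        "disabled because of initialization errors caused by uncorrectable memory\n"
        "errors, invalid configuration, and others."
    )
    print(diagnose_swap("A5", slot_from_uefi(msg), "B5"))
```

A result of "dimm" points at the module itself rather than the slot, which is consistent with requesting a replacement DIMM.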

Created a self dispatch with Dell for a new DIMM.

You have successfully submitted request SR977877163.

@BBlack The DIMM has been replaced with a new one; please resolve the task once satisfied.

Return Tracking

USPS 9202 3946 5301 2439 4635 97
FEDEX 9611918 2393026 76213617

Change 451678 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "cp1080: remove from conftool/hieradata lists"

https://gerrit.wikimedia.org/r/451678

Change 451678 merged by BBlack:
[operations/puppet@production] Revert "cp1080: remove from conftool/hieradata lists"

https://gerrit.wikimedia.org/r/451678

BBlack closed this task as Resolved. · Aug 10 2018, 12:39 AM

Seems to be working fine now, thanks!

Mentioned in SAL (#wikimedia-operations) [2018-08-30T20:06:48Z] <mutante> dzahn@neodymium conftool action : set/pooled=no; selector: name=cp1080.eqiad.wmnet| reason: Strongswan CRITICALs fom Icinga (T201174)

Mentioned in SAL (#wikimedia-operations) [2018-08-30T20:23:01Z] <mutante> cp1080 - powercycled - lots of RECOVERY from Icinga for IPsec connections - leaving depooled so far (T201174)