cp1080 failed initial install, reports DIMM A5 has uncorrectable errors on bootup
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | | ayounsi | T199142 Increase network capacity (2018-19 Q1 Goal)
Resolved | | ayounsi | T187962 Rack/cable/configure asw2-c-eqiad switch stack
Unknown Object (Task) | | |
Resolved | | BBlack | T195923 rack/setup/install cp1075-cp1090
Resolved | | Cmjohnson | T201174 cp1080 uncorrectable DIMM error slot A5
Event Timeline
Description: A problem was detected in Memory Reference Code (MRC).
Record: 79
Date/Time: 08/02/2018 14:50:45
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
Record: 80
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Ok
Description: A problem was detected in Memory Reference Code (MRC).
Record: 81
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
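For anyone triaging a longer log, here is a minimal sketch of how entries in this text form could be tallied per DIMM slot. It assumes the log is available as plain text in the `Key: value` layout pasted above; the helper names `parse_sel` and `critical_dimms` are hypothetical and not part of any Dell or iDRAC tooling.

```python
import re

# Sample input copied from the records above (assumed plain-text export).
SEL_TEXT = """\
Record: 79
Date/Time: 08/02/2018 14:50:45
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
Record: 80
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Ok
Description: A problem was detected in Memory Reference Code (MRC).
Record: 81
Date/Time: 08/02/2018 15:14:19
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
"""


def parse_sel(text):
    """Split the text into one dict per record; a 'Record:' line starts a new one."""
    records, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that are not 'Key: value'
        if key == "Record":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def critical_dimms(records):
    """Count Critical records per DIMM slot named in the description."""
    counts = {}
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        for slot in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            counts[slot] = counts.get(slot, 0) + 1
    return counts


print(critical_dimms(parse_sel(SEL_TEXT)))  # {'DIMM_A5': 2}
```

Both Critical entries name the same slot, which is what motivates a swap test on that DIMM.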
I swapped the DIMM in A5 with the DIMM in B5 to see whether the error follows the DIMM, then cleared the log.
Change 450582 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp1080: remove from conftool/hieradata lists
Change 450582 merged by BBlack:
[operations/puppet@production] cp1080: remove from conftool/hieradata lists
First attempt to reboot for PXE install stops now with:
UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot B5 is disabled because of initialization errors caused by uncorrectable memory errors, invalid configuration, and others. Check the System Event Log (SEL) or the Lifecycle Controller Log and replace the identified DIMM. UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning. Check the System Event Log (SEL) to identify the non-functioning DIMM, and then replace it.
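The swap test reads as follows: the error was first reported against A5, the module was moved to B5, and the error is now reported against B5, so the fault appears to travel with the module and not stay with the slot. A small sketch of that decision rule, where the function name `diagnose_swap` and its return labels are illustrative assumptions only:

```python
def diagnose_swap(slot_before, slot_after, swapped_with):
    """Interpret a DIMM swap test.

    slot_before:  slot that reported the error before the swap
    slot_after:   slot that reports the error after the swap
    swapped_with: slot the suspect module was moved into
    """
    if slot_after == swapped_with:
        return "module"        # error followed the DIMM: suspect the module
    if slot_after == slot_before:
        return "slot"          # error stayed put: suspect the slot or board
    return "inconclusive"      # error appeared somewhere unexpected


print(diagnose_swap("A5", "B5", "B5"))  # module
```

With the values from this task the result is `module`, which is consistent with requesting a replacement DIMM.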
Created a self dispatch with Dell for a new DIMM.
You have successfully submitted request SR977877163.
@BBlack The DIMM has been replaced with a new one. Please resolve the task once satisfied.
Return Tracking
USPS 9202 3946 5301 2439 4635 97
FEDEX 9611918 2393026 76213617
Change 451678 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "cp1080: remove from conftool/hieradata lists"
Change 451678 merged by BBlack:
[operations/puppet@production] Revert "cp1080: remove from conftool/hieradata lists"
Mentioned in SAL (#wikimedia-operations) [2018-08-30T20:06:48Z] <mutante> dzahn@neodymium conftool action : set/pooled=no; selector: name=cp1080.eqiad.wmnet| reason: Strongswan CRITICALs from Icinga (T201174)
Mentioned in SAL (#wikimedia-operations) [2018-08-30T20:23:01Z] <mutante> cp1080 - powercycled - lots of RECOVERY from Icinga for IPsec connections - leaving depooled so far (T201174)