amber light on cp5006/5007
Closed, Resolved · Public

Description

When I walked out of eqsin I noticed that cp5006/5007 had an amber blinking light that wasn't there when I walked in.
I checked all cables and Icinga but couldn't find anything wrong.
Looking at dmesg they both have:

ayounsi@cp5007:~$ sudo dmesg | grep Error
[    3.503337] ERST: Error Record Serialization Table (ERST) support is initialized.
[4929877.901512] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929877.901515] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929877.901516] {1}[Hardware Error]: event severity: corrected
[4929877.901518] {1}[Hardware Error]:  Error 0, type: corrected
[4929877.901519] {1}[Hardware Error]:  fru_text: A5
[4929877.901520] {1}[Hardware Error]:   section_type: memory error
[4929877.901521] {1}[Hardware Error]:   error_status: 0x0000000000000400
[4929877.901522] {1}[Hardware Error]:   physical_address: 0x0000002b1d8edfc0
[4929877.901526] {1}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22987 column: 888 
[4929877.901527] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[4929915.523101] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929915.523104] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929915.523105] {2}[Hardware Error]: event severity: corrected
[4929915.523107] {2}[Hardware Error]:  Error 0, type: corrected
[4929915.523108] {2}[Hardware Error]:  fru_text: A5
[4929915.523109] {2}[Hardware Error]:   section_type: memory error
[4929915.523110] {2}[Hardware Error]:   error_status: 0x0000000000000400
[4929915.523111] {2}[Hardware Error]:   physical_address: 0x0000002bec92de80
[4929915.523114] {2}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22228 column: 888 
[4929915.523115] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[4929915.523125] mce: [Hardware Error]: Machine check events logged
[4929928.059079] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929928.059081] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929928.059082] {3}[Hardware Error]: event severity: corrected
[4929928.059084] {3}[Hardware Error]:  Error 0, type: corrected
[4929928.059085] {3}[Hardware Error]:  fru_text: A5
[4929928.059086] {3}[Hardware Error]:   section_type: memory error
[4929928.059088] {3}[Hardware Error]:   error_status: 0x0000000000000400
[4929928.059089] {3}[Hardware Error]:   physical_address: 0x0000000bf402df80
[4929928.059092] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888 
[4929928.059093] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[4929928.059102] mce: [Hardware Error]: Machine check events logged
[4929942.246950] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929942.246952] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929942.246953] {4}[Hardware Error]: event severity: corrected
[4929942.246954] {4}[Hardware Error]:  Error 0, type: corrected
[4929942.246956] {4}[Hardware Error]:  fru_text: A5
[4929942.246957] {4}[Hardware Error]:   section_type: memory error
[4929942.246958] {4}[Hardware Error]:   error_status: 0x0000000000000400
[4929942.246959] {4}[Hardware Error]:   physical_address: 0x0000000bf402df80
[4929942.246963] {4}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888 
[4929942.246964] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[4930172.678164] mce: [Hardware Error]: Machine check events logged

So maybe the light cycles between an error occurring and the hardware self-correcting it?
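
To see at a glance whether every corrected error points at the same DIMM, the dmesg output can be grouped by record number and tallied by `fru_text`. The following is only a minimal sketch, assuming the line format shown in the log above; the function names `parse_ghes_records` and `summarize` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Matches record lines such as:
#   [4929877.901518] {1}[Hardware Error]:  fru_text: A5
# Lines without a {N} record number (e.g. the "mce:" lines) are skipped.
RECORD_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\{(?P<seq>\d+)\}\[Hardware Error\]:\s+(?P<body>.*)$"
)

# Fields worth keeping from each record (assumed from the log format above).
WANTED = ("event severity", "fru_text", "section_type",
          "physical_address", "error_type")


def parse_ghes_records(dmesg_text):
    """Group '{N}[Hardware Error]' lines by N and collect selected fields."""
    records = {}
    for line in dmesg_text.splitlines():
        m = RECORD_LINE.match(line.strip())
        if not m:
            continue
        # The timestamp of the first line seen for a record is kept.
        rec = records.setdefault(
            int(m.group("seq")), {"timestamp": float(m.group("ts"))}
        )
        key, sep, value = m.group("body").strip().partition(": ")
        if sep and key in WANTED:
            rec[key] = value.strip()
    return records


def summarize(records):
    """Count records per DIMM label (fru_text)."""
    return Counter(rec.get("fru_text", "unknown") for rec in records.values())
```

Passing the text of `sudo dmesg` to `parse_ghes_records` and then to `summarize` gives one count per DIMM label; the sketch does not interpret `error_status` or the physical addresses.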

I'm in Singapore until Sunday morning, please let me know if there is anything I can do onsite.

Related Objects

Status    Subtype    Assigned    Task
Resolved             RobH
Resolved             RobH
Resolved             RobH

Event Timeline

ayounsi triaged this task as High priority. (Feb 21 2019, 9:58 AM)
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

So I updated the BIOS on cp5007, and this happened during POST:

UEFI0107: One or more memory errors have occurred on memory slot: A1.
Remove input power to the system, reseat the DIMM module and restart the
system. If the issues persist, replace the faulty memory module identified in
the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.
RobH claimed this task.
RobH changed the status of subtask T216716: cp5007 correctable mem errors from Open to Stalled.
RobH moved this task from Backlog to Hardware Failure / Repair on the ops-eqsin board.
RobH changed the status of subtask T216717: cp5006 correctable mem errors from Open to Stalled.