
cp5007 correctable mem errors
Closed, Resolved (Public)

Description

Front panel status LED is blinking amber.

dmesg:

[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   physical_address: 0x0000002b1d8edfc0
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22987 column: 888 
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: TSC 2e61c6c6d5ad2f 
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: ADDR 2b1d8edfc0 
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475119 SOCKET 0 APIC 0
[Sat Dec 22 10:41:31 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b1d8ed offset:0xfc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   physical_address: 0x0000002bec92de80
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22228 column: 888 
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:42:09 2018] mce: [Hardware Error]: Machine check events logged
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: TSC 2e61da064158a0 
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: ADDR 2bec92de80 
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475156 SOCKET 0 APIC 0
[Sat Dec 22 10:42:09 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2bec92d offset:0xe80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   physical_address: 0x0000000bf402df80
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888 
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:42:22 2018] mce: [Hardware Error]: Machine check events logged
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: TSC 2e61e07021f560 
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: ADDR bf402df80 
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475169 SOCKET 0 APIC 0
[Sat Dec 22 10:42:22 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xbf402d offset:0xf80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   physical_address: 0x0000000bf402df80
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888 
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: TSC 2e61e7b25d1fc3 
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: ADDR bf402df80 
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475183 SOCKET 0 APIC 0
[Sat Dec 22 10:42:36 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xbf402d offset:0xf80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:46:26 2018] mce_notify_irq: 1 callbacks suppressed
[Sat Dec 22 10:46:26 2018] mce: [Hardware Error]: Machine check events logged
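
For anyone triaging similar output, here is a minimal sketch for tallying corrected errors per DIMM label from the EDAC summary lines, and for checking how the page/offset fields relate to the reported physical address. The function names and the regular expression are illustrative, not part of any existing tool, and the 4 KiB page size (12-bit shift) is an assumption that happens to match the values logged above.

```python
import re
from collections import Counter

# Matches only the EDAC summary line ("EDAC MC0: 0 CE ... on <label> (channel:..."),
# not the "EDAC sbridge MC0:" detail lines. Pattern is an illustrative assumption.
EDAC_CE = re.compile(
    r"EDAC MC\d+: \d+ CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def count_corrected_errors(lines):
    """Return a Counter of corrected-error (CE) events keyed by DIMM label."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            counts[match.group("label")] += 1
    return counts


def split_address(addr, page_shift=12):
    """Split a physical address into (page, offset), assuming 4 KiB pages."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)
```

Fed the excerpt above, the four EDAC summary lines all carry the same label (CPU_SrcID#0_Ha#0_Chan#0_DIMM#1), and splitting ADDR 2b1d8edfc0 gives page 0x2b1d8ed and offset 0xfc0, matching the first summary line.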

SEL:

-------------------------------------------------------------------------------
Record:      14
Date/Time:   12/22/2018 09:36:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   12/22/2018 09:37:01
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
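
A rough sketch for turning this kind of SEL text into records that can be filtered by severity. It assumes only the layout shown above (key/value lines separated by dashed rules); parse_sel and non_ok are hypothetical helper names, not racadm features.

```python
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}


def parse_sel(text):
    """Parse SEL text (as pasted above) into a list of dicts, one per record."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("---"):
            # A dashed rule closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and key in FIELDS:
            current[key] = value.strip()
    if current:
        records.append(current)
    return records


def non_ok(records):
    """Return the records whose severity is anything other than 'Ok'."""
    return [r for r in records if r.get("Severity") != "Ok"]
```

Prompt lines and other text that do not start with one of the five field names are ignored, so a full console paste can be passed in unchanged.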


Event Timeline


Please note that Dell support typically requires the following steps to be taken for any memory replacement:

  • Update the BIOS firmware on the host to the latest revision
    • latest available version is 2.9.1, installed version is 2.5.4
  • Move the memory to a different DIMM slot: if the error follows the memory, it is a bad DIMM; if it stays with the slot, it is a bad mainboard
    • Hitting F11 during POST allows the running of system hardware tests.
  • Clear the SEL of all errors BEFORE running Dell diagnostics (diagnostics will fail if ANY errors are in the SEL)
  • Generate the Dell support log export AFTER the memory DIMM swap and firmware update, as those actions will be visible in the report.
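
The slot-swap reasoning in the list above can be sketched as a small decision function. This is only an illustration: diagnose_after_swap and its return strings are hypothetical, and it assumes you already know which slots reported errors after the swap (for example from the SEL).

```python
def diagnose_after_swap(old_slot, new_slot, error_slots_after):
    """Interpret post-swap errors for a suspect DIMM moved from old_slot to new_slot.

    error_slots_after: iterable of slot names that logged errors after the swap.
    """
    errors = set(error_slots_after)
    followed_dimm = new_slot in errors   # error moved with the module
    stayed_in_slot = old_slot in errors  # error stayed with the slot

    if followed_dimm and stayed_in_slot:
        return "ambiguous"      # both locations erroring; needs more testing
    if followed_dimm:
        return "bad DIMM"
    if stayed_in_slot:
        return "bad mainboard"
    return "inconclusive"       # no errors yet; keep monitoring
```

For example, diagnose_after_swap("A5", "A4", {"A4"}) returns "bad DIMM", while an empty error set returns "inconclusive".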

Please note I cannot find any other history of hardware failures or issues for this host.

Drivers link: https://www.dell.com/support/home/us/en/19/product-support/servicetag/2BDN9M2/drivers

Mentioned in SAL (#wikimedia-operations) [2019-02-21T17:54:03Z] <robh> cp5007 rebooting into bios update and hardware testing via T216716

3 $> ssh root@cp5007.mgmt.eqsin.wmnet
root@cp5007.mgmt.eqsin.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   10/31/2017 14:19:03
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/05/2017 16:57:16
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   12/05/2017 16:57:21
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   12/06/2017 09:58:33
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   12/06/2017 09:58:33
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   12/06/2017 16:17:36
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/06/2017 16:17:41
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   10/12/2018 14:10:35
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   10/12/2018 14:10:35
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   10/14/2018 16:20:23
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   10/14/2018 16:20:23
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   10/14/2018 19:05:27
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   10/14/2018 19:05:32
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   12/22/2018 09:36:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   12/22/2018 09:37:01
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
/admin1-> racadm clrsel

Then I ran the firmware update and got the following during POST:

Initializing PCIe, USB, and Video... Done
iDRAC IP:  10.132.129.107
Loading Lifecycle Controller Drivers...
Loading Lifecycle Controller Drivers...Done
Initializing Firmware Interfaces...
 

UEFI0107: One or more memory errors have occurred on memory slot: A1.
Remove input power to the system, reseat the DIMM module and restart the
system. If the issues persist, replace the faulty memory module identified in
the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.
 

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

It suggests polling the SEL:

/admin1-> racadm getsel
Record:      1
Date/Time:   02/21/2019 16:48:48
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/21/2019 17:58:51
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/21/2019 17:58:51
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
/admin1->

Second attempt to reboot into the BIOS flash was successful; the BIOS update is running now.

BIOS update successful. I've cleared the SEL so I can launch the Dell hardware testing utility.

Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash cart and confirms it's no longer running the test).

So, hardware testing has completed, both the quick and the in-depth testing offered by the Dell utility selected during POST.

However, the previous SEL entries (posted above) show issues in DIMM slots A5 and A1. The POST errors demonstrate this as well, so it may be best to open a support case to replace the mainboard.

If @ayounsi has a local copy of memtest86 to run, that may also help track down which DIMMs are bad. At this point we have either 2 bad DIMMs or a bad mainboard. Swapping out the DIMMs is far, far easier than a mainboard replacement.

Since onsite time is limited, it may be best for Arzhel to swap DIMM A1 to A2, and swap DIMM A5 to A4. This moves two questionable DIMMs to two slots that haven't reported errors.

This way, if more errors are reported on the new slots, we know it's bad DIMMs and can request replacement from Dell by showing the logs. (The Dell support report lists all hardware changes in the system, so they'll see the migration of memory for testing.)

Swapped A1 with A2 and A4 with A5

Mentioned in SAL (#wikimedia-operations) [2019-02-22T17:14:09Z] <bblack> cp5007: repooling into service - T216716

As of 2019-02-25 @ 19:12 there are no memory errors logged since the DIMM slot swap.

ema triaged this task as Medium priority. Mar 6 2019, 10:11 AM
ema added a subscriber: ema.

Can this be closed?

RobH changed the task status from Open to Stalled. Mar 6 2019, 5:05 PM
RobH lowered the priority of this task from Medium to Low.

I'm keeping them open for a month after the memory swap for followup.

/admin1-> racadm getsel
Record:      1
Date/Time:   02/21/2019 18:11:12
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/22/2019 02:07:16
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/22/2019 02:07:21
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
/admin1->

no further errors, resolving