cp5007 correctable mem errors
Closed, Resolved · Public

Description

Front panel status LED is blinking amber.

dmesg:

[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   physical_address: 0x0000002b1d8edfc0
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22987 column: 888 
[Sat Dec 22 10:41:31 2018] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: TSC 2e61c6c6d5ad2f 
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: ADDR 2b1d8edfc0 
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:41:31 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475119 SOCKET 0 APIC 0
[Sat Dec 22 10:41:31 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2b1d8ed offset:0xfc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   physical_address: 0x0000002bec92de80
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22228 column: 888 
[Sat Dec 22 10:42:09 2018] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:42:09 2018] mce: [Hardware Error]: Machine check events logged
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: TSC 2e61da064158a0 
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: ADDR 2bec92de80 
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:42:09 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475156 SOCKET 0 APIC 0
[Sat Dec 22 10:42:09 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x2bec92d offset:0xe80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   physical_address: 0x0000000bf402df80
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888 
[Sat Dec 22 10:42:22 2018] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:42:22 2018] mce: [Hardware Error]: Machine check events logged
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: TSC 2e61e07021f560 
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: ADDR bf402df80 
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:42:22 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475169 SOCKET 0 APIC 0
[Sat Dec 22 10:42:22 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xbf402d offset:0xf80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]: event severity: corrected
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:  Error 0, type: corrected
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:  fru_text: A5
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   section_type: memory error
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   error_status: 0x0000000000000400
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   physical_address: 0x0000000bf402df80
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888 
[Sat Dec 22 10:42:36 2018] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: TSC 2e61e7b25d1fc3 
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: ADDR bf402df80 
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: MISC 0 
[Sat Dec 22 10:42:36 2018] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1545475183 SOCKET 0 APIC 0
[Sat Dec 22 10:42:36 2018] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xbf402d offset:0xf80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Sat Dec 22 10:46:26 2018] mce_notify_irq: 1 callbacks suppressed
[Sat Dec 22 10:46:26 2018] mce: [Hardware Error]: Machine check events logged
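
The EDAC lines above describe the same faults as the APEI records in a different form: the `page` and `offset` fields look like the `physical_address` split at a page boundary. A minimal Python sketch, assuming 4 KiB pages (the helper name `split_address` is made up for illustration), that checks this against the addresses in the log:

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, as is usual on x86-64


def split_address(physical_address: int) -> tuple[int, int]:
    """Split a physical address into (page frame number, offset within page)."""
    page = physical_address >> PAGE_SHIFT
    offset = physical_address & ((1 << PAGE_SHIFT) - 1)
    return page, offset


# Addresses copied from the dmesg output above.
for addr in (0x2B1D8EDFC0, 0x2BEC92DE80, 0xBF402DF80):
    page, offset = split_address(addr)
    print(f"addr={addr:#x} page={page:#x} offset={offset:#x}")
```

Under that assumption the results line up with the `page:` and `offset:` values EDAC printed for DIMM#1 on channel 0.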

SEL:

-------------------------------------------------------------------------------
Record:      14
Date/Time:   12/22/2018 09:36:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   12/22/2018 09:37:01
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
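
For follow-up it can help to turn SEL text like the above into records that can be filtered. A rough Python sketch, assuming the layout is exactly what is pasted in this task (field names followed by a colon, records separated by dashed lines); `parse_sel` and `dimms_with_errors` are hypothetical helpers, not part of racadm:

```python
import re

FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}
SEPARATOR = re.compile(r"^-{5,}\s*$")
DIMM = re.compile(r"DIMM_([A-Z]\d+)")


def parse_sel(text: str) -> list[dict[str, str]]:
    """Parse pasted SEL text into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:  # a dashed line closes the record in progress
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def dimms_with_errors(records: list[dict[str, str]]) -> list[str]:
    """Return the sorted DIMM slots named in records whose severity is not Ok."""
    slots = set()
    for record in records:
        if record.get("Severity") != "Ok":
            slots.update(DIMM.findall(record.get("Description", "")))
    return sorted(slots)


SAMPLE = """\
-------------------------------------------------------------------------------
Record:      14
Date/Time:   12/22/2018 09:36:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   12/22/2018 09:37:01
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
"""

records = parse_sel(SAMPLE)
print(dimms_with_errors(records))
```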

Related Objects

StatusAssignedTask
ResolvedRobH
ResolvedRobH

Event Timeline

BBlack created this task. Feb 21 2019, 2:21 PM
Restricted Application added a project: Operations. Feb 21 2019, 2:21 PM
Restricted Application added a subscriber: Aklapper.
RobH added a subscriber: RobH. Edited Feb 21 2019, 4:52 PM

Please note that Dell support typically requires the following steps to be taken for any memory replacement:

  • Update the BIOS firmware on the host to the latest revision
    • latest available version is 2.9.1; installed version is 2.5.4
  • Move the memory to a different DIMM slot: if the error follows the memory, it is a bad DIMM; if it stays with the slot, it is a bad mainboard
    • Hitting F11 during POST allows the system hardware tests to be run.
  • Clear the SEL of all errors BEFORE running Dell diagnostics (diagnostics will fail if ANY errors are in the SEL)
  • Generate the Dell support log export AFTER the DIMM swap and firmware update, as those actions will be visible in the report.
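
The first step comes down to comparing the installed BIOS version (2.5.4) with the latest available one (2.9.1). A small Python sketch of that comparison, assuming plain dotted numeric version strings; `parse_version` and `needs_update` are illustrative names, not a Dell tool:

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '2.5.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def needs_update(installed: str, latest: str) -> bool:
    """True when the installed version is older than the latest one."""
    return parse_version(installed) < parse_version(latest)


print(needs_update("2.5.4", "2.9.1"))
```

Comparing tuples of integers rather than raw strings avoids the case where "2.10.0" would sort before "2.9.1".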

Please note I cannot find any other history of hardware failures or issues for this host.

Drivers link: https://www.dell.com/support/home/us/en/19/product-support/servicetag/2BDN9M2/drivers

Mentioned in SAL (#wikimedia-operations) [2019-02-21T17:54:03Z] <robh> cp5007 rebooting into bios update and hardware testing via T216716

RobH added a comment. Feb 21 2019, 6:07 PM
3 $> ssh root@cp5007.mgmt.eqsin.wmnet
root@cp5007.mgmt.eqsin.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   10/31/2017 14:19:03
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/05/2017 16:57:16
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   12/05/2017 16:57:21
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   12/06/2017 09:58:33
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   12/06/2017 09:58:33
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   12/06/2017 16:17:36
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/06/2017 16:17:41
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   10/12/2018 14:10:35
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   10/12/2018 14:10:35
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   10/14/2018 16:20:23
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   10/14/2018 16:20:23
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   10/14/2018 19:05:27
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   10/14/2018 19:05:32
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   12/22/2018 09:36:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   12/22/2018 09:37:01
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
/admin1-> racadm clrsel

Then I ran the firmware update and got the following during POST:

Initializing PCIe, USB, and Video... Done
iDRAC IP:  10.132.129.107
Loading Lifecycle Controller Drivers...
Loading Lifecycle Controller Drivers...Done
Initializing Firmware Interfaces...
 

UEFI0107: One or more memory errors have occurred on memory slot: A1.
Remove input power to the system, reseat the DIMM module and restart the
system. If the issues persist, replace the faulty memory module identified in
the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.
 

Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

It suggests polling the SEL:

/admin1-> racadm getsel
Record:      1
Date/Time:   02/21/2019 16:48:48
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/21/2019 17:58:51
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/21/2019 17:58:51
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
/admin1->

The second attempt to reboot into the BIOS flash was successful; the BIOS update is running now.

RobH added a comment. Feb 21 2019, 6:11 PM

BIOS update successful. I've cleared the SEL so I can launch the Dell hardware testing utility.

RobH added a subscriber: ayounsi. Feb 21 2019, 6:14 PM

Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash cart and confirms it's no longer running the test).

RobH added a comment. Feb 21 2019, 9:43 PM

Hardware testing has completed: both the quick and the in-depth tests offered by the Dell utility selected during POST.

However, the previous SEL entries (posted above) show issues in DIMM slots A5 and A1. The POST errors demonstrate this as well, so it may be best to open a support case to replace the mainboard.

If @ayounsi has a local copy of memtest86 to run, that may also help track down which DIMMs are bad. At this point we have either two bad DIMMs or a bad mainboard. Swapping out the DIMMs is far, far easier than a mainboard replacement.

RobH assigned this task to ayounsi. Feb 21 2019, 9:46 PM

Since onsite time is limited, it may be best for Arzhel to swap DIMM A1 to A2 and DIMM A5 to A4. This moves the two questionable DIMMs to two slots that haven't reported errors.

This way, if more errors are reported on the new slots, we know it's the DIMMs that are bad and can request replacements from Dell by showing the logs. (The Dell support report lists all hardware changes in the system, so they'll see the migration of memory for testing.)
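
The reasoning behind the swap can be written down as a small decision rule. This is only a sketch in Python of the logic described in this task (errors that follow the DIMM point at the DIMM, errors that stay with the slot point at the mainboard); `classify_after_swap` and its wording are made up for illustration:

```python
def classify_after_swap(moves: dict[str, str], errors_after: set[str]) -> dict[str, str]:
    """Judge each suspect DIMM after it has been moved.

    moves        maps the original slot of a suspect DIMM to its new slot.
    errors_after is the set of slots reporting memory errors after the swap.
    """
    verdicts = {}
    for old_slot, new_slot in moves.items():
        if new_slot in errors_after:
            # Checked first: if both slots report errors, the DIMM is still suspect.
            verdicts[old_slot] = "errors followed the DIMM: suspect the DIMM"
        elif old_slot in errors_after:
            verdicts[old_slot] = "errors stayed with the slot: suspect the mainboard"
        else:
            verdicts[old_slot] = "no errors yet: keep monitoring"
    return verdicts


# The swap proposed here: A1 -> A2 and A5 -> A4, with no errors seen afterwards.
for slot, verdict in classify_after_swap({"A1": "A2", "A5": "A4"}, set()).items():
    print(slot, "->", verdict)
```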

Mentioned in SAL (#wikimedia-operations) [2019-02-22T01:58:21Z] <XioNoX> power-down cp5007 - T216716

Swapped A1 with A2 and A4 with A5

ayounsi reassigned this task from ayounsi to RobH. Feb 22 2019, 2:14 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-22T17:14:09Z] <bblack> cp5007: repooling into service - T216716

RobH added a comment. Feb 25 2019, 7:12 PM

As of 2019-02-25 @ 19:12, no memory errors have been logged since the DIMM slot swap.

ema moved this task from Triage to Hardware on the Traffic board. Mar 6 2019, 9:56 AM
ema triaged this task as Normal priority. Mar 6 2019, 10:11 AM
ema added a subscriber: ema.

Can this be closed?

RobH changed the task status from Open to Stalled. Mar 6 2019, 5:05 PM
RobH lowered the priority of this task from Normal to Low.

I'm keeping them open for a month after the memory swap for followup.

ayounsi removed a subscriber: ayounsi. Mar 6 2019, 5:34 PM
RobH closed this task as Resolved. Jul 3 2019, 6:20 PM
/admin1-> racadm getsel
Record:      1
Date/Time:   02/21/2019 18:11:12
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/22/2019 02:07:16
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/22/2019 02:07:21
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
/admin1->

No further errors; resolving.

Restricted Application added a subscriber: Liuxinyu970226. Jul 3 2019, 6:20 PM