mw2251 failed memory dimm
Closed, Resolved. Public.

Description

The host hung a few minutes ago: unresponsive on the mgmt console, no ping, so I had to reboot it. Found these in the kernel log:

Nov 22 18:30:38 mw2251 kernel: [53617.891710] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Nov 22 18:30:38 mw2251 kernel: [53617.891714] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Nov 22 18:30:38 mw2251 kernel: [53617.891716] {1}[Hardware Error]: event severity: corrected
Nov 22 18:30:38 mw2251 kernel: [53617.891719] {1}[Hardware Error]:  Error 0, type: corrected
Nov 22 18:30:38 mw2251 kernel: [53617.891720] {1}[Hardware Error]:  fru_text: A1
Nov 22 18:30:38 mw2251 kernel: [53617.891722] {1}[Hardware Error]:   section_type: memory error
Nov 22 18:30:38 mw2251 kernel: [53617.891725] {1}[Hardware Error]:   error_status: 0x0000000000000400
Nov 22 18:30:38 mw2251 kernel: [53617.891727] {1}[Hardware Error]:   physical_address: 0x00000007a3602400
Nov 22 18:30:38 mw2251 kernel: [53617.891732] {1}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 49592 column: 64 
Nov 22 18:30:38 mw2251 kernel: [53617.891734] {1}[Hardware Error]:   error_type: 2, single-bit ECC
Nov 22 18:30:38 mw2251 kernel: [53617.891756] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Nov 22 18:30:38 mw2251 kernel: [53617.891760] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
Nov 22 18:30:38 mw2251 kernel: [53617.891762] EDAC sbridge MC0: TSC 6b545750aad2 
Nov 22 18:30:38 mw2251 kernel: [53617.891764] EDAC sbridge MC0: ADDR 7a3602400 
Nov 22 18:30:38 mw2251 kernel: [53617.891766] EDAC sbridge MC0: MISC 0 
Nov 22 18:30:38 mw2251 kernel: [53617.891769] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1511375438 SOCKET 0 APIC 0
Nov 22 18:30:38 mw2251 kernel: [53617.891798] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7a3602 offset:0x400 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:1)
Nov 22 18:30:40 mw2251 kernel: [53618.219309] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 135.996 msecs
Nov 22 18:30:41 mw2251 kernel: [53618.219314] perf: interrupt took too long (356655 > 6203), lowering kernel.perf_event_max_sample_rate to 500
Nov 22 18:30:41 mw2251 kernel: [53618.264509] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Nov 22 18:30:41 mw2251 kernel: [53618.264511] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Nov 22 18:30:41 mw2251 kernel: [53618.264512] {2}[Hardware Error]: event severity: corrected
Nov 22 18:30:41 mw2251 kernel: [53618.264513] {2}[Hardware Error]:  Error 0, type: corrected
Nov 22 18:30:41 mw2251 kernel: [53618.264514] {2}[Hardware Error]:  fru_text: A1
Nov 22 18:30:41 mw2251 kernel: [53618.264515] {2}[Hardware Error]:   section_type: memory error
Nov 22 18:30:41 mw2251 kernel: [53618.264516] {2}[Hardware Error]:   error_status: 0x0000000000000400
Nov 22 18:30:41 mw2251 kernel: [53618.264518] {2}[Hardware Error]:   physical_address: 0x00000007a3603200
Nov 22 18:30:41 mw2251 kernel: [53618.264521] {2}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 1 bank: 2 row: 49592 column: 288 
Nov 22 18:30:41 mw2251 kernel: [53618.264522] {2}[Hardware Error]:   error_type: 2, single-bit ECC
Nov 22 18:30:41 mw2251 kernel: [53618.264529] mce: [Hardware Error]: Machine check events logged
Nov 22 18:30:41 mw2251 kernel: [53618.780860] INFO: NMI handler (ghes_notify_nmi) took too long to run: 45.050 msecs
Nov 22 18:30:41 mw2251 kernel: [53619.286816] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 45.035 msecs
Nov 22 18:30:41 mw2251 kernel: [53620.289217] INFO: NMI handler (ghes_notify_nmi) took too long to run: 45.069 msecs
Nov 22 18:30:41 mw2251 kernel: [53620.334641] INFO: NMI handler (ghes_notify_nmi) took too long to run: 45.401 msecs
Nov 22 18:30:41 mw2251 kernel: [53620.834701] INFO: NMI handler (ghes_notify_nmi) took too long to run: 45.407 msecs
Nov 22 18:30:43 mw2251 kernel: [53622.101085] sched: RT throttling activated
Nov 22 18:30:43 mw2251 kernel: [53622.146134] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1130.085 msecs
Nov 22 18:30:43 mw2251 kernel: [53622.787521] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR

and so on.
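
For anyone cross-checking the APEI and EDAC lines above: the `page` and `offset` fields in the EDAC message appear to be just the reported `physical_address` split at the page boundary. A minimal sketch, assuming 4 KiB pages (the usual x86-64 default; the log itself does not state the page size), with a hypothetical helper name:

```python
from typing import Tuple

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def split_physical_address(addr: int) -> Tuple[int, int]:
    """Return (page frame number, offset within the page) for a physical address."""
    page = addr >> PAGE_SHIFT
    offset = addr & ((1 << PAGE_SHIFT) - 1)
    return page, offset


# physical_address from the first corrected-error record in the log
page, offset = split_physical_address(0x00000007A3602400)
print(hex(page), hex(offset))  # 0x7a3602 0x400, matching page:0x7a3602 offset:0x400
```

The second record (`0x00000007a3603200`) falls in the neighbouring page, and both records report the same rank, bank and row on `fru_text: A1`.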

Event Timeline

So it appears that there is a memory issue:

/admin1-> racadm getsel
Record: 1
Date/Time: 11/16/2017 18:59:13
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 11/22/2017 03:33:04
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 3
Date/Time: 11/22/2017 03:33:04
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.

Record: 4
Date/Time: 11/22/2017 18:30:58
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 11/23/2017 19:14:20
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 6
Date/Time: 11/23/2017 19:14:20
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.

A self-dispatch should be created and sent out for repair/replacement. Since the replacement has to be done within 10 days of receiving the part, I'll sync up with @Papaul, but this may end up waiting until after next week's holiday week off.
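
The SEL output above can also be filtered mechanically when checking several hosts. A rough sketch, assuming the `racadm getsel` text has already been captured into a string and keeps the `Key: value` layout shown above; `parse_sel` is a hypothetical helper, not part of racadm:

```python
SAMPLE_SEL = """\
Record: 1
Date/Time: 11/16/2017 18:59:13
Source: system
Severity: Ok

Description: Log cleared.

Record: 3
Date/Time: 11/22/2017 03:33:04
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.

Record: 4
Date/Time: 11/22/2017 18:30:58
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.
"""


def parse_sel(text):
    """Parse 'Key: value' lines into one dict per SEL record."""
    records = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # blank lines and lines without a colon
        key, value = key.strip(), value.strip()
        if key == "Record":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


records = parse_sel(SAMPLE_SEL)
critical = [r for r in records if r.get("Severity") == "Critical"]
for r in critical:
    print(r["Record"], r["Date/Time"], r["Description"])
```

On the sample, this prints only record 3, the multi-bit error on DIMM_A1.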

RobH renamed this task from "mw2251 problems" to "mw2251 failed memory dimm". Dec 21 2017, 1:30 AM
RobH assigned this task to Papaul.
RobH moved this task from Backlog to Up Next on the ops-codfw board.

Dear Tshibamba, Papaul,

Your dispatch shipped on 1/2/2018 2:55 PM
Dispatch Number: 342210423
Work Order Number: SR958796798

Papaul added a subscriber: elukey.

Memory replacement complete.
Upgraded iDRAC from version 2.41 to 2.50.
Upgraded BIOS from version 2.3.4 to 2.6.0.

Server is back up @elukey

Did a scap pull, set the host to pooled=yes, and checked the apache metrics. Everything looks good! Closing the task; let's re-open it if the host gives problems again.