cp5012 memory errors
Closed, Resolved, Public

Description

cp5012 crashed on 2020-04-28 at 01:28:56. Prior to the crash, several memory errors were logged:

/var/log/kern.log
Apr 28 01:25:15 cp5012 kernel: [2378102.616593] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
Apr 28 01:25:15 cp5012 kernel: [2378102.616595] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
Apr 28 01:25:15 cp5012 kernel: [2378102.616597] {12}[Hardware Error]: event severity: corrected
Apr 28 01:25:15 cp5012 kernel: [2378102.616599] {12}[Hardware Error]:  Error 0, type: corrected
Apr 28 01:25:15 cp5012 kernel: [2378102.616599] {12}[Hardware Error]:  fru_text: A3
Apr 28 01:25:15 cp5012 kernel: [2378102.616601] {12}[Hardware Error]:   section_type: memory error
Apr 28 01:25:15 cp5012 kernel: [2378102.616602] {12}[Hardware Error]:   error_status: 0x0000000000000400
Apr 28 01:25:15 cp5012 kernel: [2378102.616603] {12}[Hardware Error]:   physical_address: 0x0000003d00607fc0
Apr 28 01:25:15 cp5012 kernel: [2378102.616606] {12}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 1 row: 59392 column: 504
Apr 28 01:25:15 cp5012 kernel: [2378102.616607] {12}[Hardware Error]:   error_type: 2, single-bit ECC
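
For triage, the useful fields in these APEI records are the DIMM slot (fru_text), the physical address, and the error type. A minimal Python sketch for pulling them out of kern.log lines, grouped by the {N} record id, could look like this. The helper name parse_apei and the choice of fields are illustrative assumptions, not an existing tool:

```python
import re
from collections import defaultdict

# Hypothetical helper, not an existing tool: only these fields are extracted.
FIELD_RE = re.compile(
    r"\{(\d+)\}\[Hardware Error\]:\s*"
    r"(fru_text|section_type|physical_address|error_type):\s*(.+)$"
)


def parse_apei(lines):
    """Group selected APEI hardware-error fields by their {N} record id."""
    records = defaultdict(dict)
    for line in lines:
        match = FIELD_RE.search(line.rstrip("\n"))
        if match:
            rec_id, key, value = match.groups()
            records[int(rec_id)][key] = value.strip()
    return dict(records)


# Sample lines copied from the kern.log excerpt above.
sample = [
    "Apr 28 01:25:15 cp5012 kernel: [2378102.616599] {12}[Hardware Error]:  fru_text: A3",
    "Apr 28 01:25:15 cp5012 kernel: [2378102.616601] {12}[Hardware Error]:   section_type: memory error",
    "Apr 28 01:25:15 cp5012 kernel: [2378102.616603] {12}[Hardware Error]:   physical_address: 0x0000003d00607fc0",
    "Apr 28 01:25:15 cp5012 kernel: [2378102.616607] {12}[Hardware Error]:   error_type: 2, single-bit ECC",
]

for rec_id, fields in parse_apei(sample).items():
    print(rec_id, fields)
```

Lines without one of the listed keys (for example the "event severity" line) are skipped by this sketch.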

Checking the SEL it looks like cp5012 has been suffering RAM issues for a while now:

racadm getsel
-------------------------------------------------------------------------------
Record:      19
Date/Time:   03/25/2020 09:04:32
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   03/25/2020 09:05:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   04/28/2020 01:21:00
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   04/28/2020 01:23:47
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
[...]
-------------------------------------------------------------------------------
Record:      24
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------

Event Timeline

Restricted Application added subscribers: Liuxinyu970226, Aklapper. Apr 28 2020, 5:19 AM
Vgutierrez triaged this task as Medium priority. Apr 28 2020, 5:19 AM
Vgutierrez moved this task from Triage to Hardware on the Traffic board.
Vgutierrez updated the task description. Apr 28 2020, 7:26 AM

I checked Netbox, and the server appears to be under warranty until October of this year.

Do we have an ETA on this one? :)

wiki_willy reassigned this task from Cmjohnson to RobH. Fri, May 8, 5:48 PM
wiki_willy added subscribers: RobH, Cmjohnson.

@Vgutierrez - my apologies, I initially mistook this for a host in eqiad instead of eqsin, so I assigned it to the wrong person last week. Re-assigning now to @RobH, who might be able to work with our 3rd party vendor on troubleshooting/fixing this. Thanks, Willy

RobH added a parent task: Unknown Object (Task). Wed, May 20, 4:48 PM
RobH added a comment. Wed, May 20, 5:32 PM

OK, for memory tests we need to clear the SEL, so I'm dumping its output here for easy review later (it's still stored on the server, but not readable without a data dump and sorting):

admin1-> racadm getsel
Record:      1
Date/Time:   10/31/2017 12:57:09
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/05/2017 16:57:34
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   10/12/2018 14:10:38
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   10/12/2018 14:10:38
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   10/12/2018 20:37:53
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   10/12/2018 20:38:03
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   10/14/2018 14:09:33
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   10/14/2018 14:09:33
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   10/14/2018 19:01:53
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   10/14/2018 19:02:03
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   02/07/2020 02:27:44
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   02/07/2020 02:27:44
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   02/07/2020 02:37:50
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   02/07/2020 02:37:54
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   02/07/2020 02:43:40
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   02/07/2020 02:43:40
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   02/07/2020 03:00:55
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   02/07/2020 03:01:05
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   03/25/2020 09:04:32
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   03/25/2020 09:05:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   04/28/2020 01:21:00
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   04/28/2020 01:23:47
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
/admin1->
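
To review a dump like the one above without reading every record, the entries can be tallied per DIMM. The following is a minimal Python sketch, assuming the `racadm getsel` text has been saved as a string. The helper names parse_sel and dimm_counts are hypothetical, and the sample holds three records copied from the dump:

```python
import re
from collections import Counter

# Field names as they appear in the `racadm getsel` output above.
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split SEL output into a list of field dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):  # separator closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def dimm_counts(records):
    """Count how many SEL records mention each DIMM slot."""
    counts = Counter()
    for rec in records:
        counts.update(DIMM_RE.findall(rec.get("Description", "")))
    return counts


sample_sel = """\
Record:      20
Date/Time:   03/25/2020 09:05:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A5.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   04/28/2020 01:23:47
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/28/2020 01:28:10
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.
-------------------------------------------------------------------------------
"""

for dimm, count in dimm_counts(parse_sel(sample_sel)).most_common():
    print(dimm, count)
```

Counting the records in the full dump above the same way gives three entries each for DIMM_A3 and DIMM_B1, two for DIMM_A5, and one each for DIMM_A1 and DIMM_B3.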

Mentioned in SAL (#wikimedia-operations) [2020-05-20T19:27:11Z] <robh> cp5012 still offline for mem tests, "fast" testing complete without errors and extended testing in progress. system firmware was updated before testing. T251219

RobH reassigned this task from RobH to Vgutierrez. Wed, May 20, 9:12 PM

So this ran the full suite of Dell tests, including extended memory testing, without failure. I did update the firmware before testing though.

@Vgutierrez Can we return this to service and load to see if the firmware update fixed the issue?

Mentioned in SAL (#wikimedia-operations) [2020-05-21T06:04:35Z] <vgutierrez> pool cp5012 - T251219

Vgutierrez changed the task status from Open to Stalled. Thu, May 21, 6:05 AM

@RobH done. Let's see how it goes, thanks!

Vgutierrez closed this task as Resolved. Fri, May 22, 9:49 AM

cp5012 seems stable; I'll reopen this task if I see any sign of memory issues.

Thanks @RobH