
restbase1025 reported DIMM issues in getsel
Closed, Resolved (Public)

Description

Today restbase1025 was down due to the following:

-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/12/2020 01:28:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/12/2020 06:23:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/12/2020 06:23:58
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
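
For anyone triaging similar output, here is a minimal sketch (not an official tool) that splits SEL text in the format pasted above into records and counts how often each DIMM slot is named. The function names `parse_sel` and `dimm_counts` are made up for illustration, and the parser assumes the exact `Key: value` layout and dashed separators shown in this task.

```python
import re
from collections import Counter

# Sample copied from the SEL records pasted in this task.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/12/2020 01:28:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/12/2020 06:23:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/12/2020 06:23:58
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text on dashed separator lines; return one dict per record."""
    records = []
    for chunk in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.strip().splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimm_counts(records):
    """Count how many records mention each DIMM slot (e.g. 'B2')."""
    counts = Counter()
    for record in records:
        for slot in re.findall(r"DIMM_([A-Z]\d+)", record.get("Description", "")):
            counts[slot] += 1
    return counts


records = parse_sel(SEL_TEXT)
print(dimm_counts(records))
```

On the three records above, every entry points at slot B2, which is what makes that slot the first suspect.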

Power-cycled the host to see if it comes back up, but the DIMM in slot B2 should probably be replaced.

Event Timeline

Caught during boot:

UEFI0106: One or more memory correctable training errors have occurred on
memory slot: B2.
Remove input power to the system, reseat the DIMM module and restart the
system. If the correctable errors persist, replace the faulty memory module
identified in the message.

UEFI0106: One or more memory correctable training errors have occurred on
memory slot: B1.
Remove input power to the system, reseat the DIMM module and restart the
system. If the correctable errors persist, replace the faulty memory module
identified in the message.

elukey@puppetmaster1001:~$ sudo confctl depool --hostname restbase1025.eqiad.wmnet
eqiad/restbase/restbase/restbase1025.eqiad.wmnet: pooled changed yes => no
eqiad/restbase/restbase-backend/restbase1025.eqiad.wmnet: pooled changed yes => no
eqiad/restbase/restbase-ssl/restbase1025.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: name=restbase1025.eqiad.wmnet

elukey@puppetmaster1001:~$ sudo confctl select name=restbase1025.eqiad.wmnet get
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-backend"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-ssl"}
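
A quick way to double-check the depool from a script is to parse the `confctl select ... get` output. The sketch below assumes the one-JSON-object-per-line format pasted above; it does not use any conftool Python API, and the helper names `pooled_states` and `fully_depooled` are made up for illustration.

```python
import json

# Output copied from the `confctl select ... get` call in this task.
CONFCTL_OUTPUT = """\
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-backend"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-ssl"}
"""


def pooled_states(output, hostname):
    """Map each service name to the host's 'pooled' value."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(pair.split("=", 1) for pair in obj["tags"].split(","))
        states[tags["service"]] = obj[hostname]["pooled"]
    return states


def fully_depooled(output, hostname):
    """True only if at least one service is listed and all are pooled=no."""
    states = pooled_states(output, hostname)
    return bool(states) and all(state == "no" for state in states.values())


states = pooled_states(CONFCTL_OUTPUT, "restbase1025.eqiad.wmnet")
print(states, fully_depooled(CONFCTL_OUTPUT, "restbase1025.eqiad.wmnet"))
```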

Host is powered down and depooled, ready for maintenance (it was repeatedly failing and power-cycling).

@Eevans adding you to this task as an FYI :)

I swapped the DIMMs between the A side and the B side to see whether the error disappears, stays on the same slot, or follows the DIMM.
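
The reasoning behind the swap test can be written down as a small decision function. This is only a sketch of the logic; the function name `diagnose_swap` and the slot mapping in the usage line (B2 exchanged with A2) are assumptions for illustration, since the exact slot pairs swapped are not recorded here.

```python
def diagnose_swap(slot_before, slot_after, moved_to):
    """Interpret a DIMM swap test.

    slot_before: slot that reported errors before the swap, e.g. "B2".
    slot_after:  slot that reports errors after the swap, or None if
                 no errors have recurred.
    moved_to:    dict mapping each original slot to the slot its DIMM
                 was moved into.
    """
    if slot_after is None:
        # Nothing recurred: reseating may have helped, or it is too early to tell.
        return "no recurrence"
    if slot_after == slot_before:
        # Same slot, different DIMM: suspect the slot, channel, or board.
        return "slot suspect"
    if moved_to.get(slot_before) == slot_after:
        # The error moved with the module: suspect the DIMM itself.
        return "DIMM suspect"
    return "inconclusive"


# Hypothetical mapping: the DIMM from B2 went to A2 and vice versa.
print(diagnose_swap("B2", "A2", {"B2": "A2", "A2": "B2"}))
```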

I cleared the log; this is a paste of the error:

Record: 17
Date/Time: 04/12/2020 06:30:30
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

The host seems up, but the following is listed in dmesg:

[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: event severity: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  fru_text: A2
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   section_type: memory error
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   physical_address: 0x0000000ff81437c0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 0 rank: 1 bank: 1 row: 58250 column: 896
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000
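
To keep an eye on these entries without reading dmesg by hand, the fields of interest can be pulled out with a regular expression. The sketch below assumes the APEI line layout shown above; the name `summarize_apei` and the choice of fields are illustrative only.

```python
import re

# A subset of the dmesg lines pasted in this task.
DMESG_SAMPLE = """\
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: event severity: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  fru_text: A2
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   physical_address: 0x0000000ff81437c0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
"""

# Match only the handful of fields we care about; other lines are ignored.
FIELD_RE = re.compile(
    r"\[Hardware Error\]:\s*"
    r"(fru_text|event severity|physical_address|error_type):\s*(.+)$"
)


def summarize_apei(dmesg_text):
    """Return a dict of selected fields from APEI hardware error lines."""
    summary = {}
    for line in dmesg_text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            summary[match.group(1)] = match.group(2).strip()
    return summary


summary = summarize_apei(DMESG_SAMPLE)
print(summary["fru_text"], summary["event severity"], summary["error_type"])
```

Here `fru_text` is the field that names the slot (A2), so it is the one to compare against the earlier SEL entries for B2.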

Let's keep the host monitored; there might still be a problem with the RAM.

MoritzMuehlenhoff added a subscriber: hnowlan.

The error came back, but on A2 this time, so it followed the DIMM after the swap: the DIMM is bad. It is under warranty; I will order a new DIMM and update this task with the Dell ticket number.

Ticket opened with Dell, SR1023451111

@elukey The DIMM is on site. I am on site right now; ping me on IRC if you are available to change it.

@elukey @hnowlan @Eevans My time in the data center is restricted and I am leaving now. I will be on site Thursday 2pm-4pm UTC; please ping me on IRC if you're able to assist on Thursday.