restbase1025 reported DIMM issues in getsel
Closed, Resolved (Public)

Assigned To

Authored By

	elukey
	Apr 12 2020, 6:26 AM

Description

Today restbase1025 was down due to the following:

-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/12/2020 01:28:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/12/2020 06:23:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/12/2020 06:23:58
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------

Power-cycled the host to see if it recovers, but the DIMM should probably be replaced.
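For triage, the records above can be tallied per DIMM slot and severity. The following is only a minimal sketch, assuming the log is available as plain text in exactly the layout pasted above; `parse_sel` and `dimm_counts` are hypothetical helper names, not part of any existing tool.

```python
import re
from collections import Counter

# Sample input: the three records pasted in the task description.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      8
Date/Time:   04/12/2020 01:28:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   04/12/2020 06:23:57
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/12/2020 06:23:58
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split a dump in the layout above into one dict per record."""
    records = []
    # Records are separated by lines made only of dashes.
    for chunk in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.strip().splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimm_counts(records):
    """Count records per (DIMM slot, severity) pair."""
    counts = Counter()
    for rec in records:
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            counts[(dimm, rec.get("Severity", "unknown"))] += 1
    return counts


if __name__ == "__main__":
    for (dimm, severity), n in sorted(dimm_counts(parse_sel(SEL_TEXT)).items()):
        print(f"{dimm} {severity}: {n}")
```

A slot that keeps accumulating Critical records, as DIMM_B2 does here, is the natural first candidate for a reseat or swap test.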

Event Timeline

Caught during boot:

UEFI0106: One or more memory correctable training errors have occurred on
memory slot: B2.
Remove input power to the system, reseat the DIMM module and restart the
system. If the correctable errors persist, replace the faulty memory module
identified in the message.

UEFI0106: One or more memory correctable training errors have occurred on
memory slot: B1.
Remove input power to the system, reseat the DIMM module and restart the
system. If the correctable errors persist, replace the faulty memory module
identified in the message.

Mentioned in SAL (#wikimedia-operations) [2020-04-12T06:32:11Z] <elukey> powerdown restbase1025 - T250027

elukey@puppetmaster1001:~$ sudo confctl depool --hostname restbase1025.eqiad.wmnet
eqiad/restbase/restbase/restbase1025.eqiad.wmnet: pooled changed yes => no
eqiad/restbase/restbase-backend/restbase1025.eqiad.wmnet: pooled changed yes => no
eqiad/restbase/restbase-ssl/restbase1025.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: name=restbase1025.eqiad.wmnet

elukey@puppetmaster1001:~$ sudo confctl select name=restbase1025.eqiad.wmnet get
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-backend"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-ssl"}
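The check above can also be done mechanically before powering a host down. This is a rough sketch that assumes the `confctl select ... get` output is one JSON object per line, as in the paste above; `pooled_states` and `is_fully_depooled` are hypothetical names used only for illustration.

```python
import json

# Sample input: the output pasted above, one JSON object per service.
CONFCTL_OUTPUT = """\
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-backend"}
{"restbase1025.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=restbase,service=restbase-ssl"}
"""


def pooled_states(output, hostname):
    """Map each service name to the host's pooled state ("yes"/"no"/...)."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if hostname not in obj:
            continue
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(tag.split("=", 1) for tag in obj["tags"].split(","))
        states[tags["service"]] = obj[hostname]["pooled"]
    return states


def is_fully_depooled(output, hostname):
    """True only if the host appears in the output and every service is 'no'."""
    states = pooled_states(output, hostname)
    return bool(states) and all(state == "no" for state in states.values())


if __name__ == "__main__":
    host = "restbase1025.eqiad.wmnet"
    print(pooled_states(CONFCTL_OUTPUT, host))
    print("fully depooled:", is_fully_depooled(CONFCTL_OUTPUT, host))
```

Treating an empty result as "not depooled" is a deliberate choice here: a typo in the hostname should not look like a successful depool.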

The host is powered down and depooled, ready for maintenance (it kept failing and power-cycling).

@Eevans adding you to this task as FYI :)

I swapped the DIMMs between the A side and the B side to see whether the error disappears, stays on the same slot, or follows the DIMM.

I cleared the log; this is a paste of the error:

Record: 17
Date/Time: 04/12/2020 06:30:30
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Apr 13 2020, 5:11 PM

The host seems up, but the following is listed in dmesg:

[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: event severity: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  fru_text: A2
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   section_type: memory error
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   physical_address: 0x0000000ff81437c0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   node: 0 card: 1 module: 0 rank: 1 bank: 1 row: 58250 column: 896
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000

Let's keep the host monitored; there might still be a problem with the RAM bank.
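To keep an eye on the host, the corrected-error events can be pulled out of dmesg and grouped by slot. The sketch below assumes the `[Hardware Error]` line layout shown in the paste above; `parse_hardware_errors` is a hypothetical helper, and the set of fields it keeps is an arbitrary choice for illustration.

```python
import re

# Sample input: a subset of the dmesg lines pasted above.
DMESG = """\
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]: event severity: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  Error 0, type: corrected
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:  fru_text: A2
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   section_type: memory error
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   physical_address: 0x0000000ff81437c0
[Mon Apr 13 17:54:20 2020] {1}[Hardware Error]:   error_type: 2, single-bit ECC
"""

# Timestamp in brackets, event sequence number in braces, then the message.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<body>.*)$"
)

# Fields kept per event (an illustrative selection, not an exhaustive one).
WANTED = {"event severity", "fru_text", "section_type", "physical_address", "error_type"}


def parse_hardware_errors(text):
    """Group '[Hardware Error]' lines by sequence number into dicts."""
    events = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        event = events.setdefault(match["seq"], {"timestamp": match["ts"]})
        key, sep, value = match["body"].partition(": ")
        if sep and key in WANTED:
            event[key] = value.strip()
    return events


if __name__ == "__main__":
    for seq, event in parse_hardware_errors(DMESG).items():
        print(seq, event.get("fru_text"), event.get("event severity"), event.get("error_type"))
```

Here `fru_text` carries the slot label (A2), which is the field worth comparing against the slot named in the earlier log records.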

MoritzMuehlenhoff triaged this task as Medium priority. Apr 14 2020, 6:51 AM

MoritzMuehlenhoff added a subscriber: hnowlan.

The error came back, but on A2 this time, so it followed the DIMM: bad DIMM. It is under warranty; I will order a new DIMM and update the task with the Dell ticket number.

Ticket opened with Dell, SR1023451111

wiki_willy assigned this task to Cmjohnson. Apr 27 2020, 7:29 PM

@elukey The DIMM is on site. Ping me on IRC if you are available to change it; I am on site right now.

@hnowlan @Eevans can you sync with @Jclark-ctr ?

@elukey @hnowlan @Eevans My time in the data center is restricted and I am leaving now. I will be on site Thursday 2pm-4pm UTC; please ping me on IRC if you're able to assist on Thursday.

Replaced the failed DIMM.

hnowlan closed this task as Resolved. Apr 30 2020, 11:48 AM
