db1150 crashed: DIMM_A8 memory issues
Closed, Resolved (Public)

Description

Still under investigation:

[15:21:00]  <+icinga-wm> PROBLEM - Host db1150 is DOWN: PING CRITICAL - Packet loss = 100%

This is a backup source for s4 and s5

Event Timeline

Marostegui triaged this task as Medium priority. Mar 21 2023, 2:24 PM
Marostegui updated the task description.

racadm getsel shows RAM issues

-------------------------------------------------------------------------------
Record:      135
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A8.
-------------------------------------------------------------------------------
Record:      136
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      137
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      138
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      139
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A8.
-------------------------------------------------------------------------------
Record:      140
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      141
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      142
Date/Time:   03/21/2023 14:17:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      143
Date/Time:   03/21/2023 14:17:09
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A8.
-------------------------------------------------------------------------------

@wiki_willy we'd need help with the above. We probably need to get the DIMM swapped with another DIMM to see if the problem is the DIMM itself or the main board.
This host may still be under warranty until August.

Change 901601 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] monitoring: Disable notifications for db1150 after crash

https://gerrit.wikimedia.org/r/901601

jcrespo renamed this task from "db1150 unexpectedly down" to "db1150 crashed: DIMM_A8 memory issues". Mar 21 2023, 2:46 PM

Change 901601 merged by Jcrespo:

[operations/puppet@production] monitoring: Disable notifications for db1150 after crash

https://gerrit.wikimedia.org/r/901601

Sadly, it doesn't power-cycle from the management interface, so it will require a "manual" power drain and reboot from DC-Ops when possible.

Cmjohnson subscribed.

DIMM has been ordered through Dell

The host will be left unused and with notifications disabled so it can be serviced at any time (no rush). Thank you.

Change 901624 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Setup db1145 as a backup source replacement for db1150

https://gerrit.wikimedia.org/r/901624

FYI: Logged but forgot to add the ticket number:

jynus: running from cumin1001: transfer.py --type=decompress dbprov1003.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s5.2023-03-20--04-00-30.tar.gz db1145.eqiad.wmnet:/srv/sqldata.s5
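
The snapshot file name in the command above appears to encode the section and the snapshot time. The following is a minimal sketch for pulling those out, for example to confirm which snapshot was restored. It assumes the `snapshot.<section>.<YYYY-MM-DD--HH-MM-SS>.tar.gz` layout inferred from this single file name; `parse_snapshot_name` is a hypothetical helper, not part of transfer.py.

```python
import re
from datetime import datetime

# Layout inferred from the one file name in this task; treat it as an assumption.
SNAPSHOT_RE = re.compile(
    r"^snapshot\.(?P<section>[^.]+)\."
    r"(?P<ts>\d{4}-\d{2}-\d{2}--\d{2}-\d{2}-\d{2})\.tar\.gz$"
)


def parse_snapshot_name(path):
    """Return (section, snapshot datetime) for a snapshot path or file name."""
    name = path.rsplit("/", 1)[-1]
    match = SNAPSHOT_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized snapshot name: {name!r}")
    taken_at = datetime.strptime(match.group("ts"), "%Y-%m-%d--%H-%M-%S")
    return match.group("section"), taken_at
```

Names that do not fit the assumed layout raise `ValueError` instead of being guessed at.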

Change 901624 merged by Jcrespo:

[operations/puppet@production] dbbackups: Setup db1145 as a backup source replacement for db1150

https://gerrit.wikimedia.org/r/901624

@Marostegui @jynus I apologize for the delay with this DIMM. Dell had a question that needed a response, which delayed the shipment. It should go out today.

No worries. As I said, this is not urgent. If you can, please power drain it the next time you are available (next week is fine); that would help us.

The DIMM has been replaced. I updated the iDRAC and BIOS while it was offline.

Icinga downtime and Alertmanager silence (ID=3b38157a-7d2c-4b9f-ad17-b2b2c6932dcb) set by jynus@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: reprovisioning after maintenance

db1150.eqiad.wmnet

Change 904764 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] database-backups: Provision db1150 with s4 and s3 sections

https://gerrit.wikimedia.org/r/904764

Change 904764 merged by Jcrespo:

[operations/puppet@production] database-backups: Provision db1150 with s4 and s3 sections

https://gerrit.wikimedia.org/r/904764