db1145 crashed - memory issues
Closed, Resolved (Public)

Description

db1145 rebooted itself:

Times in UTC

[09:50:33]  <+icinga-wm>	PROBLEM - Host db1145 is DOWN: PING CRITICAL - Packet loss = 100%

Startup message:

UEFI0079: One or more uncorrectable Memory errors occurred in the previous
boot.
Check the System Event Log (SEL) to identify the non-functional DIMM, and then
replace the DIMM.


Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for Lifecycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

This host is a backup source and it is under warranty

Event Timeline

More HW logs - we've got a broken DIMM apparently

Record:      1
Date/Time:   03/30/2020 23:27:03
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   07/17/2020 09:48:30
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B2.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   07/17/2020 09:48:30
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   07/17/2020 09:48:30
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   07/17/2020 09:48:30
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/17/2020 09:48:31
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   07/17/2020 09:48:39
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B2.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   07/17/2020 09:48:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   07/17/2020 09:48:40
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   07/17/2020 09:48:40
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   07/17/2020 09:48:47
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B2.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   07/17/2020 09:48:47
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   07/17/2020 09:48:47
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   07/17/2020 09:48:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   07/17/2020 09:51:41
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   07/17/2020 09:51:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
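A log like the one above is mostly "OEM diagnostic event" noise, so it can help to filter it down to the Critical records and count which DIMM slots they name. The following is a minimal sketch, assuming the SEL text is available as a string in the format shown above; the function names `parse_sel` and `critical_dimm_counts` are hypothetical helpers, not part of any vendor tool, and the sample is a three-record excerpt of the log in this task.

```python
import re
from collections import Counter

# Records are separated by a line of dashes, as in the log above.
RECORD_SEP = re.compile(r"^-{5,}\s*$", re.MULTILINE)
# The five field names that appear in each record.
FIELD = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")
# Slot names such as DIMM_B2 (assumed pattern: letter followed by digits).
DIMM = re.compile(r"\bDIMM_[A-Z]\d+\b")


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records = []
    for chunk in RECORD_SEP.split(text):
        fields = {}
        for line in chunk.splitlines():
            match = FIELD.match(line.strip())
            if match:
                fields[match.group(1)] = match.group(2).strip()
        if fields:
            records.append(fields)
    return records


def critical_dimm_counts(records):
    """Count how many Critical records mention each DIMM slot."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") == "Critical":
            for dimm in DIMM.findall(rec.get("Description", "")):
                counts[dimm] += 1
    return counts


# Excerpt of records 1, 6 and 15 from the log above.
SEL_SAMPLE = """\
Record:      1
Date/Time:   03/30/2020 23:27:03
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   07/17/2020 09:48:31
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   07/17/2020 09:48:48
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for slot, hits in critical_dimm_counts(parse_sel(SEL_SAMPLE)).items():
        print(f"{slot}: {hits} critical record(s)")
```

Run against the full log above, this kind of filter should leave only DIMM_B2 (records 2, 6, 7, 11 and 29), plus the CPU 2 machine check in record 15, which names no DIMM and is therefore not counted.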

Change 613609 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1145: Disable notifications

https://gerrit.wikimedia.org/r/613609

Change 613609 merged by Marostegui:
[operations/puppet@production] db1145: Disable notifications

https://gerrit.wikimedia.org/r/613609

I have hit F1 to continue the boot so we can see the OS.
The memory on the OS looks ok:

root@db1145:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            502           1         501           0           0         499
Swap:             7           0           7
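The check above can also be scripted. This is only a sketch, assuming the `free -g` output has been captured as text; `mem_total_gib` and `looks_complete` are hypothetical helper names, and `min_expected_gib` is a threshold the caller must supply, since the task does not state the installed capacity. Note that a passing result only shows that the OS still sees the full capacity; it says nothing about whether the DIMM is healthy.

```python
def mem_total_gib(free_output):
    """Return the 'total' column of the Mem: row from `free -g` output."""
    for line in free_output.splitlines():
        parts = line.split()
        if parts and parts[0] == "Mem:":
            return int(parts[1])
    raise ValueError("no 'Mem:' row found")


def looks_complete(free_output, min_expected_gib):
    """True if the reported total is at least the caller-supplied threshold."""
    return mem_total_gib(free_output) >= min_expected_gib


# The output pasted in this task.
FREE_SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:            502           1         501           0           0         499
Swap:             7           0           7
"""

if __name__ == "__main__":
    # 500 is an illustrative threshold, not a value taken from the task.
    print(mem_total_gib(FREE_SAMPLE), looks_complete(FREE_SAMPLE, 500))
```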

I won't start MySQL until @jcrespo gives the green light, but we should probably get an RMA started for that particular DIMM, @wiki_willy.

Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

I've started a restore process from yesterday's backups.

The service is back up, restored from backups, so the backup service will continue uninterrupted over the weekend. @wiki_willy, let us know what the next step is, as Marostegui mentioned earlier; I can put the server down again next week if necessary.

@jcrespo, remember that I disabled notifications via puppet. I guess we should leave them disabled until the maintenance is done?

Yes, I agree with that option. Thanks for creating the ticket and doing the initial triage!

wiki_willy added a subscriber: Jclark-ctr.

@Jclark-ctr - can you check this one out when you're onsite next? It was only installed a few months ago, so we should be able to RMA the part pretty easily. Thanks, Willy

@Jclark-ctr TSR report is attached.

Confirmed: Service Request 1030121866 was successfully submitted.

@Marostegui The replacement DIMM has arrived. Please reach out to me to schedule downtime. I am available for the next 2 hours, and I will be on site tomorrow at 9:30am EST.

Thanks @Jclark-ctr - I am going to depool this host so it is ready for when you arrive at the DC.

As a backup source, this host doesn't require depooling, but we need to check with @jcrespo when it can be powered off.

Mentioned in SAL (#wikimedia-operations) [2020-07-22T07:40:52Z] <jynus> stop db1145 for hw maintenance T258249

Backups were taken from db1145 today and the host was put down. Please ping here when maintenance is complete.

Change 615430 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Revert "db1145: Disable notifications"

https://gerrit.wikimedia.org/r/615430

Change 615430 merged by Jcrespo:
[operations/puppet@production] Revert "db1145: Disable notifications"

https://gerrit.wikimedia.org/r/615430

Everything looking good. Thanks, @Jclark-ctr !