db1114 crashed due to memory issues (server under warranty)
Closed, Resolved, Public

Description

Looks like db1114 crashed due to memory issues.

Record:      13
Date/Time:   07/29/2019 23:22:26
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   07/29/2019 23:22:26
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   07/29/2019 23:24:34
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   07/29/2019 23:24:34
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   07/30/2019 17:56:21
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   07/30/2019 17:56:21
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------

Looks like DIMM_B7 is broken. Can we get Dell to send a new one?
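To make the "DIMM_B7 is broken" reading easier to check, here is a minimal Python sketch that tallies Critical events per DIMM from SEL text in the format pasted above. The function names (`parse_sel`, `critical_events_per_dimm`) are hypothetical helpers written for illustration, not part of racadm or any tooling used on this task, and the sample is copied from the six records above.

```python
import re
from collections import Counter

# The six SEL records from the task description, copied verbatim.
SAMPLE_SEL = """\
Record:      13
Date/Time:   07/29/2019 23:22:26
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   07/29/2019 23:22:26
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   07/29/2019 23:24:34
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   07/29/2019 23:24:34
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   07/30/2019 17:56:21
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   07/30/2019 17:56:21
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record (hypothetical helper)."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        # A blank line or a row of dashes ends the current record.
        if not line or set(line) == {"-"}:
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_events_per_dimm(records):
    """Count Critical-severity records per DIMM slot named in the description."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            counts[dimm] += 1
    return counts


if __name__ == "__main__":
    for dimm, n in critical_events_per_dimm(parse_sel(SAMPLE_SEL)).most_common():
        print(dimm, n)
```

On these records the tally is three Critical events for DIMM_B7 and one for DIMM_B3, which matches the suspicion about B7 while showing that B3 is not clean either.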

Event Timeline

@Cmjohnson - just following up on this one, since you were out on vacation last week when the task came in.

Thanks,
Willy

@Marostegui I see a potential issue with B3 as well. I will need to do a DIMM swap A -> B side and see if the errors stay with the DIMM or are the CPU. Let's schedule this for early next week, please. Tuesday 1400UTC?

Sounds good - thanks. I will have this host down before 14:00UTC on Tuesday so you can act on it.

Mentioned in SAL (#wikimedia-operations) [2019-08-20T05:59:13Z] <marostegui> Stop MySQL and shutdown db1114 for on-site maintenance - T229452

@Cmjohnson this host is now OFF, so you can act on it whenever you get to the DC.
Thanks!

Swapped DIMM B3 with A3 and B7 with A7. Powered on and cleared the log. Let's see if the errors return or change.

Thank you Chris!
I have started MySQL. Let's wait a few days before closing this, and if it happens again we can re-open!

The log is still clear, so I am closing this. If it happens again I will re-open.

/admin1-> racadm getsel
Record:      1
Date/Time:   08/20/2019 14:46:17
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------

dmesg is clear too
Thank you Chris!

@Cmjohnson This host crashed again, and it is complaining about B3 again, the slot whose DIMM was already swapped at T229452#5424883. So is it a mainboard issue? Can we get a new one for this host?

-------------------------------------------------------------------------------
Record:      2
Date/Time:   09/04/2019 18:51:01
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   09/04/2019 18:52:59
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      4
Date/Time:   09/04/2019 18:52:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   09/04/2019 18:52:59
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      6
Date/Time:   09/04/2019 18:52:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   09/04/2019 18:52:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
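The reasoning behind the swap test can be sketched in a few lines of Python: if errors reappear in the same slot after its DIMM was exchanged, the module is unlikely to be the cause; if they move to the partner slot, the module is. This is only an illustration of that logic. `classify_swap_test` and its verdict strings are hypothetical, and the inputs are taken from the swap and the SEL records reported in this task.

```python
def classify_swap_test(swaps, slots_before, slots_after):
    """Classify each originally failing slot after a DIMM swap (hypothetical helper).

    swaps:        dict mapping a slot to the slot it exchanged DIMMs with
    slots_before: slots that reported errors before the swap
    slots_after:  slots that reported errors after the swap
    """
    verdict = {}
    for slot in sorted(slots_before):
        partner = swaps.get(slot)
        if partner is None:
            verdict[slot] = "not swapped"
        elif slot in slots_after:
            # Same slot, different module: points away from the DIMM.
            verdict[slot] = "stays with slot"
        elif partner in slots_after:
            # The error moved together with the module.
            verdict[slot] = "follows DIMM"
        else:
            verdict[slot] = "no recurrence yet"
    return verdict


if __name__ == "__main__":
    # Swap performed on 2019-08-20: B3 <-> A3 and B7 <-> A7.
    swaps = {"DIMM_B3": "DIMM_A3", "DIMM_B7": "DIMM_A7"}
    before = {"DIMM_B3", "DIMM_B7"}   # slots named in the July SEL records
    after = {"DIMM_B3"}               # slot named in the 09/04/2019 records
    print(classify_swap_test(swaps, before, after))
```

With these inputs B3 is classified as "stays with slot", which is consistent with asking for a mainboard replacement rather than another DIMM, and B7 as "no recurrence yet".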

@Cmjohnson or @Jclark-ctr - can one of you guys check this out early next week? Thanks, Willy

Any ETA on when the request will be sent to Dell?
Thanks!

The ticket was created with Dell. I am waiting on their approval and then for the Dell tech to coordinate a day/time to swap the board out

we're on the schedule to get the board swapped for 9/26

Cool, I will have the host down for you tomorrow.
Thanks for the heads up

Mentioned in SAL (#wikimedia-operations) [2019-09-26T07:09:55Z] <marostegui> Stop mysql on db1114 for mainboard replacement - T229452

Mentioned in SAL (#wikimedia-operations) [2019-09-26T07:10:40Z] <marostegui> Power off db1114 for mainboard replacement T229452

@Cmjohnson db1114 is now off, so the mainboard can be replaced anytime.

@Marostegui the tech couldn’t make it in time yesterday and is now scheduled for
today, 1100-1300 local time.

Great! The host is OFF, so you can work on it anytime today.

I see this host is back up, so I guess the mainboard has been replaced?

@Marostegui yes, the board was replaced. Sorry about that; I left it to John and the task was not closed.