
mw1280 crashed
Closed, Resolved, Public

Description

Server mw1280 mysteriously crashed on Mar 10 06:38:11

Event Timeline

jijiki triaged this task as Medium priority. Mar 11 2019, 7:03 AM
jijiki created this task.

I power cycled it from the idrac (it was totally stuck)

MoritzMuehlenhoff subscribed.

The server has broken memory (and warranty expires in a month):

Record:      43
Date/Time:   03/10/2019 07:53:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

@Cmjohnson Server has been depooled, ping us to pool it back, tx!

Record:      42
Date/Time:   03/10/2019 07:43:40
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

Record:      43
Date/Time:   03/10/2019 07:53:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

Swapped DIMM B1 with A1 and cleared the iDRAC log.

The error hasn't reappeared (yet), but I created a task with Dell. Worst case they push back; best case they send a DIMM. We're less than 30 days from end of warranty.

This server crashed again:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/13/2019 12:33:55
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      4
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
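For anyone triaging a paste like the one above, here is a minimal Python sketch that turns the records into dicts and lists the DIMM slots named in Critical entries. It assumes only the five field names visible in this task (Record, Date/Time, Source, Severity, Description) and the `DIMM_<letter><number>` naming seen here; the function names `parse_sel` and `critical_dimms` are made up for illustration and are not part of any Dell or Wikimedia tooling.

```python
import re
from datetime import datetime

# Field labels exactly as they appear in the records pasted in this task.
FIELD_RE = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")


def parse_sel(text):
    """Parse pasted event-log text into a list of dicts, one per record."""
    records = []
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if not match:
            continue  # separator lines, blank lines, surrounding comments
        key, value = match.group(1), match.group(2).strip()
        if key == "Record":
            # A "Record:" line starts a new entry.
            current = {"Record": int(value)}
            records.append(current)
        elif current is not None:
            if key == "Date/Time":
                value = datetime.strptime(value, "%m/%d/%Y %H:%M:%S")
            current[key] = value
    return records


def critical_dimms(records):
    """Return the sorted DIMM slots mentioned in Critical records."""
    slots = set()
    for rec in records:
        if rec.get("Severity") == "Critical":
            # Assumption: slots are written like DIMM_A1, DIMM_B1.
            slots.update(re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")))
    return sorted(slots)


# Three of the records from the 04/13/2019 crash, copied from this task.
SAMPLE = """\
Record:      4
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
"""

if __name__ == "__main__":
    print(critical_dimms(parse_sel(SAMPLE)))
```

Run against the sample, this reports both DIMM_A1 and DIMM_B1, matching the two slots involved in the earlier swap.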

I'm setting it to inactive until we know how the request to Dell goes. @Cmjohnson let us know when you know more.

On second thoughts, this is an API server, of which we have just a few right now.

I'll avoid depooling it if not strictly necessary.

@Joe Good news is I have already ordered the DIMM from the previous failure
and it's on-site. I can do this tomorrow afternoon (my time) if you can
depool it then.

@Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready

13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver

I replaced both DIMM A1 and B1 since I had previously ordered one for mw1264 that I did not need. Please add it back to the pool, but I have a feeling that a CPU may be bad. Let's leave this open for a week and see if any errors return.

17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
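If you want to pull the depool/repool actions out of a channel log, the following Python sketch parses lines shaped like the two `!log` entries in this task. The pattern is inferred from those two lines only, not from any documented conftool or logmsgbot format, and `parse_conftool_log` is a hypothetical helper name.

```python
import re

# Pattern inferred from the two !log lines in this task (an assumption,
# not a documented format).
LOG_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) <\+?(?P<bot>\w+)> !log "
    r"(?P<user>\w+)@(?P<host>\S+) conftool action : "
    r"(?P<action>[^;]+); selector: (?P<selector>.+)$"
)


def parse_conftool_log(line):
    """Return a dict describing the logged action, or None if no match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The action looks like "set/pooled=no": a verb, then key=value.
    verb, _, assignment = entry.pop("action").partition("/")
    key, _, value = assignment.partition("=")
    entry["verb"] = verb
    entry["changes"] = {key: value}
    # The selector is a comma-separated list of key=value pairs.
    entry["selector"] = dict(
        pair.split("=", 1) for pair in entry["selector"].split(",")
    )
    return entry


if __name__ == "__main__":
    depool = (
        "13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : "
        "set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,"
        "cluster=api_appserver"
    )
    print(parse_conftool_log(depool))
```

Both entries here target `name=mw1280.eqiad.wmnet`; the repool line simply uses a shorter selector without `dc` or `cluster`.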

@CDanis Thank you! I am resolving this for now.