mw1280 crashed
Closed, Resolved · Public

Description

Server mw1280 mysteriously crashed on Mar 10 06:38:11

Event Timeline

jijiki triaged this task as Normal priority. · Mar 11 2019, 7:03 AM
jijiki created this task.
Restricted Application added a subscriber: Aklapper. · Mar 11 2019, 7:03 AM
jijiki updated the task description. · Mar 11 2019, 7:03 AM

I power cycled it from the idrac (it was totally stuck)

The server has broken memory (and warranty expires in a month):

Record:      43
Date/Time:   03/10/2019 07:53:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

@MoritzMuehlenhoff can you please depool the server?

@Cmjohnson Server has been depooled, ping us to pool it back, tx!

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:51:35Z] <jijiki> Depool mw1280 and mw2206 to hardware issues - T215415 T218006

Record:      42
Date/Time:   03/10/2019 07:43:40
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

Record:      43
Date/Time:   03/10/2019 07:53:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

Swapped DIMM B1 with A1 and cleared the idrac log.

The error hasn't appeared again (yet), but I created a task with Dell. Worst case they push back; best case they send a DIMM. We're less than 30 days from the end of warranty.

Mentioned in SAL (#wikimedia-operations) [2019-03-14T17:15:14Z] <jijiki> Pool mw1280 back - T218006

This server crashed again:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/13/2019 12:33:55
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      4
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
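For triage, SEL output like the records above can be summarized per DIMM slot. The following is a minimal Python sketch, assuming the log text keeps the `Record:` / `Date/Time:` / `Source:` / `Severity:` / `Description:` layout pasted in this task. `parse_sel` and `critical_dimms` are hypothetical helper names, not part of any Dell or Wikimedia tooling, and the sample text is copied from records 4, 5 and 27 above.

```python
import re
from collections import defaultdict

# Matches "Key: value" lines such as "Severity:    Critical".
FIELD_RE = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")
# Matches DIMM slot names such as DIMM_A1 or DIMM_B1.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record (hypothetical helper)."""
    records = []
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if not match:
            continue  # separator and blank lines are ignored
        key, value = match.groups()
        if key == "Record":
            # A "Record:" line starts a new entry.
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value.strip()
    return records


def critical_dimms(records):
    """Map each DIMM slot to the record numbers of its Critical events."""
    hits = defaultdict(list)
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        for slot in DIMM_RE.findall(rec.get("Description", "")):
            hits[slot].append(int(rec["Record"]))
    return dict(hits)


# Sample copied from the SEL records in this task.
SAMPLE = """\
Record:      4
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
"""

print(critical_dimms(parse_sel(SAMPLE)))
```

With the sample above, both DIMM_A1 and DIMM_B1 show up with one Critical record each, while the CPU machine check record is skipped because its description names no DIMM slot.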
Joe added a subscriber: Joe. · Apr 15 2019, 6:20 AM

I'm setting it to inactive until we know how the request to Dell goes. @Cmjohnson let us know when you know more.

Joe added a comment. · Apr 15 2019, 6:23 AM

On second thoughts, this is an API server, of which we have just a few right now.

I'll avoid depooling it if not strictly necessary.

@Joe Good news is I have already ordered the DIMM from the previous failure
and it's on-site. I can do this tomorrow afternoon (my time) if you can
depool it then.

CDanis added a subscriber: CDanis. · Apr 15 2019, 2:29 PM

@Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready.

13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver

I replaced both DIMM A1 and B1, since I had previously ordered one for mw1264 that I did not need. Please add it back to the pool, but I have a feeling that a CPU may be bad. Let's leave this open for a week and see if any errors return.

17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet
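The two conftool log lines in this task share a `set/<key>=<value>; selector: k=v,...` shape. A minimal Python sketch for pulling the pooled state and selector out of such a line, assuming that shape holds for other entries as well; `parse_conftool_log` is a hypothetical helper and is not part of conftool itself.

```python
import re

# Pattern derived only from the two log lines quoted in this task (an assumption).
LOG_RE = re.compile(
    r"conftool action : set/(?P<key>\w+)=(?P<value>\w+); selector: (?P<selector>\S+)"
)


def parse_conftool_log(line):
    """Return the action and selector from a logmsgbot line, or None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    # The selector is a comma-separated list of name=value pairs.
    selector = dict(
        pair.split("=", 1) for pair in match.group("selector").split(",")
    )
    return {
        "key": match.group("key"),
        "value": match.group("value"),
        "selector": selector,
    }


depool = parse_conftool_log(
    "13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : "
    "set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver"
)
repool = parse_conftool_log(
    "17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : "
    "set/pooled=yes; selector: name=mw1280.eqiad.wmnet"
)
print(depool["value"], repool["value"])
```

Note that the depool selector narrows by `dc` and `cluster` while the repool selector uses only `name`; both target the same host.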

CDanis reassigned this task from Cmjohnson to jijiki. · Apr 16 2019, 9:04 PM
CDanis added a subscriber: Cmjohnson.
jijiki closed this task as Resolved. · Apr 16 2019, 9:05 PM

@CDanis Thank you! I am resolving this for now.