Server mw1280 mysteriously crashed on Mar 10 06:38:11
Description
The server has broken memory (and warranty expires in a month):
Record: 43
Date/Time: 03/10/2019 07:53:15
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:51:35Z] <jijiki> Depool mw1280 and mw2206 to hardware issues - T215415 T218006
Record: 42
Date/Time: 03/10/2019 07:43:40
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
Record: 43
Date/Time: 03/10/2019 07:53:15
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
The error hasn't appeared again (yet), but I created a task with Dell. Worst case, they push back; best case, they send a DIMM. We're less than 30 days from the end of warranty.
Mentioned in SAL (#wikimedia-operations) [2019-03-14T17:15:14Z] <jijiki> Pool mw1280 back - T218006
This server crashed again:
-------------------------------------------------------------------------------
Record: 2
Date/Time: 04/13/2019 12:33:55
Source: system
Severity: Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record: 3
Date/Time: 04/13/2019 12:36:59
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 04/13/2019 12:36:59
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record: 5
Date/Time: 04/13/2019 12:36:59
Source: system
Severity: Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record: 27
Date/Time: 04/13/2019 12:37:00
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record: 28
Date/Time: 04/13/2019 12:37:00
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
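For anyone triaging similar event-log dumps, here is a minimal sketch of how the records above could be tallied per DIMM. It assumes the log always uses the `Record:` / `Date/Time:` / `Source:` / `Severity:` / `Description:` field order shown in this task; the names `parse_sel` and `critical_dimms` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# One record = five labelled fields in a fixed order. The description runs
# until a dashed separator, the next "Record:" label, or the end of the text,
# so both one-field-per-line and flattened single-line dumps are accepted.
RECORD_RE = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<when>\S+\s+\S+)\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.+?)"
    r"(?=\s*(?:-{5,}|Record:|\Z))",
    re.DOTALL,
)


def parse_sel(text):
    """Return a list of dicts, one per event-log record found in text."""
    return [m.groupdict() for m in RECORD_RE.finditer(text)]


def critical_dimms(records):
    """Count Critical records per DIMM slot named in the description."""
    counts = Counter()
    for rec in records:
        if rec["severity"] == "Critical":
            for dimm in re.findall(r"DIMM_[A-Z]\d+", rec["description"]):
                counts[dimm] += 1
    return counts


# Sample text copied from the records in this task.
SAMPLE = """\
Record: 3
Date/Time: 04/13/2019 12:36:59
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 04/13/2019 12:36:59
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
Record: 27
Date/Time: 04/13/2019 12:37:00
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
"""

if __name__ == "__main__":
    for dimm, count in sorted(critical_dimms(parse_sel(SAMPLE)).items()):
        print(dimm, count)
```

Non-critical and Ok records are ignored on purpose, so a slot only shows up once the log flags it as Critical.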
I'm setting it to inactive until we know how the request to Dell goes. @Cmjohnson let us know when you know more.
On second thought, this is an API server, of which we have just a few right now.
I'll avoid depooling it if not strictly necessary.
@Joe Good news is I have already ordered the DIMM from the previous failure
and it's on-site. I can do this tomorrow afternoon (my time) if you can
depool it then.
@Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready.
13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver
I replaced both DIMM A1 and B1, since I had previously ordered one for mw1264 that I did not need. Please add it back to the pool, but I have a feeling that a CPU may be bad. Let's leave this open for a week and see if any errors return.
17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet