decommission mw1163
Closed, Resolved | Public

Description

This task tracks approval and eventual decommissioning of mw1163. It crashed on 2017-09-05 due to a memory error. A check of the system event log (SEL) via the ILOM shows a large number of memory failures recurring in the same DIMM slots.

This system has been out of warranty since 2016-01-30. The last repair on this system seems to have been T84399, which replaced memory and the system board. The SEL below doesn't include those failures, as it was cleared once new hardware was installed in the system.

The SEL shows:

Record:      1
Date/Time:   09/11/2014 18:55:40
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   10/19/2014 21:56:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   10/19/2014 21:58:43
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   12/13/2014 07:15:50
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   12/13/2014 07:47:15
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/07/2016 05:09:54
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/07/2016 05:09:55
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   03/19/2016 02:22:56
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A2.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/19/2016 02:22:56
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A2.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/12/2016 22:29:29
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   04/12/2016 22:29:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   04/12/2016 22:29:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   04/12/2016 22:29:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   04/12/2016 22:29:29
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   04/12/2016 22:29:29
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   04/14/2016 01:30:37
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   04/14/2016 01:31:13
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   06/30/2016 20:32:41
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   06/30/2016 20:33:13
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   11/02/2016 19:18:09
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   11/03/2016 02:17:16
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   11/18/2016 03:01:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   11/18/2016 04:47:45
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   07/06/2017 01:27:45
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A2.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   07/06/2017 01:27:45
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A2.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------

So this system has memory failures in both slots A2 and B2. Requesting permission to remove this system from service and decommission it.
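
The per-slot tally behind that conclusion can be reproduced with a short script. The following is only a sketch: it assumes the SEL has been saved as plain text in the `Record:` / `Description:` layout shown above, and the names `parse_sel`, `dimm_error_counts`, and `SEL_SAMPLE` are hypothetical helpers, not part of any vendor tool. The sample holds three records copied from the log (23, 24, and 31).

```python
import re
from collections import Counter

# Three records copied from the SEL above (23, 24, 31); the full log
# can be pasted in their place.
SEL_SAMPLE = """\
Record:      23
Date/Time:   11/18/2016 04:47:45
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   07/06/2017 01:27:45
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A2.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   09/05/2017 21:03:28
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.
-------------------------------------------------------------------------------
"""

# Matches slot names such as DIMM_A2 or DIMM_B2 inside a description.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split the log on its dashed separator lines and return one dict per record."""
    records = []
    for block in re.split(r"^-{5,}\s*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimm_error_counts(records):
    """Count how many records mention each DIMM slot in their description."""
    counts = Counter()
    for rec in records:
        for slot in DIMM_RE.findall(rec.get("Description", "")):
            counts[slot] += 1
    return counts


if __name__ == "__main__":
    for slot, n in sorted(dimm_error_counts(parse_sel(SEL_SAMPLE)).items()):
        print(f"{slot}: {n} event(s)")
```

Run over the full 31-record log above instead of the sample, the same count should come to 14 events for DIMM_B2 and 6 for DIMM_A2, with both of the multi-bit (uncorrectable) errors landing on DIMM_A2.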

Event Timeline

This server is out of production and was hopefully de-racked long ago, as it was part of an older batch of servers. @Cmjohnson can confirm, but this should be decommissioned per T177387.