cp4026 correctable dimm error
Closed, Resolved · Public

Description

When @RobH was onsite he noticed the orange LCD error message on cp4026:

Record:      25
Date/Time:   12/22/2018 04:13:13
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   12/22/2018 05:09:48
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------

The LCD reads "Correctable memory error rate exceeded for DIMM_B3. Reseat memory."

Event Timeline

RobH triaged this task as Medium priority. (Jan 23 2019, 8:06 PM)
RobH created this task.
Restricted Application added a subscriber: Aklapper.

So, everything I see on wikitech indicates that I can take this single host offline at any time to reseat the DIMM. However, it is the week before All Hands, and many folks are travelling. When I asked in IRC, I got no answer on whether this is OK to do.

I'm assigning to @BBlack for feedback.

Brandon: What do we need to do to offline this host? Is it as simple as a clean shutdown and letting pybal depool it automatically, or should we follow directions on https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime ?

Once the host is shut down, I'll reseat DIMM B3, clear the SEL, and power it back up!

https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime is correct; the host just needs to be depooled (it will auto-depool on shutdown, but a manual depool is preferable).

See also T178011 for last time. Why didn't the icinga EDAC check catch this?
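
For context on the EDAC question, here is a minimal sketch of how the kernel's correctable-error counters can be read from sysfs. It is not the actual Icinga check, whose implementation this task does not describe, and it does not explain why that check stayed quiet. It assumes the standard Linux EDAC layout (`/sys/devices/system/edac/mc/mc*/ce_count`); the function names and the threshold are hypothetical.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {memory controller name: correctable error count}.

    Assumes the standard Linux EDAC sysfs layout, where each memory
    controller directory (mc0, mc1, ...) exposes a ce_count file.
    """
    counts = {}
    for mc_dir in sorted(Path(edac_root).glob("mc[0-9]*")):
        ce_file = mc_dir / "ce_count"
        if ce_file.is_file():
            counts[mc_dir.name] = int(ce_file.read_text().strip())
    return counts


def controllers_over_threshold(counts, threshold=0):
    """List controllers whose correctable error count exceeds threshold.

    The threshold is a hypothetical parameter, not a value from this task.
    """
    return [name for name, n in sorted(counts.items()) if n > threshold]


if __name__ == "__main__":
    counts = read_ce_counts()
    print(counts)
    print(controllers_over_threshold(counts))
```

Comparing these OS-side counters with the SEL entries on the same host is one way to see whether the kernel observed the errors that the management controller logged.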

Mentioned in SAL (#wikimedia-operations) [2019-02-06T19:13:39Z] <robh> taking cp4026 offline to flash firmware and reseat dimm for testing on T214516

robh@cp4026:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Apr-23-2017 | 23:39:37 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | May-03-2017 | 15:45:10 | PS Redundancy    | Power Supply             | Fully Redundant
3   | Sep-21-2017 | 03:57:01 | Status           | Power Supply             | Power Supply input lost (AC/DC)
4   | Sep-21-2017 | 03:57:01 | PS Redundancy    | Power Supply             | Redundancy Lost
5   | Sep-21-2017 | 04:28:01 | Status           | Power Supply             | Power Supply input lost (AC/DC)
6   | Sep-21-2017 | 04:28:11 | PS Redundancy    | Power Supply             | Fully Redundant
7   | Oct-05-2017 | 07:20:45 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = A0h ; OEM Event Data3 code = 01h
8   | Oct-05-2017 | 07:30:31 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = A0h ; OEM Event Data3 code = 01h
9   | Oct-12-2017 | 00:03:42 | Status           | Power Supply             | Power Supply input lost (AC/DC)
10  | Oct-12-2017 | 00:03:42 | PS Redundancy    | Power Supply             | Redundancy Lost
11  | Oct-12-2017 | 00:05:07 | Status           | Power Supply             | Power Supply input lost (AC/DC)
12  | Oct-12-2017 | 00:05:17 | Status           | Power Supply             | Power Supply input lost (AC/DC)
13  | Oct-17-2017 | 18:01:23 | Status           | Power Supply             | Power Supply input lost (AC/DC)
14  | Oct-17-2017 | 18:01:28 | PS Redundancy    | Power Supply             | Fully Redundant
15  | Oct-17-2017 | 21:32:43 | Intrusion        | Physical Security        | General Chassis Intrusion ; OEM Event Data2 code = 02h
16  | Oct-17-2017 | 21:32:48 | Intrusion        | Physical Security        | General Chassis Intrusion ; OEM Event Data2 code = 02h
17  | Dec-18-2018 | 18:02:36 | PS Redundancy    | Power Supply             | Redundancy Lost
18  | Dec-18-2018 | 18:02:36 | Status           | Power Supply             | Power Supply input lost (AC/DC)
19  | Dec-18-2018 | 18:05:26 | Status           | Power Supply             | Power Supply input lost (AC/DC)
20  | Dec-18-2018 | 18:05:36 | PS Redundancy    | Power Supply             | Fully Redundant
21  | Dec-18-2018 | 18:14:47 | Status           | Power Supply             | Power Supply input lost (AC/DC)
22  | Dec-18-2018 | 18:14:52 | PS Redundancy    | Power Supply             | Redundancy Lost
23  | Dec-18-2018 | 18:19:07 | Status           | Power Supply             | Power Supply input lost (AC/DC)
24  | Dec-18-2018 | 18:19:12 | PS Redundancy    | Power Supply             | Fully Redundant
25  | Dec-22-2018 | 04:13:13 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
26  | Dec-22-2018 | 05:09:48 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
robh@cp4026:~$
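
To pull just the memory events out of a log like the one above, the pipe-delimited `ipmi-sel` output can be parsed with a few lines of Python. This is a minimal sketch that assumes the six-column layout shown in this paste; `SelEvent`, `parse_sel`, and `memory_events` are hypothetical names, and the sample rows are copied from records 24 to 26.

```python
from __future__ import annotations

from typing import NamedTuple


class SelEvent(NamedTuple):
    record_id: int
    date: str
    time: str
    name: str
    type: str
    event: str


def parse_sel(text: str) -> list[SelEvent]:
    """Parse pipe-delimited ipmi-sel output into SelEvent rows.

    Lines that do not start with a numeric record ID (the column header,
    shell prompts, blank lines) are skipped.
    """
    events = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|", 5)]
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        events.append(SelEvent(int(fields[0]), *fields[1:]))
    return events


def memory_events(events: list[SelEvent]) -> list[SelEvent]:
    """Keep only the rows whose Type column is 'Memory'."""
    return [e for e in events if e.type == "Memory"]


SAMPLE = """\
ID  | Date        | Time     | Name             | Type                     | Event
24  | Dec-18-2018 | 18:19:12 | PS Redundancy    | Power Supply             | Fully Redundant
25  | Dec-22-2018 | 04:13:13 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
26  | Dec-22-2018 | 05:09:48 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = A1h ; OEM Event Data3 code = 04h
"""

if __name__ == "__main__":
    for e in memory_events(parse_sel(SAMPLE)):
        print(e.record_id, e.date, e.time, e.event)
```

Run against the full log, the same filter would also return records 7 and 8 from October 2017, which carry different OEM event data codes (A0h/01h) than the December 2018 entries (A1h/04h).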

Mentioned in SAL (#wikimedia-operations) [2019-02-06T19:50:53Z] <robh> updated firmware on cp4026 and re-seated (already well seated) dimm b3. errors have cleared for now T214516

Ok, things I did to fix this system so far:

  • set system and services/mgmt to maint mode for 2 hours
  • updated task with full SEL log output
  • powered off system
  • updated the BIOS firmware, which includes all memory handling and is required for opening Dell support cases
  • ensured dimm b3 was seated properly (it was)
  • wiped rac log and returned to service

If this has further errors, there is now this task history to track it.