cp4032 memory error
Closed, Resolved (Public)

Description

While working on other things at ulsfo, I noticed that cp4032 has a memory error LED illuminated/alerting on the front panel LCD.

After logging into the drac, I see the service event log has:
-------------------------------------------------------------------------------
Record:      6
Date/Time:   12/01/2017 23:46:56
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/01/2017 23:54:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
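For anyone wanting to pull the affected DIMM out of a log like this programmatically, here is a minimal sketch. It assumes the log text is laid out exactly as pasted above (dashed separator lines, `Key: value` fields); the function names `parse_sel` and `dimms_with_errors` are hypothetical and not part of any Dell or Icinga tooling.

```python
import re

# Sample text copied from the event log quoted in this task.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      6
Date/Time:   12/01/2017 23:46:56
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/01/2017 23:54:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split the log on dashed separator lines and return one dict per record."""
    records = []
    for block in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.splitlines():
            # Split on the first colon only, so timestamps keep their colons.
            key, sep, value = line.partition(":")
            if sep and value.strip():
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimms_with_errors(records):
    """Return the sorted, de-duplicated DIMM slot names found in descriptions."""
    dimms = set()
    for rec in records:
        dimms.update(re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")))
    return sorted(dimms)


records = parse_sel(SEL_TEXT)
for rec in records:
    print(rec["Record"], rec["Severity"], rec["Description"])
print(dimms_with_errors(records))
```

The slot-name pattern `DIMM_[A-Z]\d+` is a guess based on the single name seen here (`DIMM_B2`); other hardware may label slots differently.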

Event Timeline

RobH triaged this task as Medium priority.

I've created another task, T183177, to track the fact that this error wasn't shown in Icinga. I've also depooled the system, and will be rebooting it into the Dell ePSA to attempt to get an error code there.

System will remain offline during this work.

@RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning.

I've just ack'ed all related strongswan alerts (cp and kafka hosts) too.

Error codes from ePSA test:

Service Tag : 3ND3KH2
Error Code : 2000-0125
Validation : 107826

Dell info says that code means: The IPMI system event log is full for various reasons or logging has stopped because too many ECC errors have occurred.

Yeah, it turns up nothing but the error codes for the actual failed dimm. It doesn't matter much, just helps for the part replacement.

SR958387090 is the self-dispatch part # for the replacement DIMM; it is shipping to ulsfo.

FedEx 417953907699 delivered today. Emailed support@ul to notify them I'll pick it up tomorrow and install in cp4032.

I've put in the new memory DIMM and started memory tests, which are still running (and will take a while).

I'll check on the system remotely later today. For when it's ready to go, I've pinged @BBlack asking for updated directions on returning a cp system to service after it has been offline for multiple days.

OK, this is ready to go back online. The new memory tested fine.

BBlack moved this task from Backlog to Hardware on the Traffic board.