cp4032 memory error
Closed, Resolved (Public)

Description

While working on other things at ulsfo, I noticed that cp4032 has a memory error LED illuminated/alerting on the front panel LCD.

Logging into the drac, the service event log has:
-------------------------------------------------------------------------------
Record:      6
Date/Time:   12/01/2017 23:46:56
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   12/01/2017 23:54:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B2.
-------------------------------------------------------------------------------
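For reference, a minimal sketch of how records in the format shown above could be parsed to pick out DIMM-related events. It assumes the log is available as plain text with dashed separator lines and `Key: value` fields exactly as pasted here; the function names `parse_sel` and `dimm_errors` are hypothetical, and this is not the check used in production.

```python
import re
from typing import Dict, List, Tuple

# Assumed layout: records are delimited by lines of dashes and contain
# "Key: value" fields, as in the excerpt pasted in this task.
SEPARATOR = re.compile(r"^-{5,}\s*$")
FIELD = re.compile(r"^([^:]+):\s*(.*)$")
DIMM = re.compile(r"\bDIMM_[A-Z]\d+\b")


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split the log text into one dict of fields per record."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1).strip()] = match.group(2).strip()
    if current:
        records.append(current)
    return records


def dimm_errors(records: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (record number, severity, DIMM slot) for records naming a DIMM."""
    hits = []
    for rec in records:
        match = DIMM.search(rec.get("Description", ""))
        if match:
            hits.append((rec.get("Record", ""), rec.get("Severity", ""), match.group(0)))
    return hits
```

Run against the two records above, `dimm_errors(parse_sel(log_text))` would report DIMM_B2 once as Non-Critical and once as Critical.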

Event Timeline

RobH triaged this task as Medium priority.Dec 18 2017, 7:46 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper.Dec 18 2017, 7:46 PM
RobH added a comment.Dec 18 2017, 7:51 PM

I've created another task, T183177, to track the fact that this error wasn't shown in Icinga. I've also depooled the system and will reboot it into the Dell ePSA to try to get an error code there.

System will remain offline during this work.

Volans added a subscriber: Volans.Dec 19 2017, 9:06 AM

@RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning.

ema added a subscriber: ema.Dec 19 2017, 1:42 PM

> @RobH FYI I've ack'ed the Icinga alert of the host down and set it to downtime until Fri UTC morning.

I've just ack'ed all related strongswan alerts (cp and kafka hosts) too.

RobH added a comment.Dec 19 2017, 4:21 PM

Error codes from ePSA test:

Service Tag : 3ND3KH2
Error Code : 2000-0125
Validation : 107826

Dell info says that code means: The IPMI system event log is full for various reasons or logging has stopped because too many ECC errors have occurred.

RobH added a comment.Dec 19 2017, 4:29 PM

Yeah, it turns up nothing but the error codes for the actual failed dimm. It doesn't matter much, just helps for the part replacement.

SR958387090 is the self-dispatch part # for the replacement DIMM; it is shipping to ulsfo.

RobH added a comment.Dec 21 2017, 1:21 AM

FedEx 417953907699 delivered today. Emailed support@ul to notify them I'll pick it up tomorrow and install in cp4032.

RobH added a comment.Dec 21 2017, 7:48 PM

I've put in the new memory DIMM and started memory tests, which are still running (and will take a while).

I'll check on the system remotely later today. For when it's ready to go, I've pinged @BBlack asking for updated directions on returning a cp system to service after being offline for multiple days.

RobH reassigned this task from RobH to BBlack.Dec 21 2017, 9:57 PM

OK, this is ready to go back online. The new memory tested fine.

Mentioned in SAL (#wikimedia-operations) [2017-12-22T13:17:14Z] <bblack> repooling cp4032 - T183176

BBlack closed this task as Resolved.Dec 22 2017, 1:20 PM
BBlack moved this task from Triage to Hardware on the Traffic board.