cp1047 down
Closed, Resolved (Public)

Description

cp1047 - Host Status:

DOWN

(for 1d 1h 42m 3s)

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp1047&nostatusheader

It has been down since yesterday per Icinga, and I could not find any comments, tickets, or SAL entries about it.

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn subscribed.
Dzahn set Security to None.
Phoenix ROM BIOS PLUS Version 1.10 1.6.0
Copyright 1985-1988 Phoenix Technologies Ltd.
Copyright 1990-2012 Dell Inc.
All Rights Reserved

Dell System PowerEdge R620
www.dell.com

The amount of system memory has changed.


Two 2.70 GHz Eight-core Processors, Bus Speed:8.00 GT/s, L2/L3 Cache:2 MB/20 MB
System running at 2.70 GHz



System Memory Size: 192.0 GB, System Memory Speed: 1600 MHz, Voltage: 1.35V

Error: Memory initialization warning detected.
MEMBIST Memory Test failure DIMM A5




Dell Inc. PERC S110 Controller BIOS (Version: 3.0.0-0139)
Press <CTRL-R> to Configure.



* BIOS defaults restored. *
 2-Non-RAID,372GB, Normal





Press Ctrl-S to enter Configuration Menu


CPLD version : 103




Management Engine Mode                : Active
Strike the F1 key to continue, F2 to run the system setup program

Error: Memory initialization warning detected.
MEMBIST Memory Test failure DIMM A5

shutting back down. please check RAM ^

Dzahn triaged this task as Medium priority. Jan 30 2015, 1:35 AM
Dzahn added projects: ops-eqiad, acl*sre-team.
gerritbot subscribed.

Change 187821 had a related patch set uploaded (by BBlack):
cp1047 -> out for hardware T88045

https://gerrit.wikimedia.org/r/187821

Patch-For-Review

Change 187821 merged by BBlack:
cp1047 -> out for hardware T88045

https://gerrit.wikimedia.org/r/187821

Adding the iDRAC system event log entries:

Record: 1
Date/Time: 06/06/2013 02:33:43
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 08/01/2013 19:22:37
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A5.

Record: 3
Date/Time: 08/01/2013 19:22:39
Source: system
Severity: Critical

Description: Persistent correctable memory error limit reached for a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.

Record: 4
Date/Time: 08/01/2013 19:23:01
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A5.
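
The log entries above follow a regular `Key: value` layout, so they can be tallied with a short script rather than read by eye. The following is a minimal sketch, assuming the text layout pasted in this task; `parse_sel` and `count_named_slots` are hypothetical helper names, not part of racadm or any Dell tooling, and the sample text is copied from records 2 to 4 above.

```python
import re
from collections import Counter

# Sample copied from the system event log entries in this task (records 2-4).
SAMPLE_SEL = """\
Record: 2
Date/Time: 08/01/2013 19:22:37
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A5.

Record: 3
Date/Time: 08/01/2013 19:22:39
Source: system
Severity: Critical

Description: Persistent correctable memory error limit reached for a memory device at location(s) DIMM1,DIMM2,DIMM3,DIMM4,DIMM5,DIMM6,DIMM7,DIMM8.

Record: 4
Date/Time: 08/01/2013 19:23:01
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A5.
"""


def parse_sel(text):
    """Split event log text into a list of dicts, one per 'Record:' block."""
    records = []
    current = None
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or text without a 'Key:' prefix
        key, value = key.strip(), value.strip()
        if key == "Record":
            current = {"Record": int(value)}
            records.append(current)
        elif current is not None:
            current[key] = value
    return records


def count_named_slots(records):
    """Count how often each 'DIMM_<slot>' name appears in the descriptions."""
    counts = Counter()
    for rec in records:
        counts.update(re.findall(r"DIMM_([A-Z]\d+)", rec.get("Description", "")))
    return counts


if __name__ == "__main__":
    for slot, hits in count_named_slots(parse_sel(SAMPLE_SEL)).items():
        print(slot, hits)
```

Only descriptions that name a slot in the `DIMM_A5` form are counted; the generic `DIMM1,...,DIMM8` location list in record 3 is deliberately ignored because it does not identify a single module.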

Swapped DIMM A5 with B5 and cleared the system event log. Will monitor to see what errors, if any, return.
/admin1-> racadm getsel
Record: 1
Date/Time: 03/09/2015 17:57:42
Source: system
Severity: Ok

Description: Log cleared.

Cmjohnson subscribed.

Checked for errors again today and did not see any. Rebooted and nothing shows during POST. Assigning to bblack to add the host back.

Just rebooted into BIOS setup to enable HT, then rebooted for PXE, and saw:

Error: Memory initialization warning detected.

MEMBIST Memory Test failure DIMM B5

Then it halts to ask for F1 to continue, F2 for setup, etc...

I think this means the stick moved from A5 to B5 is actually bad.
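
The reasoning here is the usual swap test: if the failure follows the module to its new slot, the module is suspect; if it stays with the original slot, the slot (or board) is suspect. A minimal sketch of that rule follows; `diagnose_swap` is a hypothetical name used only for illustration, and the rule assumes exactly two modules were exchanged and nothing else changed.

```python
def diagnose_swap(failing_slot_before, swapped_with, failing_slot_after):
    """Interpret a two-slot DIMM swap test.

    failing_slot_before: slot reported as failing before the swap
    swapped_with:        slot whose module was exchanged with it
    failing_slot_after:  slot reported as failing after the swap,
                         or None if no failure has been reported
    """
    if failing_slot_after is None:
        # No error yet; the fault may be intermittent.
        return "inconclusive"
    if failing_slot_after == swapped_with:
        # The error moved along with the module.
        return "module"
    if failing_slot_after == failing_slot_before:
        # The error stayed put even though the module changed.
        return "slot"
    return "inconclusive"


# The sequence in this task: A5 failed, A5 and B5 were swapped, then B5 failed.
print(diagnose_swap("A5", "B5", "B5"))  # module
```

A clean reboot right after the swap maps to "inconclusive" rather than "fixed", which matches what happened here: the first POST after the swap showed nothing, and the failure only reappeared on a later reboot.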

I agree the stick is bad. Contacting Dell; I will update the Phab task with shipping information.

Congratulations: Work Order WO6747226 was successfully submitted.


Dispatch Information
Dispatch Number: 176706686
Work Order Number: WO6747226
Waybill Number: 503031400297
Service Tag: C75YFX1
PO/Reference: cp1047

Dell sent a new DIMM but that was also bad.

  • Replaced DIMM B5 with the new DIMM and rebooted; the same error appeared during POST:

MEMBIST Memory Test failure DIMM B5

  • Moved the new DIMM from B5 to B1 and rebooted; during POST:

MEMBIST Memory Test failure DIMM B1

Returning the bad DIMM via either USPS or FedEx. Documenting the tracking numbers for future reference, as our shipments have been lost in the past.
FedEx tracking # 9611918 2393026 47599924 or USPS 9202 3946 5301 2426 1088 14

Received the new DIMM and replaced the bad one that I had inserted in slot B1. Rebooted and no errors show.

Sending the other bad DIMM back. Tracking Numbers are

Fedex
9611918 2393026 47648516

USPS
9202 3946 5301 2426 1574 09