cp4009 hardware fault
Closed, Resolved (Public)

Description

When attempting "racadm serveraction powercycle" for a reinstall, the command returned "ERROR: Timeout while waiting for server to perform requested power action." and left the machine in a powered-off state. I wasn't able to revive it via e.g. racreset or other power commands like hardreset or powerup. The System Event Log shows the following (although it's hard to believe, given the system was running fine just before the powercycle). I suppose it's possible the powercycle itself caused a component to fail:

-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/12/2015 06:08:53
Source:      system
Severity:    Ok
Description: Unknown Event
-------------------------------------------------------------------------------
Record:      10
Date/Time:   03/12/2015 06:08:53
Source:      system
Severity:    Critical
Description: CPU 2 M23 VTT PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   03/12/2015 06:09:00
Source:      system
Severity:    Critical
Description: The system board fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   03/12/2015 05:14:12
Source:      system
Severity:    Critical
Description: The system board fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
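
For anyone who wants to sift a dump like the one above programmatically, here is a minimal sketch in Python. It assumes the log has already been captured as text in the layout shown (key/value lines separated by dashed rules) and that the dates are MM/DD/YYYY. The function names `parse_sel`, `critical_events` and `out_of_order` are made up for illustration and are not part of racadm or any Dell tooling.

```python
from datetime import datetime

# SEL text copied from the task description above.
SEL_TEXT = """\
-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/12/2015 06:08:53
Source:      system
Severity:    Ok
Description: Unknown Event
-------------------------------------------------------------------------------
Record:      10
Date/Time:   03/12/2015 06:08:53
Source:      system
Severity:    Critical
Description: CPU 2 M23 VTT PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   03/12/2015 06:09:00
Source:      system
Severity:    Critical
Description: The system board fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   03/12/2015 05:14:12
Source:      system
Severity:    Critical
Description: The system board fail-safe voltage is outside of range.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"-"}:
            # A dashed rule (or blank line) closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so "Date/Time" keeps its time part.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_events(records):
    """Return only the records whose severity is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]


def out_of_order(records):
    """Return record numbers stamped earlier than the record before them."""
    flagged, previous = [], None
    for r in records:
        # Assumption: timestamps are MM/DD/YYYY HH:MM:SS.
        ts = datetime.strptime(r["Date/Time"], "%m/%d/%Y %H:%M:%S")
        if previous is not None and ts < previous:
            flagged.append(int(r["Record"]))
        previous = ts
    return flagged


if __name__ == "__main__":
    records = parse_sel(SEL_TEXT)
    print([int(r["Record"]) for r in critical_events(records)])  # [10, 11, 12]
    print(out_of_order(records))  # [12]
```

Run against the log above, this flags records 10 through 12 as critical and shows that record 12 carries an earlier timestamp than record 11. The script only reports that ordering; it says nothing about why the timestamps differ.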

Event Timeline

BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added projects: acl*sre-team, ops-ulsfo.
BBlack subscribed.

Change 196157 had a related patch set uploaded (by BBlack):
cp1046/cp4009 -> jessie; depool cp4009 T92476

https://gerrit.wikimedia.org/r/196157

Change 196157 merged by BBlack:
cp1046/cp4009 -> jessie; depool cp4009 T92476

https://gerrit.wikimedia.org/r/196157

I went onsite today and removed all power from the system (full cord removal) and it didn't resolve the issue. (Was a last ditch effort, since I was on site anyhow.)

I need to get set up with the Dell self-dispatch anyhow, so I'll process this replacement later today/tomorrow. (I have to take their self-dispatch test.)

Work order submitted for a system board replacement. The task will be updated once it is approved.

I don't want to meet some dude at the datacenter. He called and it seems this was not a parts only dispatch, and he cannot just ship us the part.

Please redo this warranty board replacement to ship the item, without a tech, to 200 Paul.

Assigning to @Cmjohnson since he is the only one of us certified to self-dispatch parts.

Reasoning for parts-only dispatch: I don't feel like wasting half a day onsite waiting on, then working with, a random Dell-certified tech when it's simply a mainboard swap.

Since we don't need the system online ASAP, and waiting a day or two longer for a mainboard shipment seems fine, I'd rather just do parts only. I advised this at the time of the processing, but I think it wasn't relayed, since I didn't put it in the task (sorry about that).

Congratulations: Work Order WO6747229 was successfully submitted.

This order arrived at 200 Paul today.

Had this sent; assigning to Rob.

RobH added a parent task: T93640: re-deploy cp4009.

The mainboard has been replaced, and I'm taking the defective mainboard past a FedEx location to drop off for shipping.

  • updated service tag on mainboard to match
  • updated idrac license and tested console redirection via mgmt

IMG_20150323_140431095.jpg (4×2 px, 3 MB)

Dropped off mainboard at FedEx; proof of return.

Change 199135 had a related patch set uploaded (by BBlack):
repool cp4009 T92476

https://gerrit.wikimedia.org/r/199135

Server is reinstalled and repooled, and seems to be functional! Leaving this open in case others are not done tracking hw-related bits w/ Dell.