ms-be1033 down and not powering up
Closed, ResolvedPublic

Description

This host went down about 8h ago and I can't power it back up from iLO. @Cmjohnson thoughts on this? The host is in production and we can get by without it for 2, at most 3, days. If the host will be offline for longer, I'll need to remove it from swift though.

</>hpiLO-> power on
status=0
status_tag=COMMAND COMPLETED
Wed Feb 13 07:52:45 2019

Server powering on .......

</>hpiLO-> power
status=0
status_tag=COMMAND COMPLETED
Wed Feb 13 07:52:48 2019

power: server power is currently: Off

</>hpiLO-> power
status=0
status_tag=COMMAND COMPLETED
Wed Feb 13 07:52:57 2019

power: server power is currently: Off

</>hpiLO-> power
status=0
status_tag=COMMAND COMPLETED
Wed Feb 13 07:53:44 2019

power: server power is currently: Off

</>hpiLO->
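
For anyone who wants to check a saved iLO session for this symptom, here is a minimal sketch in Python. It only parses text shaped like the transcript above; the function names and the sample string are hypothetical, and it does not talk to iLO itself.

```python
import re
from typing import List

# Matches the status line printed by the iLO `power` command, as seen in
# the transcript above. Other iLO versions may word this differently.
POWER_LINE = re.compile(r"power: server power is currently:\s*(\w+)", re.IGNORECASE)


def power_states(transcript: str) -> List[str]:
    """Return every reported power state, in order of appearance."""
    return [m.group(1).capitalize() for m in POWER_LINE.finditer(transcript)]


def power_on_took_effect(transcript: str) -> bool:
    """True only if the most recent reported state is On."""
    states = power_states(transcript)
    return bool(states) and states[-1] == "On"


# Hypothetical sample, trimmed from the session pasted in this task.
sample = """
</>hpiLO-> power on
status=0
status_tag=COMMAND COMPLETED
Server powering on .......
</>hpiLO-> power
power: server power is currently: Off
</>hpiLO-> power
power: server power is currently: Off
"""

print(power_states(sample))
print(power_on_took_effect(sample))
```

Note that `status=0` and `COMMAND COMPLETED` only mean that iLO accepted the command, so the sketch ignores them and looks at the reported power state instead.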

Event Timeline

fgiunchedi triaged this task as High priority.

I physically cannot turn the server on either. I tried pulling the power and waiting 10 minutes, but I just get a flashing green indicator at the power button. I am able to access the iLO's web interface, but that does not tell me anything. In the past, a motherboard replacement was needed. I will update the ticket with HPE's response.

A ticket has been opened with HPE

Case ID: 5336351338
Case title: Failed Mother Board
Severity 3-Normal
Product serial number: MXQ70601RN
Product number: 719061-B21
Submitted: 2/13/2019 1:03:08 PM
Last updated: 2/13/2019 1:03:08 PM
Source: Web
Case status: Received by HPE

Thanks @Cmjohnson! Did HP provide an ETA for shipment/resolution?

@fgiunchedi HPE did not believe me that a motherboard swap is needed. They asked that I do a bunch of troubleshooting first. Below are the steps they asked me to do. I have replied that their wild goose chase did not work and asked them to please send me a new board. Fingers crossed.

Action Plan :
1> Clear NVRAM
Restore to manufacture settings / clearing NVRAM.

These steps will clear the NVRAM.
a. Shut down the server and disconnect all the power supplies. (done)
b. Move the switch S6 (of system maintenance switch) to ON position. (I moved S6 to ON and powered on)
c. Power on the server and check for the message that the NVRAM has been cleared (same result)
d. Shut down the server and move S6 to the original position which is OFF position. (done)
e. Power on the server.

2> If the issue persists, we need to power on the server with minimal config.
Power on the server with 1 DIMM per processor and disconnect all the external devices. (done, taken down to 1 DIMM, 1 CPU and disconnected everything that was not necessary to get to POST.)

Once the server powers ON
A> Update the server firmware using latest SPP. (The server did not power on)

The server did eventually power up, so it looks like I am eating some crow on this one. Re-connected everything and put back to normal operating standard. Booting into the OS now.

CDanis added a subscriber: CDanis. Feb 21 2019, 7:02 PM

Thanks @Cmjohnson for digging into it! AFAICS the host came back fine, I guess we'll wait and see if it happens again. In the meantime I'm fine with stalling/resolving this task if that works for you.

Cmjohnson closed this task as Resolved. Feb 22 2019, 5:23 PM

Resolving, feel free to reopen if the problem returns.