ms-be2030 spontaneous reboot
Closed, Resolved (Public)

Description

ms-be2030 was completely unresponsive over the console this past weekend. I forced a power cycle earlier this morning, and the host spontaneously rebooted about 9 hours later:

08:35 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 36.63 ms
17:08 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100%
17:10 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 39.22 ms
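
To check the "about 9h" figure against the alerts above, the gaps between consecutive Icinga events can be computed with a short script. This is only a minimal sketch and was not part of the investigation: the helper names `parse_events` and `intervals` are hypothetical, and it assumes all events fall on the same day, since the IRC lines carry no date.

```python
import re
from datetime import datetime

# Matches both "08:35 -icinga-wm:..." and "[13:56] <icinga-wm> ..." styles.
EVENT_RE = re.compile(
    r"^\[?(?P<time>\d{2}:\d{2})\]?\s+.*?"
    r"(?:PROBLEM|RECOVERY) - Host (?P<host>\S+) is (?P<state>UP|DOWN)"
)

SAMPLE = [
    "08:35 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 36.63 ms",
    "17:08 -icinga-wm:#wikimedia-operations- PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100%",
    "17:10 -icinga-wm:#wikimedia-operations- RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 39.22 ms",
]


def parse_events(lines):
    """Return (time, host, state) tuples for every host alert line."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            when = datetime.strptime(m["time"], "%H:%M")
            events.append((when, m["host"], m["state"]))
    return events


def intervals(events):
    """Yield (state, duration) between consecutive events (same day assumed)."""
    for (t0, _, state), (t1, _, _) in zip(events, events[1:]):
        yield state, t1 - t0


for state, duration in intervals(parse_events(SAMPLE)):
    print(f"{state:4} for {duration}")
```

For the three lines above this gives 8 hours 33 minutes of uptime after the forced power cycle, followed by roughly 2 minutes of downtime during the spontaneous reboot.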

Nothing from ipmi-sel, though a spontaneous reboot is obviously troubling. @Papaul, have you seen this before?

Event Timeline

@fgiunchedi I have never seen this on HP servers. I also checked the log; it just says "server reset" with no other reason. Since the server is under warranty, I can open a case with HP and submit the AHS logs; they can give me more information after reviewing them.
If they don't find anything, they will ask me to upgrade the server firmware.

@fgiunchedi I opened a case with HP. For now they said no engineer is available to help me; I will receive a call back in an hour. Please see below for the case information.

Dear Mr Papaul,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5332560753
Status: Case is generated and in Progress

Product description: HPE ProLiant DL380 Gen9 12LFF Configure-to-order Server
Product number: 719061-B21
Serial number: MXQ70601RV
Subject: DL380 Gen9 - Server rebooted - Fault analysis required

The engineer called at 2:20 PM CDT; the call ended at 2:37 PM CDT. He went over the AHS logs and didn't find any error or issue. His recommendation, as I mentioned, was to upgrade the BIOS and the controller firmware. @fgiunchedi let me know.

Hello Sir,
Greetings from HPE,

As discussed over the call, please refer to the details below.

Link to BIOS : https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1009087943&swItemId=MTX_ea86c71918094b7781bff4b124&swEnvOid=4184

The BIOS is OS independent, hence do not panic on seeing the OS listed as Windows.

You can update the BIOS directly from the iLO console:

Download the file
Extract it in a folder "BIOS 2018"
Go to ILO Home page >> Administrator >> Firmware >> Browse >> BIOS 2018 >> and select the largest file in the directory

Link to Controllers Firmware : https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=7274906&swItemId=MTX_94d23918a511422fa46c3f49f7&swEnvOid=4184

The controller firmware is tested on RHEL, hence you might face some issues while installing it; you can always give us a call back for instant support.
Kindly refer to the installation instructions tab on the link for your reference.

Thanks & Regards,
Prakash Singh
Technical Solutions Consultant, HPE Enhanced Support
Hewlett Packard Enterprise | CSC, Bangalore
Email: prakash.tri.singh@hpe.com
You can reach our support staff at:

@Papaul thanks! Indeed, it looks like the server just reset. I'm skeptical that upgrading the BIOS/controller will help, but there's no harm in trying either. Let me know when you are online later today and I'll shut the host down.

Console unresponsive, nothing in show /system1/log1/, no ping. As agreed on IRC I'm leaving it down for now until @fgiunchedi comes back from lunch to allow further investigation.

Also, power commands from the iLO don't seem to respond in a timely manner, or work at all:

</>hpiLO-> power off
                    
status=0
status_tag=COMMAND COMPLETED
Tue Sep 18 12:09:16 2018
                        


Server powering off .......



</>hpiLO-> power
                
status=0
status_tag=COMMAND COMPLETED
Tue Sep 18 12:09:19 2018
                        


power: server power is currently: On


</>hpiLO-> power
                
status=0
status_tag=COMMAND COMPLETED
Tue Sep 18 12:09:33 2018
                        


power: server power is currently: Resetting


</>hpiLO-> power
                
status=0
status_tag=COMMAND COMPLETED
Tue Sep 18 12:09:43 2018
                        


power: server power is currently: Resetting

</>hpiLO-> power
                
status=0
status_tag=COMMAND COMPLETED
Tue Sep 18 12:10:28 2018
                        


power: server power is currently: On
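
The timing in the transcript above is easier to see once the timestamps and reported power states are pulled out. The following is a rough sketch under stated assumptions, not a tool used in this task: `power_timeline` is a hypothetical helper, the transcript is an abbreviated copy of the session above, and the timestamp parsing assumes an English (C) locale.

```python
import re
from datetime import datetime

# iLO prints a timestamp such as "Tue Sep 18 12:09:19 2018" with each result.
STAMP_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$"
)
STATE_RE = re.compile(r"power: server power is currently: (\w+)")

# Abbreviated copy of the console session (blank padding lines removed).
TRANSCRIPT = """\
</>hpiLO-> power off
status=0
status_tag=COMMAND COMPLETED
Tue Sep 18 12:09:16 2018
Server powering off .......
</>hpiLO-> power
Tue Sep 18 12:09:19 2018
power: server power is currently: On
</>hpiLO-> power
Tue Sep 18 12:09:33 2018
power: server power is currently: Resetting
</>hpiLO-> power
Tue Sep 18 12:09:43 2018
power: server power is currently: Resetting
</>hpiLO-> power
Tue Sep 18 12:10:28 2018
power: server power is currently: On
"""


def power_timeline(transcript):
    """Pair each reported power state with the timestamp printed before it."""
    timeline = []
    stamp = None
    for raw in transcript.splitlines():
        line = raw.strip()
        if STAMP_RE.match(line):
            stamp = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        m = STATE_RE.search(line)
        if m and stamp is not None:
            timeline.append((stamp, m.group(1)))
    return timeline


for when, state in power_timeline(TRANSCRIPT):
    print(when.time(), state)
```

Read this way, the host still reports On three seconds after `power off`, moves to Resetting, and is back On 69 seconds after the first status query, without ever reporting an Off state.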

The host is back up now, clearly not stable enough for production though.

@fgiunchedi I'm getting a "The last firmware update attempt was not successful. Ready for the next update." error when trying to update the BIOS, so I emailed the HP engineer and am waiting for a response.

Since using the iLO to upgrade the BIOS is not working, I will have to make a new Service Pack disk to upgrade the firmware on the server. The SP disk that I have on site right now is old.

@fgiunchedi HW diagnostics didn't report any problem. I upgraded all firmware on the server. We can observe the server all this week, and if we have the same problem again, I will contact HP.

The server reset again. This time I have something in the log file; see below.

[Screenshot: Selection_038.png]

Called HP at 17:48 CDT; the call ended at 18:25 CDT. The engineer went over the new log from today again and didn't see any internal problem on the server. The only thing he asked me to do is to change the server's power settings from Dynamic power saving mode to Static High performance mode. According to him, the error reported in the integrated management log (Critical Temperature Threshold Exceeded) is an external issue.

MoritzMuehlenhoff triaged this task as Medium priority.

@fgiunchedi Since 9-20-18, after making the power settings change, I have been monitoring the server; so far it has been up with no resets. Is it possible to put the server back in production so it can get some load? I will keep monitoring it and report back to HP.

Thanks

Thanks @Papaul, the server has been in production the whole time and indeed no power resets so far.

@fgiunchedi thank you, I will leave this task open until the end of the week.

I checked the server logs again; everything looks good. Resolving this task.

I'm reopening the task; the server went down again today:

[13:56] <icinga-wm> PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100%

I was looking at system logs and while I was doing that the server came back up:

[14:12] <icinga-wm> RECOVERY - Host ms-be2030 is UP: PING OK - Packet loss = 0%, RTA = 37.99 ms

The system logs indicate a temperature error, followed by a controller failure reported at the next power-up:

/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=02/04/2019
    time=14:03
    description=Critical Temperature Threshold Exceeded
                
/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=02/04/2019
    time=14:11
    description=Option ROM POST Error: 1719-Slot 3 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0xAB) Action: Install the latest controller firmware. If the problem persists, replace the controller.
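
Records in this format are simple to turn into dictionaries for filtering by severity. The snippet below is a minimal sketch, assuming the `show /system1/log1/...` output always uses the `key=value` layout seen in the two records above; `parse_ilo_records` is a hypothetical helper name, not an HPE tool.

```python
SAMPLE_LOG = """\
/system1/log1/record16
  Targets
  Properties
    number=16
    severity=Critical
    date=02/04/2019
    time=14:03
    description=Critical Temperature Threshold Exceeded

/system1/log1/record17
  Targets
  Properties
    number=17
    severity=Caution
    date=02/04/2019
    time=14:11
    description=Option ROM POST Error: 1719-Slot 3 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0xAB) Action: Install the latest controller firmware. If the problem persists, replace the controller.
"""


def parse_ilo_records(text):
    """Parse iLO event log records into a list of dicts, one per record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/system1/log1/record"):
            # A record path starts a new entry.
            current = {"path": line}
            records.append(current)
        elif current is not None and "=" in line:
            # Split on the first "=" only; descriptions may contain "=" too.
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return records


records = parse_ilo_records(SAMPLE_LOG)
for rec in records:
    if rec["severity"] == "Critical":
        print(rec["date"], rec["time"], rec["description"])
```

Splitting on the first `=` only matters here because the description of record 17 itself contains `lock up code = 0xAB`.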

Checked the temperature in the rack; all looks good. Added blanks to the rack, since we have only 8 servers in it. Leaving the task open for another week.

@fgiunchedi is it possible to depool this server for me to do a firmware upgrade before I resolve the task?

Yes, a clean shutdown of the host is enough, let me know later today when you are online and I'll do the poweroff.

@fgiunchedi we can do this tomorrow if that's okay with you. Thanks.

Sounds good to me -- ping me on IRC when good to go!

Firmware upgrade complete. Resolving this task for now.