troubleshoot drac on ms-be2010.codfw.wmnet
Closed, Declined (Public)

Description

System ms-be2010.codfw.wmnet won't respond to IPMI commands, but it will allow drac connections via ssh.

Since IPMI interfaces with the power controls, and drac is known to get into bad states, the first step will be total power removal and power-cycling the system.

Once that is done, attempt to run the command: sudo ipmi-chassis --get-chassis-status

If it gives an error, then it is still not working. Advise flashing firmware to the newest revision and reattempting before we open a case with Dell.
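The check above can be wrapped in a small shell function so the pass/fail result is explicit. This is only a sketch: the name `check_ipmi` and its output strings are hypothetical and not part of any existing tooling, and it assumes `ipmi-chassis` signals failure with a non-zero exit status.

```shell
#!/bin/sh
# Hypothetical helper: run an IPMI status command and report pass/fail.
# With no arguments it falls back to the command quoted in this task.
# Assumption: the command exits non-zero when IPMI is not responding.
check_ipmi() {
    if [ "$#" -eq 0 ]; then
        set -- sudo ipmi-chassis --get-chassis-status
    fi
    # Capture stdout and stderr together so the error text is preserved.
    if output=$("$@" 2>&1); then
        printf 'IPMI OK\n%s\n' "$output"
        return 0
    fi
    printf 'IPMI FAILED: %s\n' "$output"
    return 1
}
```

Running `check_ipmi` on the host after the power-cycle, and again after any firmware flash, gives a consistent result to paste into the task. Passing a different command as arguments is only there so the function can be exercised without IPMI hardware.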

Event Timeline

The host can be taken down at any time, with a clean shutdown to make sure all services are stopped.

The host still has the same issue:

robh@ms-be2010:~$ sudo ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: internal system error

I'd recommend these next steps:

  • manually confirm ipmi is enabled in drac bios
  • if it was, flash firmware of drac/bios
  • reattempt test

Cannot flash the drac/bios firmware; I get an error message when trying to update it.
Resetting the iDRAC to factory defaults hangs at 1% for 45 minutes, so I had to cancel. I will resume troubleshooting tomorrow when back on site.

This system also has an expired iDRAC license. It looks like all the ms-be servers purchased in 2012 (ms-be2001 through ms-be2012) will have the same problem soon.

It is odd we cannot flash the firmware at all. This is overall a major issue.

It does not allow uploading the firmware at all.

The license should not expire, that is strange. I've downloaded it from Dell's license management site:

- iDRAC7 Enterprise,Perpetual,Digital License only

Please update the license, and see if that fixes the updating issues.

@RobH I updated the license and tried to update the firmware, but I get the same error. The image I am using is the same image I used on ms-be2002.

Selection_001.png (448×1 px, 26 KB)

That is indeed odd, since they are both R720xd. Unfortunately, the system is no longer covered under warranty with Dell, so we cannot contact their support for assistance on it directly.

I've chatted with @Papaul about this system, and the following troubleshooting steps have been taken (@Papaul, please correct me if any of this is mistaken):

  • system had all power cables removed (by @Papaul) for a full system reset of the drac - did not resolve ipmi issue or firmware update issue, as it was the FIRST step taken when Papaul took over this task for troubleshooting.
  • system fails to take the updated firmware drac/ilom listed on http://www.dell.com/support/home/us/en/04/product-support/servicetag/2FMLYV1/drivers
    • both @Papaul and @RobH tried to update this independently of one another (we each downloaded the file ourselves, and it worked on other R720xd systems.)
  • system fails to enable IPMI

So this host won't allow us to update the firmware, and it won't let us enable IPMI. It otherwise seems to function, but will block fleet-wide adoption of IPMI.

Next steps:

  • Decide if we want to keep this machine in service or decommission due to lack of warranty + hardware issues (IPMI + firmware won't update.)

I'm going to assign this to @fgiunchedi for his feedback regarding the potential decommission of ms-be2010. I'm uncertain as to the roadmap for replacement of this particular system, and how its loss may affect his planning.

Please provide feedback. All hardware decoms also require that we get @mark (or @faidon in his stead) approval.

The only other thing I could think to do is have @Papaul offline the host and try to flash the bios (not the idrac) and see if that resolves things. It will require downtime.

We can decom the old ms-be machines as soon as the new ms-be hardware is fully in service. Specifically for ms-be2010 I wouldn't spend too much time after fixing its idrac, it could be one of the first machines we decom and be done with it. The rest of decoms can happen at whichever pace we'd like.

HUZZAH! =]

ms-be2010 is decom'ed now, resolving.