Broken IPMI/drac on cp3038 and cp3045
Closed, Resolved, Public

Description

cp3038 didn't come back up after a reboot today. cp3038.mgmt.esams.wmnet responds to ping, but port 22 refuses connections:

$ nc cp3038.mgmt.esams.wmnet 22
cp3038.mgmt.esams.wmnet [10.21.0.159] 22 (ssh) : Connection refused

I did manage to get chassis status output:

$ sudo ipmitool -I lanplus -H cp3038.mgmt.esams.wmnet -U root -E chassis status
Unable to read password from environment
Password: 
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : previous
Last Power Event     : 
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Sleep Button Disable : not allowed
Diag Button Disable  : allowed
Reset Button Disable : not allowed
Power Button Disable : allowed
Sleep Button Disabled: false
Diag Button Disabled : true
Reset Button Disabled: false
Power Button Disabled: false
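
If this check is ever scripted, the power state can be pulled out of output like the above. A minimal sketch, assuming the `chassis status` format shown here; the function name `chassis_power_state` is hypothetical, and the commented example assumes `IPMI_PASSWORD` is exported so that `-E` can read it:

```shell
# Hypothetical helper: print the value of the "System Power" line
# from `ipmitool ... chassis status` output read on stdin.
chassis_power_state() {
    awk -F':' '/^System Power/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the value
        print $2
        exit
    }'
}

# Example (not run here):
# ipmitool -I lanplus -H cp3038.mgmt.esams.wmnet -U root -E chassis status | chassis_power_state
```

The function prints nothing if no "System Power" line is present, which is what happens when the IPMI session itself fails.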

ipmiconsole hangs like this:

$ sudo ipmiconsole -u root -P -h cp3038.mgmt.esams.wmnet
Password: 
[SOL established]

I've also tried chassis power cycle but that didn't help.

Event Timeline

ema renamed this task from Broken IPMI on cp3038 to Broken IPMI on cp3038, host failed coming back online after a reboot. Feb 8 2017, 10:03 AM
ema added a project: DC-Ops.

I've just tried to get the drac back as follows:

sudo ipmitool -I lanplus -H cp3038.mgmt.esams.wmnet -U root mc reset cold

That brought the host back online, but there is still no SSH access to the DRAC.
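
For reference, the two remote recovery steps tried so far can be written down as a dry run. This is only a sketch: the function name `print_recovery_steps` is hypothetical, it prints the commands rather than running them, and adding `-E` assumes `IPMI_PASSWORD` is exported (the command above was run without it).

```shell
# Hypothetical dry-run helper: print, in the order tried in this task,
# the remote recovery commands for one management host. Nothing is executed.
print_recovery_steps() {
    host="$1"
    base="ipmitool -I lanplus -H $host -U root -E"
    echo "$base chassis power cycle"   # power-cycles the host itself
    echo "$base mc reset cold"         # cold-resets the management controller
}

# Example: print_recovery_steps cp3038.mgmt.esams.wmnet
```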

ema renamed this task from Broken IPMI on cp3038, host failed coming back online after a reboot to Broken IPMI/drac on cp3038. Feb 8 2017, 10:37 AM
ema added a project: ops-esams.

cp3038 is under warranty until 2018-03-04. It has a service tag of 45MYV42.

When the DRAC stops working, removing ALL power from the system often clears the DRAC's faulty state. It may be that simple, and we'll have a working server again. This will require onsite assistance.

@mark: Would you like us to wait for you to be onsite, or open a smart hands request to have the power cables unplugged and plugged back into this host?

As this is a non-urgent host, I wanted to get your approval in advance before using smart hands (we have a monthly allowance of smart hands hours that we likely do not use up).

Please advise,

Got bitten by this again today while rebooting cp3038 into Linux 4.9. I'd say it's time to fix this machine.

Note that this time there is no way to bring the host back online remotely; even chassis status fails:

$ sudo ipmitool -I lanplus -H cp3038.mgmt.esams.wmnet -U root -E chassis status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
ema renamed this task from Broken IPMI/drac on cp3038 to Broken IPMI/drac on cp3038 and cp3045. Apr 13 2017, 3:11 PM

Same issue on cp3045: the mgmt IP is reachable but I can't ssh into it. Further, chassis status fails with:

Error: Unable to establish IPMI v2 / RMCP+ session

Both cp3045 and cp3038 are currently down.
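
The three symptoms seen in this task (mgmt IP answers ping, port 22 refuses connections, IPMI session cannot be established) could be collected in one pass. A rough sketch rather than a tested tool: the function names are hypothetical, `ping -W` assumes Linux iputils, and `-E` assumes `IPMI_PASSWORD` is exported.

```shell
# Hypothetical helper: map three check results ("ok" or "fail") to a summary.
# Arguments: ping result, ssh-port result, IPMI session result.
classify_mgmt() {
    case "$1/$2/$3" in
        ok/ok/ok)     echo "healthy" ;;
        ok/fail/ok)   echo "ssh-down-ipmi-up" ;;     # first cp3038 report
        ok/fail/fail) echo "ssh-down-ipmi-down" ;;   # as reported for cp3045
        fail/*)       echo "mgmt-unreachable" ;;
        *)            echo "other" ;;
    esac
}

# Run a command quietly and print "ok" or "fail" based on its exit status.
# stdin is /dev/null so a password prompt cannot block the check.
probe() {
    if "$@" </dev/null >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# Check one management hostname with ping, nc and ipmitool.
check_mgmt() {
    host="$1"
    classify_mgmt \
        "$(probe ping -c 1 -W 2 "$host")" \
        "$(probe nc -z -w 5 "$host" 22)" \
        "$(probe ipmitool -I lanplus -H "$host" -U root -E chassis status)"
}

# Example: check_mgmt cp3045.mgmt.esams.wmnet
```

Keeping the classification separate from the probes means the mapping can be checked without touching the network.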

I just contacted EvoSwitch remote hands and asked them to perform a power swap on both of those systems. @BBlack/@ema are Cc'ed.

Both came up just a few minutes ago :)

Both hosts are indeed up and running. iDRAC is fixed too. Closing, thanks!