
Reset db1070 idrac
Closed, ResolvedPublic

Description

Hello!

Can we get db1070's idrac reset? It is failing to reinstall because of the IPMI error:

Error: Unable to establish IPMI v2 / RMCP+ session

Thanks!

Details

Related Gerrit Patches:
operations/mediawiki-config : master : Depool db1070 for hardware maintenance
operations/mediawiki-config : master : db-eqiad.php: Depool db1070

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 14 2017, 8:19 AM

Leaving this documented for the future. I tried a cold reset locally, but it doesn't fix the remote issue.

root@db1070:~# bmc-device --debug --cold-reset
=====================================================
Cold Reset Request
=====================================================
[               2h] = cmd[ 8b]
=====================================================
Cold Reset Response
=====================================================
[               2h] = cmd[ 8b]
[               0h] = comp_code[ 8b]
root@db1070:~# bmc-device --get-sel-time
ipmi_cmd_get_sel_time: BMC busy
root@db1070:~# bmc-device --get-sel-time
SEL Time : 04/26/2017 - 14:53:12

But that didn't fix the remote problem:

root@neodymium:/home/marostegui/git/software/dbtools# ipmitool -I lanplus -H db1070.mgmt.eqiad.wmnet -U root -E chassis
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session
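
Since the first `--get-sel-time` right after the cold reset came back with "BMC busy", here is a rough sketch of how the local wait could be scripted. `wait_for_bmc` and the `BMC_DEVICE` variable are hypothetical names, the attempt count and delay are arbitrary, and it assumes `bmc-device` exits non-zero while the BMC is still busy:

```shell
# Sketch only: poll the local BMC until it answers again after a cold reset.
# BMC_DEVICE is a hypothetical override so the command can be swapped out.
BMC_DEVICE="${BMC_DEVICE:-bmc-device}"

wait_for_bmc() {
    local attempts="${1:-10}"   # how many times to try (arbitrary default)
    local delay="${2:-5}"       # seconds to sleep between tries (arbitrary default)
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Assumption: a busy BMC makes this command exit non-zero.
        if "$BMC_DEVICE" --get-sel-time >/dev/null 2>&1; then
            echo "BMC ready after ${i} attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "BMC still not answering after ${attempts} attempts" >&2
    return 1
}
```

Run as root on the host itself, something like `bmc-device --cold-reset && wait_for_bmc` would cover the local side. It says nothing about the remote session, which is the part that is actually failing here.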

Cmjohnson closed this task as Resolved. · Apr 27 2017, 8:11 PM

Reset the idrac and it appears that db1070 is not accessible from ipmi tool

cmjohnson@db1070:~$ sudo ipmi-chassis --get-chassis-status
System Power : on
Power overload : false
Interlock : inactive
Power fault : false
Power control fault : false
Power restore policy : Restore
Last Power Event : unknown
Chassis intrusion : inactive
Front panel lockout : inactive
Drive Fault : false
Cooling/fan fault : false
Chassis Identify state : off
Power off button : enabled
Reset button : enabled
Diagnostic Interrupt button : disabled
Standby button : enabled
Power off button disable : allowed
Reset button disable : unallowed
Diagnostic interrupt button disable : allowed
Standby button disable : unallowed

Marostegui reopened this task as Open. · Apr 27 2017, 8:33 PM

Unfortunately, it still doesn't work from remote:

root@neodymium:~# ipmitool -I lanplus -H db1070.mgmt.eqiad.wmnet -U root -E chassis status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Apr 27 2017, 8:39 PM

@Marostegui I am not sure what to make of this... I know several servers have this issue, but on the server itself IPMI works fine.

@Cmjohnson this looks like the same issue that affects lots of servers, including dbstore1001 on this task: T158893#3186029. If we get Dell to fix this or advise on that ticket, we can probably apply the same fix to this host (and to the other ones suffering from the same issue).

Change 351822 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1070

https://gerrit.wikimedia.org/r/351822

Change 351822 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1070

https://gerrit.wikimedia.org/r/351822

Mentioned in SAL (#wikimedia-operations) [2017-05-04T12:40:22Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Depool db1070 for maintenance - T160392 (duration: 01m 35s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T12:42:41Z] <marostegui> Stop MySQL db1070 for maintenance - T160392

Mentioned in SAL (#wikimedia-operations) [2017-05-04T14:09:44Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Repool db1070 with less weight - T160392 (duration: 01m 16s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T14:24:05Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Increase db1070 weight - T160392 (duration: 01m 10s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T15:03:24Z] <marostegui@naos> Synchronized wmf-config/db-eqiad.php: Restore db1070 original weight - T160392 (duration: 00m 57s)

@Cmjohnson just checking whether you updated the iDRAC firmware in the end. No pressure by any means, I only want to know if I need to power cycle this host next week or not.
Thank you!

Change 352171 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] Depool db1070 for hardware maintenance

https://gerrit.wikimedia.org/r/352171

Change 352171 merged by jenkins-bot:
[operations/mediawiki-config@master] Depool db1070 for hardware maintenance

https://gerrit.wikimedia.org/r/352171

Mentioned in SAL (#wikimedia-operations) [2017-05-05T16:09:57Z] <jynus> shutting down db1070 for hw maintenance T160392

I updated the firmware on db1070 and ipmitool is still not working. I compared the iDRAC settings via the GUI with db1068 (where IPMI works) and found no differences between the two that I could tell. I even tried swapping the cable to see if it could be that (a long shot). Still no idea why IPMI is not working. All Dell troubleshooting requires several hours of trial and error and booting into their Dell live image. I will need to schedule extended downtime for this host, dbstore1001, or db1071 with @Marostegui or @jcrespo.

I did update the bios f/w while it was down.

Thanks @Cmjohnson for all the help.
We can test with db1070 as much as you like during the week (I just don't like leaving hosts down over the weekend). Just let me know when you need it down and I will handle it.
The only reason not to choose dbstore1001 is that it is our backups server, but if we find the issue with db1070, it is likely that dbstore1001 (and the rest of the affected hosts) is suffering from the same problem.

Thanks for spending time on this, we appreciate it!

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Jun 29 2017, 3:13 PM
Cmjohnson moved this task from Up next to Not urgent on the ops-eqiad board. · Jun 29 2017, 3:15 PM
faidon added a subscriber: faidon. · Jul 10 2017, 6:33 PM

FYI, db1071 is in a similar state, I'm not sure why.

faidon closed this task as Resolved. · Jul 10 2017, 10:32 PM

OK, so I noticed that the Error: Unable to establish IPMI v2 / RMCP+ session response was immediate, like the password was wrong. So I tried changing the password to something else from the iDRAC web interface, and then changing it back to our regular one, and this seems to have done the trick for both db1070 and db1071.

It probably keeps the passwords hashed internally in a different way for IPMI, and the two copies got out of sync somehow. Makes you wonder how many other hosts this happens on. We should probably monitor that as well :)
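
A minimal sketch of what such a check could look like, reusing the same remote `ipmitool` invocation as earlier in this task. `check_remote_ipmi`, `check_all` and the `IPMITOOL` variable are hypothetical names, not an existing check, and it assumes the password is exported as `IPMI_PASSWORD`, which is what `ipmitool -E` reads:

```shell
# Sketch only: try a remote IPMI session against each management interface
# and report the hosts where it fails.
# IPMITOOL is a hypothetical override so the command can be swapped out.
IPMITOOL="${IPMITOOL:-ipmitool}"

check_remote_ipmi() {
    local mgmt_host="$1"
    # Same command that fails above with "Unable to establish IPMI v2 / RMCP+ session".
    if "$IPMITOOL" -I lanplus -H "$mgmt_host" -U root -E chassis status >/dev/null 2>&1; then
        echo "OK $mgmt_host"
    else
        echo "FAIL $mgmt_host"
        return 1
    fi
}

check_all() {
    # Without IPMI_PASSWORD, ipmitool -E falls back to an interactive prompt.
    if [ -z "${IPMI_PASSWORD:-}" ]; then
        echo "IPMI_PASSWORD is not set" >&2
        return 2
    fi
    local failed=0
    local h
    for h in "$@"; do
        check_remote_ipmi "$h" || failed=$((failed + 1))
    done
    echo "$failed host(s) failed"
    [ "$failed" -eq 0 ]
}
```

Something like `check_all db1070.mgmt.eqiad.wmnet db1071.mgmt.eqiad.wmnet` would then exit non-zero if any host refuses the session. It cannot tell a password mismatch apart from other causes, so a failure is only a prompt to take a closer look.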

Nice catch faidon!! Thanks for fixing this, and special thanks for fixing dbstore1001, which is a critical host for us!