
remote ipmi doesn't work for es2013
Closed, ResolvedPublic

Description

IPMI calls seem to fail when trying to reimage. Either the service is disabled or degraded, or there is some other configuration error; this needs checking. The regular management interface works as intended, but the IPMI failure forced a manual reimage.

We should check logical and network configuration, then try a "power drain" to get it to respond.

Event Timeline

jcrespo created this task. Apr 11 2018, 2:26 PM
Restricted Application added a project: Operations. Apr 11 2018, 2:26 PM
Restricted Application added a subscriber: Aklapper.

T150160 suggests racadm reset may fix it.

Marostegui triaged this task as Normal priority.
Marostegui moved this task from Triage to In progress on the DBA board.

@Marostegui is it okay for me to reboot the server?

@Papaul let me double check with @jcrespo as he is/was working with esXXXX servers.

Not now, I will have to depool it. Give me 5 minutes.

jcrespo claimed this task. Apr 12 2018, 2:46 PM
jcrespo lowered the priority of this task from Normal to Low.
jcrespo removed a project: ops-codfw.

@Papaul @Marostegui Please don't do anything until it is clear what the issue is.

Now that I have a way to test it, we can proceed with depooling:

$ ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power status 
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
$ ipmitool -I lanplus -H es2012.mgmt.codfw.wmnet -U root -E chassis power status 
Unable to read password from environment
Password: 
Chassis Power is on
$ ipmitool -I lanplus -H es2011.mgmt.codfw.wmnet -U root -E chassis power status 
Unable to read password from environment
Password: 
Chassis Power is on
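
The comparison above could be wrapped in a small helper so the same check can be repeated after each attempted fix. This is only a sketch: the function names `classify_ipmi_output` and `check_host` are hypothetical, and it assumes the password is exported in `IPMI_PASSWORD`, the environment variable that ipmitool's `-E` flag reads (the "Unable to read password from environment" lines above suggest it was not set in these runs).

```shell
#!/bin/sh
# Hypothetical helper: classify the text printed by
# `ipmitool ... chassis power status`.
# Prints "ok", "session-failed", or "unknown".
classify_ipmi_output() {
    case "$1" in
        *"Chassis Power is"*)         echo "ok" ;;
        *"Unable to establish IPMI"*) echo "session-failed" ;;
        *)                            echo "unknown" ;;
    esac
}

# Hypothetical helper: run the same command used in this task against one
# management host and print "<host><TAB><classification>".
# Assumes IPMI_PASSWORD is exported so that -E does not prompt.
check_host() {
    ipmi_out=$(ipmitool -I lanplus -H "$1" -U root -E chassis power status 2>&1)
    printf '%s\t%s\n' "$1" "$(classify_ipmi_output "$ipmi_out")"
}
```

It could then be called as `for h in es2011 es2012 es2013; do check_host "$h.mgmt.codfw.wmnet"; done`. Matching on the output text rather than on the exit status is a deliberate simplification that mirrors the messages quoted in this task.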

Change 425835 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool es2013

https://gerrit.wikimedia.org/r/425835

Change 425835 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Depool es2013

https://gerrit.wikimedia.org/r/425835

jcrespo reassigned this task from jcrespo to Papaul. Apr 12 2018, 3:06 PM
jcrespo added a project: DC-Ops.

@Papaul you are now free to handle the server: it is up, but with all services down and depooled. I would try the reset I proposed earlier first and, if that doesn't work, maybe check the BIOS/admin config?

jcrespo raised the priority of this task from Low to Normal. Apr 12 2018, 3:06 PM

The reset suggested by a previous ticket is the one in T191977#4123270 (racadm reset).

1- Power drain
2- Reset IDRAC
3- Update BIOS from 2.1.7 to 2.7.1
4- Update IDRAC from 2.21 to 2.52

Papaul reassigned this task from Papaul to jcrespo. Apr 12 2018, 3:55 PM

It is still not working:

root@neodymium:/home/marostegui#  ipmitool -I lanplus -H es2013.mgmt.codfw.wmnet -U root -E chassis power status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session

Mentioned in SAL (#wikimedia-operations) [2018-04-12T16:07:20Z] <marostegui> Reboot es2013 - T191977

After the reboot that @Papaul suggested, it still doesn't work :-(

Volans closed this task as Resolved. Apr 12 2018, 4:46 PM
Volans added a subscriber: Volans.

I've fixed it; it was a case of password misalignment, one of the cases described in T150160.