
Transient failures of IPMI commands to elastic2017
Closed, Resolved (Public)

Description

While reimaging the elasticsearch cluster, I saw multiple failures of IPMI commands (see log below), although the same command sometimes succeeds. The command is run from sarin against elastic2017, so both are in the same DC. I have seen the same error against other elasticsearch servers, but it seems easiest to reproduce on elastic2017.

gehel@sarin:~$ sudo ipmitool -I lanplus -H elastic2017.mgmt.codfw.wmnet -U root -E chassis power status
Unable to read password from environment
Password: 
> Error: no response from RAKP 1 message
Error: Received an Unexpected RAKP 2 message
Bad response length, len=52
Unable to get Chassis Power Status
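
Since the same command sometimes succeeds, one way to tell a transient failure from a persistent one is to retry it a few times. The following is only a sketch: `retry` is a hypothetical helper, not part of ipmitool, and the attempt count and delay are arbitrary choices.

```shell
# retry: run a command up to $1 times, sleeping $2 seconds between attempts.
# Hypothetical helper for illustration; not part of ipmitool.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        echo "attempt $n failed, retrying in ${delay}s" >&2
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, `retry 5 2 sudo ipmitool -I lanplus -H elastic2017.mgmt.codfw.wmnet -U root -E chassis power status` would try the command from the log up to five times. Retrying only works around the symptom; it does not address the underlying cause.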

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-08-10T12:27:34Z] <gehel> resetting management card on elastic2017 - T201671

Resetting the mgmt card might help, according to https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card

Note: update the documentation above if we find another cause of the issue while investigating.
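
For reference, a management card can also be reset remotely with ipmitool's `mc reset cold` subcommand. This is a sketch under assumptions, not necessarily the procedure described on the wiki page or the one used here: `reset_mgmt_card` and the `IPMITOOL` override are hypothetical names, and `-E` expects the password in the `IPMI_PASSWORD` environment variable.

```shell
# Hypothetical helper: cold-reset the management card (BMC) of a host over IPMI.
# IPMITOOL may be overridden, e.g. for a dry run; it defaults to ipmitool.
# With -E, ipmitool reads the password from the IPMI_PASSWORD environment variable.
reset_mgmt_card() {
    mgmt_host=$1
    "${IPMITOOL:-ipmitool}" -I lanplus -H "$mgmt_host" -U root -E mc reset cold
}
```

A call would look like `reset_mgmt_card elastic2017.mgmt.codfw.wmnet`. Because the reset itself travels over the same IPMI path, it may also fail while the card is misbehaving.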

Gehel claimed this task.

Looks like a reset of the mgmt card fixed the issue.