
Transient failures of IPMI commands to elastic2017
Closed, Resolved · Public

Description

While reimaging the elasticsearch cluster, I've seen multiple IPMI command failures (see log below), although the same command also succeeds at other times. The command is run from sarin against elastic2017, so within the same DC. I have seen the same error with other elasticsearch servers, but it seems easiest to reproduce on elastic2017.

gehel@sarin:~$ sudo ipmitool -I lanplus -H elastic2017.mgmt.codfw.wmnet -U root -E chassis power status
Unable to read password from environment
Password: 
> Error: no response from RAKP 1 message
Error: Received an Unexpected RAKP 2 message
Bad response length, len=52
Unable to get Chassis Power Status
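
Since the same command sometimes succeeds, a retry wrapper can help tell a transient failure from a persistent one. The following is a minimal sketch, not something used in this task: the `retry` function, the attempt count, and the delay are all assumptions for illustration. It only papers over the symptom and does not address the underlying cause.

```shell
# Hypothetical helper: run a command up to <max_attempts> times,
# sleeping <delay_seconds> between failed attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts delay attempt
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        # Stop at the first successful run.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example invocation (assumed values: 5 attempts, 2 seconds apart):
# retry 5 2 sudo ipmitool -I lanplus -H elastic2017.mgmt.codfw.wmnet -U root -E chassis power status
```

Because `-E` did not find a password in the environment in the log above, ipmitool would prompt for it on every attempt when run this way.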

Event Timeline

Gehel created this task. Aug 10 2018, 9:27 AM
Restricted Application added a subscriber: Aklapper. Aug 10 2018, 9:27 AM

Mentioned in SAL (#wikimedia-operations) [2018-08-10T12:27:34Z] <gehel> resetting management card on elastic2017 - T201671

Resetting the mgmt card might help, according to https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card

Note: update the documentation above if we find another cause of the issue while investigating.

Gehel closed this task as Resolved. Aug 10 2018, 12:33 PM
Gehel claimed this task.

Looks like a reset of the mgmt card fixed the issue.