Maniphest T201671

Transient failures of IPMI commands to elastic2017
Closed, Resolved · Public

Authored By

	Gehel
	Aug 10 2018, 9:27 AM

Description

While reimaging the Elasticsearch cluster, I've seen IPMI commands fail intermittently (see log below): the same command fails on some attempts and succeeds on others. The command is run from sarin against elastic2017, so both hosts are in the same DC. I have seen the same error with other Elasticsearch servers, but it seems easier to reproduce on elastic2017.

gehel@sarin:~$ sudo ipmitool -I lanplus -H elastic2017.mgmt.codfw.wmnet -U root -E chassis power status
Unable to read password from environment
Password: 
> Error: no response from RAKP 1 message
Error: Received an Unexpected RAKP 2 message
Bad response length, len=52
Unable to get Chassis Power Status
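
Since the same command sometimes succeeds, one possible workaround while reimaging is to retry it a few times. The following is only a minimal sketch under that assumption: `retry` is a hypothetical helper (not part of ipmitool or of any existing tooling), and the attempt count and delay are arbitrary values.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached.
retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Run the command exactly as given; stop on the first success.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Usage with the command from the log above (5 attempts, 2 seconds apart):
# retry 5 2 ipmitool -I lanplus -H elastic2017.mgmt.codfw.wmnet -U root -E chassis power status
```

Note that `-E` makes ipmitool read the password from the `IPMI_PASSWORD` environment variable; if it is not set, each attempt falls back to the interactive prompt, as in the log. Retrying only hides the symptom and does not address whatever is wrong with the management card.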

Event Timeline

Gehel created this task. Aug 10 2018, 9:27 AM

Restricted Application added a subscriber: Aklapper. Aug 10 2018, 9:27 AM

Mentioned in SAL (#wikimedia-operations) [2018-08-10T12:27:34Z] <gehel> resetting management card on elastic2017 - T201671

Resetting the mgmt card might help, according to https://wikitech.wikimedia.org/wiki/Management_Interfaces#Reset_the_management_card

Note: update the documentation above if we find another cause of the issue while investigating.

Looks like a reset of the mgmt card fixed the issue.

EBernhardson moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. May 6 2019, 4:00 PM
