hw troubleshooting: remote IPMI not working for ms-be105[7-8].eqiad.wmnet
Closed, Resolved | Public | Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox. (systems still in swift rings)
  • Provide urgency of request, along with justification (redundancy, dependencies, etc)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

Two Swift backends, ms-be1057.eqiad.wmnet and ms-be1058.eqiad.wmnet, have misbehaving iLOs. Specifically, no remote ipmitool command works, e.g.:

mvernon@cumin2002:~$ sudo ipmitool -I lanplus -H "ms-be1057.mgmt.eqiad.wmnet" -U root -E chassis power status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
mvernon@cumin2002:~$ sudo ipmitool -I lanplus -H "ms-be1058.mgmt.eqiad.wmnet" -U root -E chassis power status
Unable to read password from environment
Password: 
Error: Unable to establish IPMI v2 / RMCP+ session
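The same check can be repeated for both hosts with a small loop. This is only a sketch: the helper names mgmt_fqdn and check_remote_ipmi are hypothetical, the management hostname pattern is inferred from the two commands above, and it assumes ipmitool exits non-zero when the session cannot be established.

```shell
#!/bin/sh
# Hypothetical helper: derive the management-interface name from a host FQDN,
# e.g. ms-be1057.eqiad.wmnet -> ms-be1057.mgmt.eqiad.wmnet (pattern inferred
# from the commands in this task, not from any official naming reference).
mgmt_fqdn() {
    printf '%s\n' "${1%%.*}.mgmt.${1#*.}"
}

# Hypothetical helper: run the same remote power-status query as above against
# each host given as an argument and report whether it succeeded.
# Assumption: ipmitool returns a non-zero exit status on session failure.
check_remote_ipmi() {
    for host in "$@"; do
        mgmt=$(mgmt_fqdn "$host")
        if sudo ipmitool -I lanplus -H "$mgmt" -U root -E chassis power status; then
            echo "$host: remote IPMI OK"
        else
            echo "$host: remote IPMI FAILED"
        fi
    done
}
```

Usage would be something like: check_remote_ipmi ms-be1057.eqiad.wmnet ms-be1058.eqiad.wmnet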

This means that, for example, the reimage cookbook is unable to operate, so these nodes are blocking T279637 (hence the urgency). I have followed the troubleshooting instructions and, other than remote IPMI, everything looks as you would expect:

mvernon@ms-be1057:~$ sudo ipmi-chassis --get-chassis-status
System Power                        : on
Power overload                      : false
Interlock                           : inactive
Power fault                         : false
Power control fault                 : false
Power restore policy                : Restore
Last Power Event                    : unknown
Chassis intrusion                   : inactive
Front panel lockout                 : inactive
Drive Fault                         : false
Cooling/fan fault                   : false
Chassis Identify state              : off
Power off button                    : enabled
Reset button                        : enabled
Diagnostic Interrupt button         : enabled
Standby button                      : enabled
Power off button disable            : unallowed
Reset button disable                : unallowed
Diagnostic interrupt button disable : unallowed
Standby button disable              : unallowed

mvernon@ms-be1057:~$ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff
mvernon@ms-be1057:~$ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
mvernon@ms-be1057:~$ sudo facter -p ipmi_chassis.boot_flags.device
NO-OVERRIDE

[likewise ms-be1058]
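For anyone repeating the local checks, the chassis status output can be scanned for fault flags rather than read by eye. A minimal sketch follows; chassis_faults is a hypothetical helper, and it assumes the "Field : value" layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `ipmi-chassis --get-chassis-status` output on stdin
# and print every fault/overload field whose value is not "false".
# Prints nothing when no such field is set.
chassis_faults() {
    awk -F' *: *' '
        {
            # Drop trailing whitespace or carriage returns from the value.
            sub(/[ \t\r]+$/, "", $2)
        }
        tolower($1) ~ /fault|overload/ && $2 != "false" {
            print $1 "=" $2
        }
    '
}
```

Usage would be something like: sudo ipmi-chassis --get-chassis-status | chassis_faults
For the ms-be1057 output above this prints nothing, since every fault and overload field is false.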

I have tried resetting the iLO via reset /map1 from the management interface, but this hasn't fixed it. The only other thing suggested in the troubleshooting list is a cold reset (shut down, unplug power cables), which might be worth a shot?

The complication is that both nodes are live Swift backends, so they contribute to the 3x replication of data. Either one can be shut down while you work on it, but I don't want to leave either down for long if possible, and in particular we can't have more than one off at once. Sorry!

Happy to co-ordinate a good time for me to shut one or other host off - either here or IRC.