- - Provide FQDN of system.
- - If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- - Put system into a failed state in Netbox. (systems still in swift rings)
- - Provide urgency of request, along with justification (redundancy, dependencies, etc.)
- - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- - Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.
Two Swift backends, ms-be1057.eqiad.wmnet and ms-be1058.eqiad.wmnet, have misbehaving iLOs. Specifically, no remote ipmitool command works, e.g.:
mvernon@cumin2002:~$ sudo ipmitool -I lanplus -H "ms-be1057.mgmt.eqiad.wmnet" -U root -E chassis power status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session
mvernon@cumin2002:~$ sudo ipmitool -I lanplus -H "ms-be1058.mgmt.eqiad.wmnet" -U root -E chassis power status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session
This means that e.g. the reimage cookbook is unable to operate (so these nodes are blocking T279637 hence the urgency). I have followed the troubleshooting instructions, and other than remote-ipmi, everything seems as you would expect:
mvernon@ms-be1057:~$ sudo ipmi-chassis --get-chassis-status
System Power : on
Power overload : false
Interlock : inactive
Power fault : false
Power control fault : false
Power restore policy : Restore
Last Power Event : unknown
Chassis intrusion : inactive
Front panel lockout : inactive
Drive Fault : false
Cooling/fan fault : false
Chassis Identify state : off
Power off button : enabled
Reset button : enabled
Diagnostic Interrupt button : enabled
Standby button : enabled
Power off button disable : unallowed
Reset button disable : unallowed
Diagnostic interrupt button disable : unallowed
Standby button disable : unallowed
mvernon@ms-be1057:~$ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff
mvernon@ms-be1057:~$ sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff
mvernon@ms-be1057:~$ sudo facter -p ipmi_chassis.boot_flags.device
NO-OVERRIDE
[likewise ms-be1058]
I have tried resetting the iLO via reset /map1 from the management interface, but this hasn't fixed it. The only other thing suggested on the troubleshooting list is a cold reset (shut down, unplug power cables), which might be worth a shot?
The complication is that both nodes are live Swift backends, so they contribute to the 3x replication of data. Either one can be shut down while you work on it, but I don't want to leave either down for a long time if possible, and in particular we can't have more than one off at once. Sorry!
Happy to co-ordinate a good time for me to shut one or other host off, either here or on IRC.
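
For re-testing after any reset, something like the following might save some typing. It is only a sketch: check_remote_ipmi is a hypothetical helper name, it runs the same ipmitool invocation as above (minus sudo), and it assumes IPMI_PASSWORD is exported in the environment so that -E can read it rather than prompting.

```shell
# Hypothetical helper: report whether remote IPMI answers on each management host.
# Assumption: IPMI_PASSWORD is exported, since that is what ipmitool -E reads.
check_remote_ipmi() {
    local host rc=0
    if [ -z "${IPMI_PASSWORD:-}" ]; then
        echo "IPMI_PASSWORD is not set" >&2
        return 2
    fi
    for host in "$@"; do
        # Same command as in the report above; only the exit status is used.
        if ipmitool -I lanplus -H "$host" -U root -E chassis power status >/dev/null 2>&1; then
            echo "$host: remote IPMI OK"
        else
            echo "$host: remote IPMI FAILED"
            rc=1
        fi
    done
    # Non-zero if any host failed, so it can be chained before a reimage.
    return "$rc"
}

# Usage:
# check_remote_ipmi ms-be1057.mgmt.eqiad.wmnet ms-be1058.mgmt.eqiad.wmnet
```

It returns non-zero if any host still fails, so it is easy to tell at a glance whether a cold reset helped.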