We rely on remote IPMI in a lot of cases but still often have issues with it.
An audit of IPMI reachability across the fleet found numerous hosts on which remote IPMI is not working, i.e. where a chassis status query via ipmitool from puppetmaster1001 fails.
Several issues have been identified:
- an IPMI misconfiguration where, in the Lan_Channel section, Volatile_Access_Mode and Non_Volatile_Access_Mode (the runtime value and the value after the next reboot, respectively) are set to Disabled instead of Always_Available.
- an IPMI misconfiguration where, in the Lan_Channel section, Volatile_Channel_Privilege_Limit and Non_Volatile_Channel_Privilege_Limit (the runtime value and the value after the next reboot, respectively) are set to Operator instead of Administrator.
- IPMI passwords getting out of sync with their iDRAC passwords. Running ssh root@$hostname racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 $password usually fixes this.
- BMCs being unresponsive to IPMI but responsive to SSH (a racadm racreset usually fixes this)
- BMCs being responsive to ping but unresponsive to SSH (this needs a power drain/cycle)
- BMCs being unresponsive to ping (this needs either a power drain/cycle or network debugging, e.g. for a bad cable)
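
The failure modes above can be turned into a small triage helper. The following is only a sketch, not the tooling used for the audit: the function names triage and lan_channel_problems are hypothetical, and the parser assumes the "Section Lan_Channel ... EndSection" text layout, with one key/value pair per line, that bmc-config prints on checkout. The expected values and the suggested fixes are taken from the list above.

```python
# Expected Lan_Channel values, taken from the list of issues above.
EXPECTED_LAN_CHANNEL = {
    "Volatile_Access_Mode": "Always_Available",
    "Non_Volatile_Access_Mode": "Always_Available",
    "Volatile_Channel_Privilege_Limit": "Administrator",
    "Non_Volatile_Channel_Privilege_Limit": "Administrator",
}


def lan_channel_problems(checkout_text):
    """Return {key: actual_value} for Lan_Channel keys with unexpected values.

    Assumption: the input uses "Section <name>" / "EndSection" markers,
    one "Key Value" pair per line, and "#" for comments.
    Keys that are absent from the input are not reported.
    """
    problems = {}
    in_section = False
    for raw in checkout_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split()
        if parts[0] == "Section":
            in_section = len(parts) > 1 and parts[1] == "Lan_Channel"
        elif parts[0] == "EndSection":
            in_section = False
        elif in_section and len(parts) >= 2:
            key, value = parts[0], parts[1]
            expected = EXPECTED_LAN_CHANNEL.get(key)
            if expected is not None and value != expected:
                problems[key] = value
    return problems


def triage(ping_ok, ssh_ok, ipmi_ok):
    """Map probe results for one BMC to the suggested fix from the list above."""
    if not ping_ok:
        return "power drain/cycle or network debugging (e.g. bad cable)"
    if not ssh_ok:
        return "power drain/cycle"
    if not ipmi_ok:
        return "check Lan_Channel settings and IPMI password, then try racadm racreset"
    return "ok"
```

The three booleans would come from whatever probes are at hand (ping, an SSH login to the management interface, and an ipmitool chassis status). Keeping the decision logic separate from the probing makes it easy to run against results that have already been collected.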
The hosts still affected right now are:
- conf1003.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
- db1063.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
- kafka1018.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
- kafka1020.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
- db1053.mgmt.eqiad.wmnet: responsive to ping but unresponsive to SSH & bmc-config
- restbase-dev1003.mgmt.eqiad.wmnet, see T169696
- mw1196.mgmt.eqiad.wmnet, see T169360#3395989
- sodium.mgmt.eqiad.wmnet, see T169360
- labsdb1001.mgmt.eqiad.wmnet: Cisco, ignore
- labsdb1003.mgmt.eqiad.wmnet: Cisco, ignore