Following yesterday's issues, we started looking into monitoring our iDRACs/iLOs (see also T169321). Today I deployed a Puppet fact that fetches the IP address, MAC address, etc. from the BMCs; the fact basically runs bmc-config -o -S Lan_Conf. While doing that, I found various BMCs (mostly Dells) in weird states and was able to fix some of them. These states were:
- Returning stale IP addresses (fixed with a racadm racreset)
- Returning the IP address, then hanging for a while and never returning MAC address/gateway/netmask
- Not returning the MAC address
- Unresponsive from within the machine, but responsive from the network. In some cases this was fixed with racadm racreset; in others it was not responding at all
- Completely unresponsive, both from within the machine and externally
The ones I didn't manage to fix will need to be taken out of service, drained of flea power and powered on again. If that doesn't fix them, we should do an iDRAC firmware upgrade as well (or perhaps we should do it regardless, if it's easy).
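To make the states listed above easier to spot automatically, here is a minimal sketch of the kind of check that could be run on each host. It is not the deployed Puppet fact. It assumes bmc-config prints one "Key Value" pair per line between Section/EndSection markers, with "#" comment lines, and that the keys are named IP_Address, MAC_Address, Subnet_Mask and Default_Gateway_IP_Address. The function names, the 30-second timeout and the expected_ip parameter are hypothetical.

```python
import subprocess

# Keys assumed to be present in the Lan_Conf section (an assumption, adjust as needed).
REQUIRED_KEYS = (
    "IP_Address",
    "MAC_Address",
    "Subnet_Mask",
    "Default_Gateway_IP_Address",
)


def parse_lan_conf(text):
    """Parse 'Key Value' lines, skipping comments and section markers."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if parts[0] in ("Section", "EndSection"):
            continue
        # A key without a value is stored as an empty string.
        values[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return values


def classify(values, expected_ip=None):
    """Return a list of human-readable problems found in the parsed values."""
    problems = []
    for key in REQUIRED_KEYS:
        if not values.get(key):
            problems.append("missing " + key)
    actual_ip = values.get("IP_Address")
    if expected_ip and actual_ip and actual_ip != expected_ip:
        problems.append("unexpected IP_Address " + actual_ip)
    return problems


def query_bmc(expected_ip=None, timeout=30):
    """Run bmc-config locally; a hang is reported as a timeout problem."""
    try:
        result = subprocess.run(
            ["bmc-config", "-o", "-S", "Lan_Conf"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {}, ["timed out after %d seconds" % timeout]
    values = parse_lan_conf(result.stdout)
    return values, classify(values, expected_ip)
```

Calling query_bmc(expected_ip="10.0.0.5") on a host would return the parsed values together with a list of problems. A BMC that returns the IP address and then hangs shows up as a timeout rather than blocking the caller, and a missing MAC address or a stale IP address shows up as a separate entry in the list.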
Servers with unresponsive iDRACs:
- mw1196.eqiad.wmnet (T170441)
- mw2201.codfw.wmnet (T170307)
- mw2202.codfw.wmnet (T170307)
- labsdb1001.eqiad.wmnet (Cisco, ignore)
- labsdb1003.eqiad.wmnet (Cisco, ignore)
Servers with a responsive iDRAC/iLO that returns wrong LAN information (different from what is configured); these can probably be fixed with a BMC reset: