
Unresponsive/misconfigured iDRACs over the host-BMC interface
Closed, Resolved (Public)

Description

As part of the issues yesterday, we started looking into monitoring our iDRAC/iLOs (also see T169321). I deployed a Puppet fact today that fetches the IP/MAC address etc. from the BMCs. The fact basically runs bmc-config -o -S Lan_Conf. As part of that, I found various BMCs (mostly Dells) in weird states and was able to fix some of them. These states were:

  • Returning stale IP addresses (fixed with a racadm racreset)
  • Returning the IP address, then hanging for a while and never returning MAC address/gateway/netmask
  • Not returning the MAC address
  • Unresponsive from within the machine, but responsive from the network. In some cases fixed with racadm racreset, in others not responding at all
  • Completely unresponsive from both within the machine, as well as externally.
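
Several of these states (missing MAC address, missing gateway/netmask) can be detected from the Lan_Conf output itself. The following is a minimal Python sketch, not the Puppet fact that was actually deployed. It assumes the usual FreeIPMI checkout layout (a `Section Lan_Conf` ... `EndSection` block of whitespace-separated key/value lines with `#` comments). The function names, the list of required keys, and the sample text (including its netmask) are illustrative assumptions.

```python
# Sketch only: parse a bmc-config/ipmi-config style "Lan_Conf" checkout.
# The key names below follow the FreeIPMI layout as assumed in the lead-in.

REQUIRED_KEYS = (
    "IP_Address",
    "MAC_Address",
    "Subnet_Mask",
    "Default_Gateway_IP_Address",
)


def parse_lan_conf(text):
    """Return a dict of key -> value for the Lan_Conf section of `text`."""
    values = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("Section "):
            in_section = line.split()[1] == "Lan_Conf"
            continue
        if line == "EndSection":
            in_section = False
            continue
        if in_section:
            parts = line.split(None, 1)
            values[parts[0]] = parts[1].strip() if len(parts) == 2 else ""
    return values


def missing_fields(values, required=REQUIRED_KEYS):
    """List the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Illustrative sample (not real output): a BMC that returns no MAC address.
SAMPLE = """\
Section Lan_Conf
	## Give valid IP address
	IP_Address                                    10.65.4.99
	## Give valid IP address
	Subnet_Mask                                   255.255.252.0
	## Give valid IP address
	Default_Gateway_IP_Address                    192.168.0.1
EndSection
"""
```

Calling `missing_fields(parse_lan_conf(SAMPLE))` on the sample flags `MAC_Address`, which corresponds to the "Not returning the MAC address" state above.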

For the ones I didn't manage to fix, we'll need to take them out of service, drain flea power and power them on again. If that doesn't fix it, we should do an iDRAC firmware upgrade as well (or perhaps we should do it regardless, if it's easy).

Servers with unresponsive iDRACs:

  • db1053.eqiad.wmnet
  • sodium.wikimedia.org
  • mw1182.eqiad.wmnet
  • mw1190.eqiad.wmnet
  • mw1191.eqiad.wmnet
  • mw1196.eqiad.wmnet (T170441)
  • mw1199.eqiad.wmnet
  • mw2154.codfw.wmnet
  • mw2201.codfw.wmnet (T170307)
  • mw2202.codfw.wmnet (T170307)
  • labsdb1001.eqiad.wmnet (Cisco, ignore)
  • labsdb1003.eqiad.wmnet (Cisco, ignore)

Servers with a responsive iDRAC/iLO that returns wrong LAN information (different from what is configured); these can probably be fixed with a BMC reset:

  • analytics1047.eqiad.wmnet
  • analytics1061.eqiad.wmnet
  • cp1045.eqiad.wmnet
  • ganeti1006.eqiad.wmnet
  • logstash1005.eqiad.wmnet
  • mw1230.eqiad.wmnet
  • ores1008.eqiad.wmnet
  • stat1004.eqiad.wmnet
  • auth2001.codfw.wmnet
  • db2074.codfw.wmnet
  • cp2010.codfw.wmnet
  • mw2172.codfw.wmnet
  • mw2204.codfw.wmnet
  • pc2006.codfw.wmnet

Event Timeline

is it okay to power off or do these need to be scheduled?

mw1196 did not come back, fans were running but no output on crash cart. downed it but probably needs decom

In addition to these, the following return 192.168.0.0/16 addresses for either their address or the gateway:

  • analytics1047.eqiad.wmnet
  • analytics1061.eqiad.wmnet
  • auth2001.codfw.wmnet
  • cp1045.eqiad.wmnet
  • cp2010.codfw.wmnet
  • db2074.codfw.wmnet
  • ganeti1006.eqiad.wmnet
  • logstash1005.eqiad.wmnet
  • mw1230.eqiad.wmnet
  • mw2172.codfw.wmnet
  • mw2204.codfw.wmnet
  • ores1008.eqiad.wmnet
  • pc2006.codfw.wmnet
  • stat1004.eqiad.wmnet

These may be fixed by an iDRAC/iLO reset (racadm racreset or cd /map1, reset respectively).
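
A check like the one used to build the list above could look roughly like this. It is a minimal Python sketch using the standard `ipaddress` module; the function names are hypothetical, and the example addresses are the ones reported for analytics1061 in this task.

```python
import ipaddress

# Range that showed up unexpectedly in the BMC LAN configuration.
SUSPECT_NET = ipaddress.ip_network("192.168.0.0/16")


def in_suspect_range(addr):
    """True if `addr` parses as an IP address inside 192.168.0.0/16."""
    try:
        return ipaddress.ip_address(addr) in SUSPECT_NET
    except ValueError:
        # Empty or garbled values are not counted here; they are a
        # separate failure mode (BMC not returning the field at all).
        return False


def looks_misconfigured(ip, gateway):
    """Flag a BMC whose address or gateway falls in the suspect range."""
    return in_suspect_range(ip) or in_suspect_range(gateway)
```

For example, `looks_misconfigured("10.65.4.99", "192.168.0.1")` is true, while `looks_misconfigured("10.65.4.99", "10.65.0.1")` is false.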

faidon renamed this task from Unresponsive iDRACs to Unresponsive/misconfigured iDRACs.Jul 4 2017, 3:13 PM
faidon updated the task description.
faidon added a subscriber: Papaul.

Regarding sodium, it seems to me that Puppet runs get stuck because Puppet tries to execute ipmi-config while loading facts:

143053 pts/0    S+     0:00  |                       \_ /bin/bash /usr/local/sbin/run-puppet-agent
143080 pts/0    Sl+    0:02  |                           \_ /usr/bin/ruby /usr/bin/puppet agent --onetime --no-daemonize --verbose --no-splay --show_diff --ignorecache --no-usecacheonfailure
143331 pts/0    S+     0:29  |                               \_ /usr/sbin/ipmi-config --category=core -o -S Lan_Conf

It eventually gets unblocked after about 10 minutes, printing "Unable to get Number of Users", and the Puppet run continues.

# time run-puppet-agent
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Unable to get Number of Users
Info: Caching catalog for sodium.wikimedia.org
Info: Applying configuration version '1499726057'
Notice: /Stage[main]/Mirrors::Serve/Letsencrypt::Cert::Integrated[mirrors]/Exec[acme-setup-acme-mirrors]/returns: executed successfully
Notice: Finished catalog run in 10.25 seconds

real	10m24.171s
user	0m46.704s
sys	0m7.784s
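
One way to keep a hung BMC from stalling the whole run is to bound the time the fact spends waiting on the command. The sketch below is in Python for illustration only (the real fact is a Puppet fact and is not shown in this task); the helper name and the 30-second limit are assumptions, while the command line is the one visible in the process listing above.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` and return its stdout, or None on timeout or failure.

    subprocess.run() kills the child process when the timeout expires,
    so an unresponsive BMC costs at most `timeout_s` seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the binary is missing / not executable.
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout


# Hypothetical usage with an assumed 30-second limit:
# output = run_with_timeout(
#     ["/usr/sbin/ipmi-config", "--category=core", "-o", "-S", "Lan_Conf"],
#     30,
# )
```

Returning `None` instead of raising lets the caller report the BMC as unresponsive, which is itself a useful signal for monitoring.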
Dzahn subscribed.

analytics1047 already seemed ok, showed the right IP, racreset anyways but it stayed the same

analytics1061 also showed the right IP but the wrong gateway. (192.168.0.1 instead of 10.65.0.1 with IP 10.65.4.99) but it was fixed after a racreset.

I racreset all of the ones in the list whose reported IP configuration differed from what is configured (showing 192.168.0.1 as gateway) and they're all fixed now.

I tried to racreset all of the ones in the "somewhat responsive" list and actually made them worse :) It seems that racreset from either bmc-device, SSH (even racadm racreset hard -f) or the web interface is not actually resetting properly. bmc-config stopped working but the BMC never reset properly (e.g. my SSH session remained active). These will need a proper power reset (thankfully they're just mw* hosts).

faidon renamed this task from Unresponsive/misconfigured iDRACs to Unresponsive/misconfigured iDRACs over the host-BMC interface.Jul 11 2017, 12:39 AM

Mentioned in SAL (#wikimedia-operations) [2017-07-11T17:20:45Z] <mutante> mw2201, mw2202 - depool appservers for T169360 (drain flea power)

on both mw2201 and mw2202 I am getting

I also cannot reset the iDRAC from the BIOS.
This looks like a hardware problem; we will have to contact Dell for a main board replacement.

worked with papaul to drain flea power for the remaining codfw ones:

mw2154 has been fixed after draining flea power and now works
mw2201/mw2202 did not get fixed by it, see papaul's comment above - creating subtask to contact Dell and replace board

Mentioned in SAL (#wikimedia-operations) [2017-07-11T19:17:58Z] <paravoid> shutting down sodium for iDRAC reset (T169360)

faidon claimed this task.

So it seems like the remaining ones are:

  • labsdb100{1,3}: Ciscos, ignore (T142807)
  • mw1196: broken, to be decom'ed (T170441)
  • mw2201/2202: broken, to be replaced (T170307)

I think we can resolve this and its parent for all intents and purposes and track the semi-relevant tasks above instead.

mw2202 is fixed as well.