
cp30[34]x hw/firmware/BMC issues
Closed, Resolved · Public

Description

Virtually all of cp30[34]x (the newest esams cache hardware) have had trouble getting through a simple reboot. The symptoms have included:

  1. No hardware shutdown/reboot after systemd is done shutting down software (it hangs forever after the final systemd output, until racadm serveraction powercycle).
  2. Hardware reboot happens, but during initial boot (before entering GRUB) a bunch of garbled text is displayed (wrong serial port settings?); GRUB does eventually boot and everything comes up fine (rare).
  3. As in 2, but it never reaches GRUB.
  4. As in 2, reaches GRUB and then the kernel, but the kernel dies during initial boot somewhere around:
[    9.841835] ipmi_si ipmi_si.0: Using irq 10
[    9.842962] ------------[ cut here ]------------

On some servers, I've been able to recover from cases 3/4 via some invocation of racadm's powerdown/up/cycle commands and racadm racreset. So far there have been two machines I could not recover this way: cp3032 and cp3039. For those I've found a temporary workaround, which is to edit the kernel command line from GRUB and add modprobe.blacklist=ipmi_si for that one boot only, so they're back online for now.
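Since the blacklist is applied by hand for a single boot, it can be useful to confirm from a running host whether it actually booted with it. A minimal sketch, assuming the kernel command line is read as a string (on Linux it is available in /proc/cmdline); the function names and the sample command line are hypothetical, only the modprobe.blacklist=ipmi_si parameter comes from this task:

```python
from typing import Set


def blacklisted_modules(cmdline: str) -> Set[str]:
    """Collect module names from every modprobe.blacklist= parameter."""
    modules: Set[str] = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == "modprobe.blacklist":
            # The parameter takes a comma-separated list of module names.
            modules.update(name for name in value.split(",") if name)
    return modules


def ipmi_si_blacklisted(cmdline: str) -> bool:
    """True if this boot was started with ipmi_si blacklisted."""
    return "ipmi_si" in blacklisted_modules(cmdline)


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro modprobe.blacklist=ipmi_si"
print(ipmi_si_blacklisted(sample))
```

On a real host the input would be the contents of /proc/cmdline rather than the sample string.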

I've checked one example server against one of its twins in eqiad (which works fine), and every page of BIOS/iDRAC settings looks identical. My feeling is that this is BMC/iDRAC-related, something to do with power management. I *suspect* the garbled text output with bad serial settings is some kind of follow-up to a firmware update or a firmware command (one that executes on reboot). It's perhaps as simple as these servers needing a hard poweroff (power plugs removed) to reset from whatever bad state they're in.

Event Timeline

BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added projects: Traffic, ops-esams.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper.
BBlack renamed this task from "cp3032 is dead" to "cp30[34]x hw/firmware/BMC issues". Feb 8 2016, 10:07 PM
BBlack updated the task description.
BBlack set Security to None.

Add cp3043 to the list of nodes that needed the ipmi_si blacklist.

So for the record, the full list of hosts now running with ipmi_si blacklisted is: cp3032, cp3039, cp3043, and cp3045.

(if I had to guess, these machines won't correctly reboot/poweroff due to that, but who knows until we try)

FTR: I think I've done the blacklist hack on a couple more since, but not recorded them here. There was some suggestion elsewhere that we may need an iDRAC firmware update that papaul has applied at codfw...

Mentioned in SAL [2016-05-31T16:07:45Z] <bblack> depooling cp3032 to investigate T126062

Supporting the theory that these need firmware updates:

cp2001 racadm getversion:

Bios Version                     = 1.2.10
iDRAC Version                    = 2.10.10.10
Lifecycle Controller Version     = 2.10.10.10
IDSDM Version                    = NA

cp3032 racadm getversion:

Bios Version                     = 1.0.4
iDRAC Version                    = 2.02.01.01
Lifecycle Controller Version     = 2.02.01.01

Latest on Dell's site seems to be http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=5GCHC - going to reconfirm we still have issues, then try updating to that over tftp for a fix...
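The two getversion outputs above can be compared field by field rather than by eye. A minimal sketch, assuming the output is plain "name = value" lines as pasted here and that versions are dot-separated integers; the function names are hypothetical and this only parses text, it does not call racadm:

```python
from typing import Dict, List, Tuple

CP2001 = """\
Bios Version                     = 1.2.10
iDRAC Version                    = 2.10.10.10
Lifecycle Controller Version     = 2.10.10.10
IDSDM Version                    = NA
"""

CP3032 = """\
Bios Version                     = 1.0.4
iDRAC Version                    = 2.02.01.01
Lifecycle Controller Version     = 2.02.01.01
"""


def parse_getversion(output: str) -> Dict[str, str]:
    """Turn 'name = value' lines into a dict, skipping anything else."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def version_key(version: str) -> Tuple[int, ...]:
    """Compare numerically, so 2.02 sorts before 2.10 for the right reason."""
    return tuple(int(part) for part in version.split("."))


def older_fields(host: Dict[str, str], reference: Dict[str, str]) -> List[str]:
    """Names of fields where host has a lower version than reference."""
    older = []
    for name, value in host.items():
        ref = reference.get(name)
        if ref is None:
            continue
        try:
            if version_key(value) < version_key(ref):
                older.append(name)
        except ValueError:
            continue  # non-numeric value such as "NA"
    return older


print(older_fields(parse_getversion(CP3032), parse_getversion(CP2001)))
```

For the outputs pasted in this task, all three fields reported by cp3032 come out older than cp2001's.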

So... cp3032 rebooted fine via software, after I had done a preemptive racadm racreset. Will move on to a few others that were known problems in the past and see how they fare, until I find one that still reproduces.

Mentioned in SAL [2016-05-31T17:17:55Z] <bblack> depooled reboot of cp3040 - T126062

Mentioned in SAL [2016-05-31T19:50:45Z] <bblack> depooled reboot of cp3030 - T126062

Mentioned in SAL [2016-05-31T19:57:03Z] <bblack> depooled reboot of cp3031 - T126062

Mentioned in SAL [2016-05-31T20:02:34Z] <bblack> depooled reboot of cp3032 - T126062

Mentioned in SAL [2016-05-31T20:02:54Z] <bblack> depooled reboot of cp3033 (not 3032) - T126062

Mentioned in SAL [2016-05-31T20:09:55Z] <bblack> depooled reboot of cp3041 - T126062

Mentioned in SAL [2016-05-31T20:23:55Z] <bblack> depooled reboot of cp3042 - T126062

Mentioned in SAL [2016-05-31T20:28:34Z] <bblack> depooled reboot of cp3043 - T126062

BBlack claimed this task.

All of cache_text in esams (8/12 of the nodes considered affected) have rebooted into 4.4.2-3+wmf1 today without issue. It could be that some of the failures were related to an incompatibility between the previous 3.19 kernels and this BMC firmware (which has since been fixed on the Linux side), and/or the failures to reach GRUB could have been related to leftover junk from previously enqueued BMC post-reboot operations. Either way, in practice this seems resolved now, assuming we don't hit it again during the esams cache_upload reboots.