Virtually all of the cp30[34]x hosts (the newest esams cache hardware) have had trouble getting through a simple reboot. The symptoms have been one of:
- No hardware shutdown/reboot after systemd is done shutting down software (it just hangs forever after the final systemd output, until racadm serveraction powercycle)
- Hardware reboot happens, but during initial boot (before entering GRUB) a bunch of garbled junk text is displayed (wrong serial port settings?). GRUB does eventually load and everything comes up fine (rare).
- As above, but never reaches GRUB
- As above, reaches GRUB and then the kernel, but the kernel dies during initial boot somewhere around:
[ 9.841835] ipmi_si ipmi_si.0: Using irq 10
[ 9.842962] ------------[ cut here ]------------
On some servers, I've been able to recover from cases 3/4 via some combination of racadm's powerdown/up/cycle commands and racadm racreset. So far there are two machines I was not able to recover this way: cp3032 and cp3039. For those I've found a temporary workaround: edit the kernel command line from GRUB and add modprobe.blacklist=ipmi_si for that one boot only, which skips the module the kernel was loading when it died. They're back online for now.
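To confirm that the one-boot workaround actually took effect once a host is back up, a check along these lines could be used. It is only a rough sketch and was not run as part of this investigation; the function name cmdline_blacklists_ipmi_si is made up for illustration, and the only thing it relies on is the kernel's standard /proc/cmdline file.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): succeeds if the given
# kernel command line contains a modprobe.blacklist= entry that lists ipmi_si,
# either alone or as part of a comma-separated list of modules.
cmdline_blacklists_ipmi_si() {
    printf '%s\n' "$1" |
        grep -Eq '(^| )modprobe\.blacklist=([^ ]*,)?ipmi_si(,[^ ]*)?( |$)'
}

# Example use on a freshly booted host:
#   if cmdline_blacklists_ipmi_si "$(cat /proc/cmdline)"; then
#       echo "booted with ipmi_si blacklisted (temporary workaround active)"
#   else
#       echo "booted normally, ipmi_si not blacklisted"
#   fi
```

Since the parameter is only added by hand in GRUB for a single boot, the next normal reboot would drop it again, so this check would also show when a host has gone back to loading ipmi_si.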
I've checked one example server against one of its twins in eqiad (which works fine), and every page of BIOS/iDRAC settings looks identical. My feeling is that this is BMC/iDRAC-related, something to do with power management. I *suspect* the garbled text output with bad serial settings is some kind of follow-up to a firmware update or a firmware command (one that executes on reboot). It may be as simple as these servers needing a hard poweroff (power plugs removed) to reset from whatever bad state they're in.