
reimage cookbook failure due to ipmi settings
Closed, Resolved · Public

Description

Fairly often I have a reimage that 'works' but is diagnosed as a failure by the reimage cookbook due to ipmi settings.

I recognize this is meant to detect a case where the host would re-launch the installer on reboot, but I've never actually seen that happen: in every case PXE had been disabled properly, yet the script still declared a failure.

This is especially bad because the check is AFTER the initial puppet run, which means I don't know about the failure until I've waited an hour. My requests are:

  1. Check ipmi settings as soon as possible (rather than immediately before reboot)
  2. Make that check more forgiving of unexpected flags.

Here's the latest failure (on cloudnet1003, an HP server):

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 491, in run
    self.ipmi.check_bootparams()
  File "/usr/lib/python3/dist-packages/spicerack/ipmi.py", line 125, in check_bootparams
    raise IpmiCheckError(f"Expected BIOS boot params in {IPMI_SAFE_BOOT_PARAMS} got: {param}")
spicerack.ipmi.IpmiCheckError: Expected BIOS boot params in ('0000000000', '8000020000') got: 0004000000
**The reimage failed, see the cookbook logs for the details**
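
The rejected value can be decoded by hand to see which flags are set. Below is a minimal sketch of a more forgiving check, assuming the boot flags layout from the IPMI 2.0 specification (boot option parameter 5: first byte bit 7 is "flags valid", bit 6 is "persistent"; second byte bits 5:2 are the boot device selector, with 0001b meaning Force PXE). The names `BootFlags`, `decode_boot_flags` and `is_safe_to_reboot` are hypothetical and are not part of spicerack. Whether a given BIOS really ignores the device selector when the valid bit is clear is an assumption that would need to be confirmed on the hardware.

```python
from dataclasses import dataclass

# Boot device selector values (second data byte, bits 5:2), per IPMI 2.0.
NO_OVERRIDE = 0b0000
FORCE_PXE = 0b0001


@dataclass(frozen=True)
class BootFlags:
    """Decoded view of the first two bytes of IPMI boot parameter 5 (hypothetical helper)."""

    valid: bool       # first byte, bit 7: the flags are marked valid
    persistent: bool  # first byte, bit 6: the override is persistent
    device: int       # second byte, bits 5:2: boot device selector


def decode_boot_flags(param: str) -> BootFlags:
    """Decode the 10-hex-digit string quoted in the IpmiCheckError message."""
    raw = bytes.fromhex(param)
    if len(raw) != 5:
        raise ValueError(f"Expected 5 bytes of boot flags, got {len(raw)}: {param}")
    return BootFlags(
        valid=bool(raw[0] & 0x80),
        persistent=bool(raw[0] & 0x40),
        device=(raw[1] >> 2) & 0x0F,
    )


def is_safe_to_reboot(param: str) -> bool:
    """Assumption: an override only matters when the valid bit is set.

    Other bits (console redirection, verbosity, ...) are deliberately ignored.
    """
    flags = decode_boot_flags(param)
    return (not flags.valid) or flags.device == NO_OVERRIDE


if __name__ == "__main__":
    for value in ("0000000000", "8000020000", "0004000000"):
        print(value, decode_boot_flags(value), is_safe_to_reboot(value))
```

Under these assumptions, `0004000000` decodes to a Force PXE device selector with the valid bit cleared, so this looser check would accept it, while `8004000000` (Force PXE with the valid bit set) would still be rejected.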

Event Timeline

Restricted Application added a subscriber: Aklapper.

This might be HP-specific behaviour; it remains to be investigated whether these hosts fail to reset the Force PXE bit on reboot as they normally should. If confirmed, we can add a reset of the Force PXE bit to the cookbook (or we can do it anyway).

Change 774926 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] ipmi: add remove_boot_override, improve force_pxe

https://gerrit.wikimedia.org/r/774926

Change 774927 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.reimage: call Ipmi.remove_boot_override

https://gerrit.wikimedia.org/r/774927

Change 774926 merged by jenkins-bot:

[operations/software/spicerack@master] ipmi: add remove_boot_override, improve force_pxe

https://gerrit.wikimedia.org/r/774926

Change 774927 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.reimage: call Ipmi.remove_boot_override

https://gerrit.wikimedia.org/r/774927

Volans claimed this task.

With the above patches merged, the problem should not happen anymore. If it does, please re-open the task; I'm boldly resolving it for now.