
Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot
Closed, ResolvedPublic

Description

Per the IRC discussion, the sre.hardware.upgrade-firmware cookbook is a very welcome addition, and if you have ever tried to update the firmware manually, you will realize its value! So thank you to everyone who worked on it.

Currently, for a NIC firmware installation, the cookbook immediately reboots the host. It would be great to have a feature that defers the installation to the next reboot, similar to the web interface option "Install on reboot". As an example of why this is useful: we are trying to upgrade the cp hosts to bullseye. The d-i stalls if the NIC firmware is not updated, so we have to update the firmware before proceeding with the reimage.

The workflow currently looks like:

  • depool host, update the firmware by running the cookbook, host reboots
  • run the reimaging cookbook, the host reboots again

Ideally what it can look like:

  • update the firmware by running the cookbook, installation is scheduled for next reboot
  • run the reimaging cookbook, which reboots, updates the firmware and then proceeds to the installation

This is not urgent but it will be good to have.

Thank you!

Event Timeline

ssingh renamed this task from Feature request: sre.hardware.upgrade-firmware should allow option to defer firmware installation to next reboot to Feature request: sre.hardware.upgrade-firmware should allow option to defer NIC firmware installation to next reboot. Nov 23 2022, 4:08 PM
ssingh triaged this task as Medium priority.

Since we started reimaging the cp hosts to bullseye, this has come up again and I was looking at the source:

In sre/hardware/upgrade-firmware.py#L733:

self._ask_confirmation(
    f"{redfish_host.hostname}: About to reboot to apply update, please confirm"
)
if self.new:
    redfish_host.chassis_reset(ChassisResetPolicy.FORCE_RESTART)
else:
    self.spicerack.run_cookbook(
        "sre.hosts.reboot-single",
        [netbox_host.fqdn, "--reason", "bios upgrade"],
    )

Is it fine to skip this step? Put differently, I am wondering if this is equivalent to "install on next reboot" on the web management interface. If I capture the request on the web interface (NIC firmware update), I see:

<Repository><target>DCIM:INSTALLED#701__NIC.Slot.2-2-1</target><rebootType>0</rebootType></Repository>

I don't know much about this, but would 0 indicate no reboot? If so, and if the firmware upgrade cookbook could also skip the reboot (say, via an optional argument), would that be the same thing as "install on next reboot" and thus help solve the problem above?
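To make the question concrete, here is a minimal sketch of what an optional "skip the reboot" argument could look like. The names apply_update, upload, reboot and no_reboot are assumptions for illustration only and are not taken from the cookbook; whether the staged firmware is really applied on the next reboot depends on how the controller handles the pending job.

```python
from typing import Callable


def apply_update(
    upload: Callable[[], None],
    reboot: Callable[[], None],
    no_reboot: bool = False,
) -> str:
    """Upload a firmware image, then reboot unless the caller defers it.

    All names here are hypothetical; this is not the cookbook's code.
    """
    upload()
    if no_reboot:
        # Leave the update staged; a later reboot (e.g. the reimage) applies it.
        return "deferred"
    reboot()
    return "rebooted"
```

With no_reboot=True only the upload callback runs, so the reimage cookbook's own reboot would be the one that applies the firmware.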

Thanks!

@ssingh I have created a patch to defer reboots until all drivers have been uploaded. Can you let me know a host I can test on?

If I capture the request on the web interface ...

FYI, we are using the Redfish API, which is slightly different from the web interface.

Mentioned in SAL (#wikimedia-operations) [2023-01-26T12:41:40Z] <sukhe> depool cp3051.esams.wmnet for firmware update testing: T323717

Mentioned in SAL (#wikimedia-operations) [2023-01-26T12:42:50Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp3051.esams.wmnet with reason: T323717

Mentioned in SAL (#wikimedia-operations) [2023-01-26T12:43:05Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp3051.esams.wmnet with reason: T323717

Thanks @jbond! cp3051.esams.wmnet is depooled and downtimed for three hours and ready for you. I am sure you know but as a reminder for the R440:

https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices:

Broadcom NetExtremeE firmware for 10G nic should only upgrade to 21.85.21.92, as 22.00.07.60 breaks installer.
iDrac shouldn't upgrade to 6.00.00.00 (breaks https mgmt access), cap at 5.10.30.00.

This is what I have been following for the cp hosts and it has worked so far.

If I don't upgrade the iDRAC firmware, the NIC firmware fails to update for me, so I have been doing both. IIRC, the iDRAC update didn't ask me for a reboot, but I might be mistaken. (The NIC one certainly did.)

Mentioned in SAL (#wikimedia-operations) [2023-01-26T16:13:33Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp3051.esams.wmnet with reason: extending downtime: T323717

Mentioned in SAL (#wikimedia-operations) [2023-01-26T16:13:49Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp3051.esams.wmnet with reason: extending downtime: T323717

iDrac shouldn't upgrade to 6.00.00.00 (breaks https mgmt access), cap at 5.10.30.00.

FYI, it's safe to update to the most recent iDRAC version now. Can you update wherever this information is recorded?

This is from https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices: ... is it safe to do 6.x now?

Hi ssingh,

I have just tested this by trying to upgrade the BIOS and the NIC in a single run; however, I'm getting the following error message:

"A deployment or update operation is already in progress. Wait for the operation to conclude and then re-try."

This suggests it's not possible to stack updates, and they must be done one at a time.
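As a toy model of the behaviour reported above (not the Redfish API, and not the cookbook's code), the sketch below uses a hypothetical FakeController that accepts only one pending update at a time. Every class and function name in it is an assumption made for illustration; finish_update stands in for whatever actually completes the job, such as a reboot.

```python
from typing import Iterable, List


class UpdateInProgress(RuntimeError):
    """Raised by the stand-in controller when a second update is stacked."""


class FakeController:
    """Hypothetical stand-in for a management controller (one job at a time)."""

    def __init__(self) -> None:
        self.busy = False
        self.applied: List[str] = []

    def start_update(self, component: str) -> None:
        if self.busy:
            raise UpdateInProgress(
                "A deployment or update operation is already in progress."
            )
        self.busy = True
        self.applied.append(component)

    def finish_update(self) -> None:
        # Placeholder for the step that completes the pending job.
        self.busy = False


def update_one_at_a_time(controller: FakeController, components: Iterable[str]) -> List[str]:
    """Start each update only after the previous one has completed."""
    for component in components:
        controller.start_update(component)
        controller.finish_update()
    return controller.applied
```

Calling start_update twice without finish_update in between raises UpdateInProgress, which mirrors the error message; update_one_at_a_time avoids it by completing each job before starting the next.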

Thanks, I have updated the wiki page, and yes, it's safe to upgrade the iDRAC to 6.x and later now.

I see! Thanks for checking and looking into this. So this means that we have to reboot -- it's fine and not a big deal and certainly beats using the web interface!

Thanks @jbond for the patch and help! I can confirm that:

sudo cookbook -vvvv -c /home/jbond/cookbook.yaml sre.hardware.upgrade-firmware "cp3051.esams.wmnet" --no-reboot -c nic -c idrac

Worked for cp3051: the NIC firmware installation was deferred to the next reboot and it was installed successfully, helping us avoid the double reboot.
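For readers unfamiliar with the shape of that invocation, here is a rough argparse sketch of a host argument, a --no-reboot switch and a repeatable -c option. Only --no-reboot, -c, and the values nic and idrac come from the command above; the parser name, the long option --component and the help strings are assumptions, and this is not the cookbook's actual argument parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options used in the command above."""
    parser = argparse.ArgumentParser(prog="upgrade-firmware-sketch")
    parser.add_argument("host", help="FQDN of the host to update")
    parser.add_argument(
        "--no-reboot",
        action="store_true",
        help="stage the update and leave the reboot to a later step",
    )
    parser.add_argument(
        "-c",
        "--component",  # long name is hypothetical
        action="append",
        default=[],
        help="component to update; may be given more than once",
    )
    return parser


# Parse the same arguments that followed the cookbook name above.
args = build_parser().parse_args(
    ["cp3051.esams.wmnet", "--no-reboot", "-c", "nic", "-c", "idrac"]
)
```

After parsing, args.no_reboot is True and args.component holds the components in the order they were given.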

Thanks for your help, it's much appreciated!

BCornwall claimed this task.

Looks like this has been fixed, so I'll close.