
Add an option to the reimage cookbook to also update firmware
Open, Medium, Public

Description

For a reimage, the server will be out of service anyway, and having the latest firmware will make reimages smoother (one of the old maps nodes could only PXE boot reliably after I had upgraded the years-old firmware).

We could add a new option "--update-firmware" to the reimage cookbook, which checks whether the host is a Dell (since the firmware cookbook doesn't support Supermicro yet) and runs the sre.hardware.upgrade-firmware cookbook before the actual reimage.
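A minimal sketch of how the proposed flag could gate the firmware step, in plain Python. This is not the reimage cookbook's actual code: `build_parser`, `plan_steps`, the `reimage-sketch` program name and the `"reimage"` step label are hypothetical names used only for illustration, and how the vendor is really detected is left out.

```python
import argparse

# Cookbook named in the task description.
FIRMWARE_COOKBOOK = "sre.hardware.upgrade-firmware"
# Per the description, the firmware cookbook doesn't support Supermicro yet.
SUPPORTED_VENDORS = {"dell"}


def build_parser():
    """Hypothetical argument parser with the proposed opt-in flag."""
    parser = argparse.ArgumentParser(prog="reimage-sketch")
    parser.add_argument(
        "--update-firmware",
        action="store_true",
        help="Update firmware before reimaging (supported vendors only).",
    )
    parser.add_argument("host")
    return parser


def plan_steps(vendor, update_firmware):
    """Return the ordered steps to run for a host of the given vendor."""
    steps = []
    if update_firmware and vendor.strip().lower() in SUPPORTED_VENDORS:
        steps.append(FIRMWARE_COOKBOOK)
    # "reimage" is a placeholder label for the actual reimage step.
    steps.append("reimage")
    return steps
```

With this shape, passing the flag for an unsupported vendor silently falls back to a plain reimage; whether that should instead warn or abort is an open design choice.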

(Opened after an IRC discussion with Manuel, adding him as subscriber)

Event Timeline

This comment was removed by Marostegui.
LSobanski triaged this task as Medium priority. Nov 24 2025, 3:52 PM

For a little bit more background: we most regularly encounter PXE boot failures due to the firmware version on hosts with Broadcom BCM57412 or BCM57414 NICs.

The known-good firmware version to use on these is 21.85.21.92 (Network_Firmware_RXP80_WN64_21.85.21.92.EXE).

If we are worried about a wider "blast radius" of changes to systems and want to be safer, I wonder whether it might make sense to identify systems that have this NIC and may be on the wrong firmware version, and only run the provision cookbook in that case? Or possibly run it anyway, but always use the -c nic flag to only adjust the NIC firmware?

Overall the proposal makes sense to me though. I guess we need to ensure the available firmware files are all "known good".
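To illustrate the "identify systems that have this NIC" idea, here is a rough sketch in Python. It assumes an inventory mapping of hostname to (NIC model, NIC firmware version) is already available from somewhere; that data source, the function names and the example hostnames are all hypothetical. It also assumes that anything older than the known-good version counts as "wrong", which may not match what we actually want.

```python
# Known-good NIC firmware version mentioned above.
KNOWN_GOOD = "21.85.21.92"
# NIC models that most regularly hit PXE boot failures.
AFFECTED_NICS = ("BCM57412", "BCM57414")


def parse_version(version):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def needs_nic_update(nic_model, firmware_version):
    """True if the NIC is an affected model running older-than-known-good firmware."""
    if not any(nic in nic_model for nic in AFFECTED_NICS):
        return False
    return parse_version(firmware_version) < parse_version(KNOWN_GOOD)


def audit(inventory):
    """Return the sorted hostnames whose NIC firmware should be updated.

    inventory maps hostname -> (nic_model, firmware_version).
    """
    return sorted(
        host
        for host, (nic_model, firmware) in inventory.items()
        if needs_nic_update(nic_model, firmware)
    )


# Made-up inventory for illustration only.
example_inventory = {
    "example1001": ("Broadcom BCM57414", "21.80.9.3"),
    "example1002": ("Broadcom BCM57412", "21.85.21.92"),
    "example1003": ("Intel X710", "1.0.0"),
}
hosts_to_update = audit(example_inventory)
```

The same check could feed either option discussed here: restricting the firmware run to flagged hosts, or driving a fleet-wide audit.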

Hey Moritz and Cathal,

Just wanted to add my .02 as someone who's been bitten a few times by the firmware stuff, including writing my own automation to update 80 hosts when I first started in 2022 for T312298.

Speaking as a cookbook user whose team owns about 20% of the fleet, I am not too keen on another non-default flag that's needed 100% of the time for a successful reimage. Sadly, I am absent-minded enough that I can learn about --no82 on Friday, tell my entire team about it, and forget about it completely by Monday. ;(

Admittedly, that's more on me, and y'all have been extremely helpful unblocking me while also working through some very complex networking issues.

But I wonder if y'all wouldn't get more mileage out of auditing for old firmware and proactively updating the fleet everywhere? I did that in T331297 and we didn't have to worry about a time bomb waiting for us at the next reimage. Or maybe just warn the user via a banner display (ref https://phabricator.wikimedia.org/T410751#11401347)?

I know DRACs/Redfish are pretty janky, and in my experience the firmware update cookbooks fail about half of the time. That's why I used Dell's OS-level firmware update scripts in my automation. I've never used it, but maybe Supermicro Update Manager would work better than Redfish?

If we could figure out how to reliably make config changes to DRACs at scale, we could respond more quickly to security issues or critical firmware updates (T394348).