SSD firmware update not working in firmware cookbook
Closed, Resolved (Public)

Description

When attempting to use the sre.hardware.upgrade-firmware cookbook, I receive the error: an-coord1004.eqiad.wmnet: skipping DellDriverCategory.STORAGE as no member

The full output of the terminal commands is below. There is no prompt for which version to use; it seems the cookbook needs an update for DellDriverCategory.STORAGE, and I'm not certain where to do that.

robh@cumin1002:/srv/firmware/poweredge-r440/STORAGE$ ls
Serial-ATA_Firmware_6TG3F_WN32_TT03_A00.EXE  Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE
robh@cumin1002:/srv/firmware/poweredge-r440/STORAGE$ sudo cookbook sre.hardware.upgrade-firmware --component storage an-coord1004*
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-05-16 19:06:05.728153', 'owner': 'robh@cumin1002 [2897064]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1004.eqiad.wmnet
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-coord1004: {'concurrency': 1, 'created': '2025-05-16 19:06:05.795246', 'owner': 'robh@cumin1002 [2897064]', 'ttl': 3600}
Management Password: 
an-coord1004.eqiad.wmnet (Gen 14): starting
an-coord1004.eqiad.wmnet: skipping DellDriverCategory.STORAGE as no member
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-coord1004: {'concurrency': 1, 'created': '2025-05-16 19:06:05.795246', 'owner': 'robh@cumin1002 [2897064]', 'ttl': 3600}
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-05-16 19:06:05.728153', 'owner': 'robh@cumin1002 [2897064]', 'ttl': 1800}
END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts an-coord1004.eqiad.wmnet
robh@cumin1002:/srv/firmware/poweredge-r440/STORAGE$ ls
Serial-ATA_Firmware_6TG3F_WN32_TT03_A00.EXE  Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE

Event Timeline

@Volans,

Would you be the best person to help resolve this issue with the firmware script? Please see the task description. We have to push SSD firmware updates to 24 hosts, so I want to avoid doing this manually via the HTTPS iDRAC interface.

Please advise,

@RobH I can have a look next week. I think we've never supported/tested firmware updates for individual SSD disks. The storage option is for RAID controllers and AFAICS SSDs have a different identifier.
If that's the case let me know if I can use an-coord1004 for testing and which firmware I should use.

@bking is going to free up a cirrussearch host for us via T394432. Also he points out he has leveraged https://gitlab.wikimedia.org/repos/search-platform/sre/stage-firmware-update/-/tree/main in the past to push updates to multiple hosts at a time.

I'm not sure if this would be a better solution, as it may allow for update of firmware without reboot. If so, that would eliminate downtime. I'll defer to how automation thinks best. If it can happen without downtime though, it makes the update of the 24 affected hosts much easier!

@RobH looks like cirrussearch2110.codfw.wmnet will work for testing.

I've banned, depooled and downtimed the host for 2 days. I can extend the downtime further if necessary, or feel free to run `sudo cookbook sre.hosts.downtime 'cirrussearch2110*' -D2 -t T394432 -r 'firmware update'` whenever you are ready to operate.

Re: the above script, it only works with Dell hosts. I don't know anything about Supermicro firmware, but this blog post by one of my former co-workers mentions Supermicro SUM utility. Assuming the info is still current, that might be an easier way to do the firmware updates.

@bking is going to free up a cirrussearch host for us via T394432. Also he points out he has leveraged https://gitlab.wikimedia.org/repos/search-platform/sre/stage-firmware-update/-/tree/main in the past to push updates to multiple hosts at a time.

This is the first time I've heard about that repository. Was the sre.hardware.upgrade-firmware cookbook, which already existed at the time, not sufficient for the upgrade? It has always supported multiple target hosts as a cumin query. What was the unsupported use case? Was SRE I/F consulted or at least informed of this separate approach?
I see that the repo also performs changes to iDRAC settings. We have a pretty standardized way of configuring BIOS and BMCs via the sre.hosts.provision cookbook and we should refrain from performing manual and/or ad-hoc changes in other ways. The assumption is that you can always run the provision cookbook to ensure the host has the correct settings.
If any particular setting is needed for special cases, it should be brought up and included in the provision cookbook.

@RobH Which of the two firmware files should I use?

robh@cumin1002:/srv/firmware/poweredge-r440/STORAGE$ ls
Serial-ATA_Firmware_6TG3F_WN32_TT03_A00.EXE  Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE
Volans triaged this task as Medium priority. May 20 2025, 1:19 PM
Volans added a project: SRE-tools.

Sorry, I meant to update this task with that info sooner!

Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE

The DL7C version.

Not to get too far off topic, but let me contextualize this.

This is the first time I've heard about that repository. Was the sre.hardware.upgrade-firmware cookbook, which already existed at the time, not sufficient for the upgrade?

I wrote this playbook back in 2022 (before a firmware update cookbook existed) and shared it with IF. I haven't used it since then because I haven't needed to update firmware at scale. You're on an email thread called Re: Request for NIC firmware update advice that is dated 14 July. Exact quote:

I used Dell's BIN script to stage the NIC firmware upgrades on our Debian 9 hosts and it worked really well. It has guardrails against installing an older firmware version, and can be run in interactive or silent mode. Overall, I'm guessing it's easier to install firmware in a Linux environment as opposed directly from the DRACs, which tend to be rather fiddly in my experience. Obviously, it won't work on brand-new servers without an OS, but they're much less likely to need firmware updates. [...] Maybe it's worth auditing the fleet, since none of those boxes will reimage until that's fixed?

I wrote it because I had to reimage ~80 hosts and I needed a way to update the firmware at scale, otherwise PXE would not work (T312298). Looking back, we had been aware of this issue since at least 2021 (T286722). I wrote the script and emailed IF about it in 2022. I was surprised to see the issue resurface in 2024 (T374924). I did a little digging and found many instances of different teams re-discovering this problem (T312298, T308106, T374924, T350179, T286722). @elukey did some great work in T363576 to find the root cause and create a flag for this. Unfortunately, it was not a default flag and so I raised T378835 to get this changed.

What I would hope to see is a more proactive approach. Which brings us back to this ticket...maybe we should have already had a plan and automation for storage updates? I don't see what's wrong with using the vendor-provided BIN scripts to update the firmware, especially if it's truly an emergency, but I'll leave that up to y'all.

I try my best to raise tickets for IF and contribute to cookbooks when possible and will continue to do so. Thanks for your patience and if I can do anything to help make this process easier for end users, please let me know.

@bking Yeah, no need to go offtopic for something almost 3y old. I had indeed forgotten about the Re: Request for NIC firmware update advice email thread, sorry. But unless I'm missing something, I don't see in there any mention of a parallel, separate approach in a GitLab repository that doesn't use cookbooks. And at the time the firmware cookbook was almost ready, from what I gather from that email thread. I don't recall the details, and you have possibly chatted with John about that more than with me, given that he was the one working on that project. Anyway, let's look at the future :)

Today I've run some tests on a patched version of the firmware cookbook and I was able to upgrade the firmware on the provided test host (cirrussearch2110). As far as Redfish shows, both SSDs are on 'Revision': '7CV1DL7C' and Rob just confirmed it shows version DL7C on the web UI.
I'm confident I can make a final version by Monday/Tuesday; the only small request is to get another test host by then if possible.
A reboot is necessary AFAICT, but just one per host (probably one per controller, but almost all hosts have 1 controller).

If possible, let me know which host is next in line that I can use for the final test when I'm ready on Monday/Tuesday. Thanks in advance.

Mentioned in SAL (#wikimedia-operations) [2025-05-23T21:56:35Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2110*,cirrussearch2111* for T394543 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-05-23T21:56:39Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2110*,cirrussearch2111* for T394543 - bking@cumin2002

Hey Volans, I flipped the script a bit to make it a bit more readable...


cirrussearch2111 is banned and ready for you. I'll be out Monday, but feel free to ping me Tuesday if you need anything.

@bking Yeah, no need to go offtopic for something almost 3y old. I had indeed forgotten about the Re: Request for NIC firmware update advice email thread, sorry

No worries, I just wanted to provide some additional context for my boss and anyone else who's reading ;)

But unless I'm missing something I don't see in there any mention of a parallel separate approach on a gitlab repository not using cookbooks.

I do think you are missing something. Specifically the Dell-provided BIN scripts, which is the interesting part of the Ansible script, not to mention that email from many moons ago. These BIN scripts have been the default way to update firmware on Linux on Dell for 20 or so years now. They're extremely battle-tested and they handle version detection, rollback, etc in a safe manner. If you follow the exact directions in the parent ticket:

While Dell always recommends applying the latest available firmware update for your platform, if your organization's change management policy requires not installing the latest release, the minimum recommended firmware version of DL7A can be located below.

Go to https://www.dell.com/support

In the search field input the driver version “DL7A” or the driver code “9RF8X”.

In the search results click on “Intel Youngsville-RR DL7A for model numbers SSDSC2KB076TZR, SSDSC2KB038TZR, SSDSC2KB019TZR, SSDSC2KB960GZR, SSDSC2KB480GZR, SSDSCKKB480GZR, SSDSC2KB240GZR, SSDSCKKB240GZR, SSDSC2KG038TZR, SSDSC2KG019TZR, SSDSC2KG960GZR, SSDSC2KG480GZR | Driver Details”.

Assuming you started from Netbox, you'll end up here.

Is this the first you've heard of them? Whether we decide to use them or not, it's pretty important for someone who writes firmware automation for Linux servers to know they exist. (I know they say Red Hat, but there's also an Ubuntu version and I've never had a problem running any of the scripts on Debian).

It's good to have options that don't require high-level engineers to write code, something to keep in mind next time a vendor releases an emergency firmware update and we don't find out about it for 2 1/2 months ;(. Anyone with root access can test the firmware on a couple of hosts, and then we can stage them without bugging you.

I've uploaded the BIN script to my homedir on cirrussearch2115 if you want to run it and see the output. It'll be a no-op since I already applied it. Feel free to try it on the cirrussearch211[2-4] hosts, but don't reboot without pinging me or @RKemper.

Good luck!

Change #1150728 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hardware.upgrade-firmware: add support for SSD

https://gerrit.wikimedia.org/r/1150728

Thanks Brian, I've upgraded the firmware of cirrussearch2111 with the above patch; it's all back to you.
The only thing that didn't work was the check of the job result, because the job was no longer there:

GET https://10.193.3.47/redfish/v1/TaskService/Tasks/JID_482912952574 returned HTTP 404

I'll investigate more tomorrow, but we're close to having it working.

@bking as agreed on IRC, let me know when another 1~2 hosts are ready for testing so we can complete the change for the cookbook and let everyone upgrade SSD firmware when needed.

Mentioned in SAL (#wikimedia-operations) [2025-05-29T16:17:28Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2112*,cirrussearch2113* for T394543 - bking@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-05-29T16:17:43Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2112*,cirrussearch2113* for T394543 - bking@cumin2002

@Volans , the hosts cirrussearch211[2-3] are ready for your use. I've set a downtime for the next 7 days. Hit me up when you're finished so I can unban/remove downtime.

@bking great, thanks a lot. I've already done cirrussearch2112 with my latest version of the patch. I'll do cirrussearch2113 on Monday with hopefully the final version. If you prefer to have cirrussearch2113 in the pool during the weekend, feel free to do so and we can ban it again on Monday.

Good news: upgrading with the controller URI instead of the first disk's URI updates all the disks with just one job. There is no problem of job deletion, and the cookbook was able to track the job properly.
I just had to tweak the cookbook a bit so it can update one URI and then check the version on all disks, which live on different URIs. I still need to polish this part of the code to have it ready for prime time. Almost there.
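
For readers following along, the "update one URI, then check every disk" step could be sketched roughly as below. This is not the cookbook's code: the function name, the dictionary shape and the drive URIs are hypothetical, and matching on the end of the `Revision` string is an assumption based on the values quoted in this task.

```python
def drives_not_at_target(revisions: dict[str, str], target: str) -> list[str]:
    """Return the drive URIs whose reported revision does not end with target.

    revisions maps a (hypothetical) drive URI to its reported 'Revision' string.
    """
    wanted = target.upper()
    return sorted(
        uri for uri, revision in revisions.items()
        if not revision.upper().endswith(wanted)
    )


# Hypothetical data: one drive already updated, one still on the old firmware.
reported = {
    "/hypothetical/controller/Drives/0": "7CV1DL7C",
    "/hypothetical/controller/Drives/1": "7CV1DL74",
}
stale = drives_not_at_target(reported, "DL7C")
print(stale)
```

An empty list would mean that every disk behind the controller reports the target firmware after the single job.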

Thank you for all the work on this and polishing it up for general SRE use!

Current plan:

  • @Volans updates the cookbook to a state they feel is ready for general use to roll SSD firmware updates
  • @RobH rolls an additional update to a server and documents the process in a task update
  • @RobH updates the linked sub-tasks so each service group/SRE sub-team can choose between Rob applying the updates at a time they schedule, or rolling the updates themselves with the provided directions
  • @RobH tracks the overall implementation of updates via this parent task and moves along any straggling hosts.

@bking @RKemper I'm ready with the final test for https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1150728
I see that cirrussearch2113 is pooled so I didn't proceed with it. LMK when a test host will be available, thanks in advance.

Mentioned in SAL (#wikimedia-operations) [2025-06-06T07:52:18Z] <ryankemper@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: T394543

Thanks @RKemper for the depool, I've performed the final run with the current PS in gerrit with test-cookbook for cirrussearch2113. All good.

@RobH FYI given that the SSD firmware version is not numerical and the number of chars is different between the reported version and the firmware file, I did a bit of mangling there.

So for a version reported by Redfish as 'Revision': '7CV1DL74' and a filename like Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE, I generate two Version objects of the form: 1+dl74 and 1+dl7c. The 1+ is there just because it makes it so that the version library is capable of comparing them.

This means that you will see things like this in the logs:

cirrussearch2113.codfw.wmnet (SSD): target_version: 1+dl7c, current_version: 1+dl74
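
A minimal sketch of that mangling, for illustration only. The function names are hypothetical, and two things are assumptions inferred from the examples above rather than taken from the merged patch: that the comparable part is the last four characters of the Redfish `Revision`, and that it is the four-character token before the `_A00.EXE` suffix in the filename. The plain string comparison is a stand-in for the version library the cookbook uses.

```python
import re


def version_from_revision(revision: str) -> str:
    # Assumption: the last four characters of 'Revision' carry the firmware version.
    return "1+" + revision[-4:].lower()


def version_from_filename(filename: str) -> str:
    # Assumption: the version is the four-character token right before _Axx.EXE.
    match = re.search(r"_([A-Z0-9]{4})_A\d{2}\.EXE$", filename)
    if match is None:
        raise ValueError(f"cannot find a version in {filename!r}")
    return "1+" + match.group(1).lower()


def needs_upgrade(current: str, target: str) -> bool:
    # Stand-in for the version library: compare the part after '1+' as strings.
    return current.split("+", 1)[1] < target.split("+", 1)[1]


current = version_from_revision("7CV1DL74")
target = version_from_filename("Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE")
print(f"target_version: {target}, current_version: {current}")
print("upgrade needed:", needs_upgrade(current, target))
```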

Luca is out in the next few days so I've just merged the change to unblock you and we can refine it when Luca is back.
@RobH you can proceed :)

Change #1150728 merged by jenkins-bot:

[operations/cookbooks@master] sre.hardware.upgrade-firmware: add support for SSD

https://gerrit.wikimedia.org/r/1150728

Forgot to mention, this is what I used to upgrade just the SSD firmware:

cookbook sre.hardware.upgrade-firmware -c ssd "cirrussearch2113.*"

Management Password: 
db1253.eqiad.wmnet (Gen 15): starting
db1253.eqiad.wmnet (SSD): update
db1253.eqiad.wmnet (SSD): current version: 1+dl7a
poweredge-r650xs: picking DellDriverCategory.SSD update file
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:db1253: {'concurrency': 1, 'created': '2025-06-12 15:14:50.471125', 'owner': 'robh@cumin2002 [412372]', 'ttl': 3600}
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1069, in run
    failures += self._run_host(hostname)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1129, in _run_host
    if not self.update_ssd_driver(redfish_host, netbox_host):
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 997, in update_ssd_driver
    target_version, job_id = self._update(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 608, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 569, in _cached_select_firmwarefile
    return self._select_firmwarefile(*args, **kargs)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 550, in _select_firmwarefile
    return self.get_latest(product_slug, driver_type, driver_category)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 267, in get_latest
    raise NotImplementedError("SSD firmware fetch from DELL website not yet implemented")
NotImplementedError: SSD firmware fetch from DELL website not yet implemented
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-12 15:14:50.081070', 'owner': 'robh@cumin2002 [412372]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet

Not sure what I'm doing wrong:

robh@cumin2002:~$ sudo cookbook sre.hardware.upgrade-firmware -c ssd "db1253.*"
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-12 15:29:29.884984', 'owner': 'robh@cumin2002 [424266]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts db1253.eqiad.wmnet
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:db1253: {'concurrency': 1, 'created': '2025-06-12 15:29:30.279506', 'owner': 'robh@cumin2002 [424266]', 'ttl': 3600}
Management Password: 
db1253.eqiad.wmnet (Gen 15): starting
db1253.eqiad.wmnet (SSD): update
db1253.eqiad.wmnet (SSD): current version: 1+dl7a
poweredge-r650xs: picking DellDriverCategory.SSD update file
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:db1253: {'concurrency': 1, 'created': '2025-06-12 15:29:30.279506', 'owner': 'robh@cumin2002 [424266]', 'ttl': 3600}
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1069, in run
    failures += self._run_host(hostname)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1129, in _run_host
    if not self.update_ssd_driver(redfish_host, netbox_host):
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 997, in update_ssd_driver
    target_version, job_id = self._update(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 608, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 569, in _cached_select_firmwarefile
    return self._select_firmwarefile(*args, **kargs)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 550, in _select_firmwarefile
    return self.get_latest(product_slug, driver_type, driver_category)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 267, in get_latest
    raise NotImplementedError("SSD firmware fetch from DELL website not yet implemented")
NotImplementedError: SSD firmware fetch from DELL website not yet implemented
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-12 15:29:29.884984', 'owner': 'robh@cumin2002 [424266]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts db1253.eqiad.wmnet
robh@cumin2002:~$ ls -lah /srv/firmware/poweredge-r650xs/STORAGE/
total 23M
drwxr-sr-x 2 root datacenter-ops 4.0K Jun 12 15:24 .
drwxr-sr-x 6 root datacenter-ops 4.0K Jun 12 15:24 ..
-rw-r--r-- 1 robh wikidev         23M Jun 12 15:20 Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE
robh@cumin2002:~$

Bah, fixed: the file had to be in the SSD directory, not STORAGE. Thanks Riccardo!

Hello, just to let you know, I'm now trying the same operation on an-coord1003 T394499#10927000 and getting the same error as @RobH above: T394543#10909802

sudo cookbook sre.hardware.upgrade-firmware -c ssd an-coord1003.eqiad.wmnet

<snip>

an-coord1003.eqiad.wmnet (SSD): current version: 1+dl70
poweredge-r440: picking DellDriverCategory.SSD update file

<snip>

NotImplementedError: SSD firmware fetch from DELL website not yet implemented

The firmware file appears to be present in the correct directory.

btullis@cumin1003:~$ ls -lah /srv/firmware/poweredge-r440/STORAGE/
total 33M
drwxrwxr-x 2 root datacenter-ops 4.0K May 16 19:05 .
drwxr-xr-x 7 root datacenter-ops 4.0K Feb 13  2023 ..
-rw-r--r-- 1 root root           9.8M Nov 18  2022 Serial-ATA_Firmware_6TG3F_WN32_TT03_A00.EXE
-rw-r--r-- 1 robh wikidev         23M May 16 18:37 Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE

I tried the same operation on cumin1002 and cumin1003, just in case it was something to do with the sync of the /srv/firmware files between cumin hosts, but there was no difference in the error.

@Volans if you would like to have a look and see if you can figure it out, you'd be very welcome. Or I'm happy to try anything out, if you have suggestions.

an-coord1003 is currently in standby mode, with no automatic failback, so there is no urgency.
We can extend the current downtime and reboot it whenever we like. Thanks.

@BTullis the SSD upgrade is a type of its own, not STORAGE, so the files must be in /srv/firmware/poweredge-r440/SSD. If you use that path it should just work.

Yes we currently don't have an official sync between the cumin hosts. The problem here was that we couldn't use puppet's volatile in order to let dcops manage the files autonomously. We probably need to create some sort of 3-way sync that doesn't conflict much. I'm saying 3 way because while most of the time we just have 2 cumin hosts we do have 3 when upgrading them (like right now) and in this particular instance the old one (cumin1002) will stay around for a while due to some software not yet ready for bookworm.

Let me open a task so we don't lose track of it.
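
As a quick way to see which directory actually holds the files on a given cumin host, something like the sketch below could be run. It assumes the `/srv/firmware/<product>/<CATEGORY>` layout seen in the paths in this task; the helper name is hypothetical and it is not part of the cookbook.

```python
from pathlib import Path


def firmware_files(base: Path, product_slug: str, category: str) -> list[Path]:
    """List the files under base/<product_slug>/<category>, or [] if it is missing."""
    directory = base / product_slug / category
    if not directory.is_dir():
        return []
    return sorted(path for path in directory.iterdir() if path.is_file())


# Compare the SSD and STORAGE directories for one product.
base = Path("/srv/firmware")
for category in ("SSD", "STORAGE"):
    found = firmware_files(base, "poweredge-r440", category)
    print(category, [path.name for path in found])
```

If the SSD list is empty while STORAGE contains the Serial-ATA firmware file, the file is in the wrong place for `-c ssd`.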

Got it! Many thanks. Yes, that appears to have worked now.
For some reason the cookbook exited with a return code of 1, but I think that the firmware update stuck.

Successful Puppet run found
Deleted silence ID 1088cf82-5014-45e4-8881-3a61dc845d0c
Released lock for key /spicerack/locks/cookbooks/sre.hosts.reboot-single:an-coord1003: {'concurrency': 1, 'created': '2025-06-18 10:43:25.990897', 'owner': 'btullis@cumin1003 [2099595]', 'ttl': 600}
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1003.eqiad.wmnet
[IDRAC.2.7.PR19] Job completed successfully.
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-coord1003: {'concurrency': 1, 'created': '2025-06-18 10:40:49.946448', 'owner': 'btullis@cumin1003 [2099595]', 'ttl': 3600}
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-18 10:40:49.880362', 'owner': 'btullis@cumin1003 [2099595]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts an-coord1003.eqiad.wmnet

I'm not going to look into this too much now, but thought it might be interesting.

The cookbook exited with that code because it had a failure, but unfortunately it was missing a useful logging message at the right point. I'm adding it in this patch.
Manually checking the disks, they both show the correct version, so I'm not 100% sure what happened there, but it could be that they were not (yet?) reporting the new version. If you try to re-run it, it does tell you there is nothing to upgrade, right?

If you try to re-run it, it does tell you there is nothing to upgrade, right?

I can confirm that the new version is correctly shown. It still shows me a list of available versions and offers me an option to select it, but that's fine.

btullis@cumin1003:~$ sudo cookbook sre.hardware.upgrade-firmware -c ssd an-coord1003.eqiad.wmnet
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-18 15:03:08.129211', 'owner': 'btullis@cumin1003 [2125299]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-coord1003: {'concurrency': 1, 'created': '2025-06-18 15:03:08.192861', 'owner': 'btullis@cumin1003 [2125299]', 'ttl': 3600}
Management Password: 
an-coord1003.eqiad.wmnet (Gen 14): starting
an-coord1003.eqiad.wmnet (SSD): update
an-coord1003.eqiad.wmnet (SSD): current version: 1+dl7c
poweredge-r440: picking DellDriverCategory.SSD update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r440/SSD/Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE
1: /srv/firmware/poweredge-r440/SSD/Serial-ATA_Firmware_6TG3F_WN32_TT03_A00.EXE
2: Download new file
==> Please select the entry you want
> ^C==> Invalid response. Please type one of: 0,1,2. After 3 wrong answers the task will be aborted.
> ^C==> Invalid response. Please type one of: 0,1,2. After 3 wrong answers the task will be aborted.
> ^C==> Invalid response. Please type one of: 0,1,2. After 3 wrong answers the task will be aborted.

Thanks again.

Yes, if you pick the same version (option 0 above) it would just tell you that there is nothing to do because the disks are already at that version. Thanks for checking.