SSD firmware update for an-coord100[3-4]
Closed, Resolved (Public)

Description

We would like to update the SSD firmware on these hosts, please.

  • an-coord1003.eqiad.wmnet
  • an-coord1004.eqiad.wmnet

We can start with an-coord1004, as this is currently operating only in a standby capacity.

We can then migrate the Hive metastore and Presto coordinator services to an-coord1004 in order to reduce the impact of downtime for an-coord1003.

an-coord1003:

  • schedule downtime for the host with the service owners and in Icinga
  • note the old firmware version:
      Disk 0 on Embedded AHCI Controller 1 	DL70
      Disk 1 on Embedded AHCI Controller 1 	DL70
  • run the firmware update cookbook from a cumin host: cookbook sre.hardware.upgrade-firmware -c ssd "hostname*" and select option 0 for DL7C. THIS WILL REQUIRE THE HOST TO REBOOT
  • confirm the firmware was updated to the correct version on all affected SSDs
  • pass the system back to the service owners for service use

an-coord1004:

  • schedule downtime for the host with the service owners and in Icinga
  • note the old firmware version
  • run the firmware update cookbook from a cumin host: cookbook sre.hardware.upgrade-firmware -c ssd "hostname*" and select option 0 for DL7C.
  • confirm the firmware was updated to the correct version on all affected SSDs
  • pass the system back to the service owners for service use
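
The "note old firmware version" and "confirm firmware updated" steps in the checklists above can be sketched as a small check. This is only an illustration, not part of the cookbook: it assumes the inventory is available as text lines in the form shown in the checklist, and the function names are hypothetical.

```python
import re

# Matches inventory lines such as "Disk 0 on Embedded AHCI Controller 1 \tDL70".
# The line format is taken from the checklist; anything else is ignored.
DISK_LINE = re.compile(r"^(Disk \d+ on .+?)\s+(\S+)$")


def parse_disk_versions(inventory: str) -> dict[str, str]:
    """Map each disk description to its reported firmware version."""
    versions = {}
    for line in inventory.splitlines():
        match = DISK_LINE.match(line.strip())
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def disks_needing_update(inventory: str, target: str = "DL7C") -> list[str]:
    """Return the disks whose firmware differs from the target version."""
    return [
        disk
        for disk, version in parse_disk_versions(inventory).items()
        if version.upper() != target.upper()
    ]


before = (
    "Disk 0 on Embedded AHCI Controller 1 \tDL70\n"
    "Disk 1 on Embedded AHCI Controller 1 \tDL70"
)
# Both disks are still on DL70, so both are listed.
print(disks_needing_update(before))
```

Running the same check after the reboot should return an empty list once every disk reports DL7C.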

Details

Due Date: Aug 29 2025, 12:00 AM

Event Timeline

@BTullis,

Thank you! I was looking for a volunteer host to test the first firmware updates on before I roll out documentation. I'll work on an-coord1004.eqiad.wmnet, as long as you can confirm it's OK for me to take it over and reboot it at will.

Hi @RobH - yes, that's fine. I just double-checked and an-coord1004 is definitely on standby at the moment.

It would need a DNS change to fail over any services to it, so it won't happen automatically. You should be good to go with the firmware update and can reboot it at will.

RobH triaged this task as Medium priority. May 16 2025, 7:22 PM
RobH updated the task description.

Updated the iDRAC firmware to 7, then updated the SSD firmware from DL70 to DL7C on both disks.

The update cookbook failed, so I did the update manually and found that the firmware update requires a reboot.

Next steps: get the cookbook working via T394543 before doing the rest. A reboot is required.

RobH updated the task description.
RobH set Due Date to Aug 29 2025, 12:00 AM.

@BTullis,

With the successful update of the cookbook, an-coord1004 can now be scheduled for downtime and updated. The downtime is about 15 minutes for the cookbook to run and the host to return to an OS-ready state. There is no data loss, and anyone can run the cookbook.

Please let me know if you would like to handle this on an-coord1004 directly or if you would like to schedule a downtime and have me run the command(s) (see checklist).

Overall, we're hoping to have all of these done by the end of August 2025 (so two months for a non-expedited rollout). If this timeline isn't feasible, please let us know. This is not an urgent Unbreak Now, but we would prefer to get this done before we start seeing SSD failures.

Hi @RobH - Sorry, I'm not 100% clear on which host you would like me to proceed with. The description at the top says an-coord1003 twice, but then it says that an-coord1004 is completed and T394499#10831197 states that you have done it manually.

Is it an-coord1003 that you would like me to attempt with the cookbook? I'm happy to have a go myself, but just wanted to clarify this before proceeding. Thanks.

RobH updated the task description.

Yeah... my bad, that was totally confusing due to typos and misparsing in my last update. an-coord1004 is completed; an-coord1003 still needs to be updated. I've corrected the task description. Sorry for the confusion!

Change #1160222 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Failover hive and presto to the standby coordinator

https://gerrit.wikimedia.org/r/1160222

Change #1160222 merged by Btullis:

[operations/dns@master] Failover hive and presto to the standby coordinator

https://gerrit.wikimedia.org/r/1160222

Icinga downtime and Alertmanager silence (ID=5ea14561-c44c-4cc5-b656-024e47b3bc03) set by btullis@cumin1003 for 1:00:00 on 1 host(s) and their services with reason: Upgrading SSD firmware

an-coord1003.eqiad.wmnet

Hi @RobH the cookbook failed for an-coord1003 with the following error:

btullis@cumin1003:~$ sudo cookbook sre.hardware.upgrade-firmware -c ssd an-coord1003.eqiad.wmnet
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-18 08:42:17.761492', 'owner': 'btullis@cumin1003 [2085773]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-coord1003.eqiad.wmnet
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-coord1003: {'concurrency': 1, 'created': '2025-06-18 08:42:17.825836', 'owner': 'btullis@cumin1003 [2085773]', 'ttl': 3600}
Management Password: 
an-coord1003.eqiad.wmnet (Gen 14): starting
an-coord1003.eqiad.wmnet (SSD): update
an-coord1003.eqiad.wmnet (SSD): current version: 1+dl70
poweredge-r440: picking DellDriverCategory.SSD update file
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-coord1003: {'concurrency': 1, 'created': '2025-06-18 08:42:17.825836', 'owner': 'btullis@cumin1003 [2085773]', 'ttl': 3600}
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
    raw_ret = runner.run()
              ^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1069, in run
    failures += self._run_host(hostname)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 1129, in _run_host
    if not self.update_ssd_driver(redfish_host, netbox_host):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 997, in update_ssd_driver
    target_version, job_id = self._update(
                             ^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 608, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 569, in _cached_select_firmwarefile
    return self._select_firmwarefile(*args, **kargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 550, in _select_firmwarefile
    return self.get_latest(product_slug, driver_type, driver_category)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 267, in get_latest
    raise NotImplementedError("SSD firmware fetch from DELL website not yet implemented")
NotImplementedError: SSD firmware fetch from DELL website not yet implemented
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2025-06-18 08:42:17.761492', 'owner': 'btullis@cumin1003 [2085773]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet

I'll have a quick look to see if I can work around this easily, but I won't do anything rash. Perhaps the --firmware-store option can help.
I'm happy to leave things in the failed-over state, where an-coord1004 is serving both hive and presto services and an-coord1003 is effectively idle.
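
For reference, the outcome of a run like the one above can be pulled out of the pasted log with a short script. This is a minimal sketch that relies only on the line formats visible in this log (the final "END (...)" line and the last traceback line); the names `CookbookResult` and `summarize_run` are hypothetical and not part of spicerack.

```python
import re
from typing import NamedTuple, Optional

# Final summary line, e.g.
# "END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) ..."
END_LINE = re.compile(
    r"^END \((?P<status>\w+)\) - Cookbook (?P<cookbook>\S+) \(exit_code=(?P<code>\d+)\)"
)
# Last line of a Python traceback, e.g. "NotImplementedError: message".
ERROR_LINE = re.compile(r"^\w+(Error|Exception): ")


class CookbookResult(NamedTuple):
    status: str
    cookbook: str
    exit_code: int
    error: Optional[str]


def summarize_run(log: str) -> Optional[CookbookResult]:
    """Return the status, exit code and last exception line of a cookbook log."""
    error = None
    for line in log.splitlines():
        if ERROR_LINE.match(line):
            error = line
        end = END_LINE.match(line)
        if end:
            return CookbookResult(
                end["status"], end["cookbook"], int(end["code"]), error
            )
    return None  # no END line found, so the log is incomplete


sample = """\
an-coord1003.eqiad.wmnet (SSD): current version: 1+dl70
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
    raise NotImplementedError("SSD firmware fetch from DELL website not yet implemented")
NotImplementedError: SSD firmware fetch from DELL website not yet implemented
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-coord1003.eqiad.wmnet
"""
result = summarize_run(sample)
print(result)
```

Here the summary shows that the run ended with status FAIL and exit code 99 because the SSD firmware fetch path raised NotImplementedError, which matches the traceback.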

Icinga downtime and Alertmanager silence (ID=49f11e46-f52a-4db8-a2cc-7688a3599023) set by btullis@cumin1003 for 1:00:00 on 1 host(s) and their services with reason: Upgrading SSD firmware

an-coord1003.eqiad.wmnet
BTullis updated the task description.