
sre.hardware.upgrade-firmware cookbook: product slug parsing
Open, Low, Public

Description

Most of the Netbox device types for the PowerEdge R440s parse to a valid product slug, but a few (11, to be precise) do not. The product slug parsing expects to use either the device_type as-is, or the type after stripping away -config....

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hardware/upgrade-firmware.py#222

Some do not match:

https://netbox.wikimedia.org/dcim/device-types/?q=PowerEdge%20R440

I'm not sure whether the Right Answer™ is to fix the device types to conform to convention, or to make the product slug parsing more sophisticated.
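To make the expectation described above concrete, here is a minimal sketch of the two-step lookup: try the device type slug as-is, then try it with a `-config` suffix stripped. This is not the cookbook's actual code (see the Gerrit link above for that). The function name `candidate_product_slugs` and the example slugs are hypothetical, and the slugs assume Netbox lowercases and hyphenates the device type name.

```python
def candidate_product_slugs(device_type_slug: str) -> list[str]:
    """Return the product slugs to try, in order of preference.

    Hypothetical helper: first the device type slug unchanged, then the
    part before the first "-config", if there is one.
    """
    candidates = [device_type_slug]
    base, separator, _suffix = device_type_slug.partition("-config")
    # Only add the stripped form when a "-config" suffix was actually found
    # and something is left in front of it.
    if separator and base:
        candidates.append(base)
    return candidates


# A conforming slug yields a usable fallback; a non-conforming one does not.
print(candidate_product_slugs("poweredge-r440-configfundraising-202107"))
print(candidate_product_slugs("poweredge-r440-fundraising-202107"))
```

Under this sketch, a device type whose slug carries the configuration name without the `-config` marker has no fallback, which is one way a lookup like this can end up with a slug the vendor does not recognise.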

Event Timeline

Volans added subscribers: wiki_willy, Volans.

IMHO we should stick to the agreed format in T284614#7214588 and T284614#7222919, and rename (and re-slug) the 3 non-matching ones to the format PowerEdge R440 - ConfigFundraising 202107, and so on. @wiki_willy what do you think?

In addition, I noticed that not all of them follow the same format for the name, and some have the correct slug only by luck; see: https://netbox.wikimedia.org/dcim/device-types/?q=config
@wiki_willy would it be OK with you if I go and fix the naming format in Netbox (adding the missing dashes, etc.)?

Lastly, we could add a validator to ensure that all newly created device types follow a given standard, so that incorrect names cannot be inserted.
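As a rough illustration of what such a validator might check, here is a minimal sketch. It assumes the convention is `<Model> - Config<Name> <YYYYMM>`, which is inferred only from the single example name in this thread; the authoritative format is the one agreed in T284614. The names `CONFIG_NAME` and `validate_device_type_name` are hypothetical, and wiring the check into Netbox is not shown.

```python
import re

# Assumed convention, inferred from "PowerEdge R440 - ConfigFundraising 202107":
#   "<Model> - Config<Name> <YYYYMM>"
CONFIG_NAME = re.compile(
    r"(?P<model>\S(?:.*\S)?) - Config(?P<config>[A-Za-z0-9]+) (?P<date>\d{6})"
)


def validate_device_type_name(name: str) -> list[str]:
    """Return a list of problems with a device type name (empty if none).

    Names that do not mention a config at all are treated as plain
    models and accepted unchanged.
    """
    if "config" not in name.lower():
        return []
    if CONFIG_NAME.fullmatch(name):
        return []
    return [f"{name!r} does not match '<Model> - Config<Name> <YYYYMM>'"]


# The second name is missing the " - " separator and is reported.
for candidate in (
    "PowerEdge R440 - ConfigFundraising 202107",
    "PowerEdge R440 ConfigFundraising 202107",
):
    print(candidate, "->", validate_device_type_name(candidate) or "ok")
```

Returning a list of messages rather than raising lets the same check be reused both for rejecting new entries and for auditing existing ones.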

Change 964890 had a related patch set uploaded (by Jbond; author: jbond):

[operations/cookbooks@master] sre.hardware.upgrade-cookbook: check we get drivers from dell

https://gerrit.wikimedia.org/r/964890

Change 964890 merged by Jbond:

[operations/cookbooks@master] sre.hardware.upgrade-cookbook: check we get drivers from dell

https://gerrit.wikimedia.org/r/964890

My two cents is to fix the issues so that we can stick to the original standard. I agree with Volans.

Given no objections, I went ahead and fixed ALL names and slugs to adhere to the standard. Triaging as low and leaving the task open to add a validator later.

I've just had a failure to update firmware for a host, and a brief search led me to this issue.
The error came from an-worker1168, where the cookbook ended up with a product slug of power and could not find any drivers for it:

btullis@cumin1002:~$ sudo cookbook sre.hardware.upgrade-firmware an-worker1168.eqiad.wmnet
Acquired lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2024-03-22 10:59:44.025830', 'owner': 'btullis@cumin1002 [2465965]', 'ttl': 1800}
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet
Acquired lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-worker1168: {'concurrency': 1, 'created': '2024-03-22 10:59:44.087541', 'owner': 'btullis@cumin1002 [2465965]', 'ttl': 3600}
Management Password: 
an-worker1168.eqiad.wmnet (Gen 15): starting
an-worker1168.eqiad.wmnet (IDRAC): update
an-worker1168.eqiad.wmnet (IDRAC): current version: 7.0.30.0
power: picking DellDriverCategory.IDRAC update file
Released lock for key /spicerack/locks/custom/sre.hardware.upgrade-firmware:an-worker1168: {'concurrency': 1, 'created': '2024-03-22 10:59:44.087541', 'owner': 'btullis@cumin1002 [2465965]', 'ttl': 3600}
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 968, in run
    failures += self._run_host(hostname)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 998, in _run_host
    self.update_idrac(redfish_host, netbox_host)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 693, in update_idrac
    target_version, job_id = self._update(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 596, in _update
    target_version, firmware_file = getattr(self, select_firmwarefile)(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 557, in _cached_select_firmwarefile
    return self._select_firmwarefile(*args, **kargs)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 538, in _select_firmwarefile
    return self.get_latest(product_slug, driver_type, driver_category)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 270, in get_latest
    raise RuntimeError(f"unable to find any drivers for: {product_slug}\n"
RuntimeError: unable to find any drivers for: power
Please ensure that the slug is correct.
Released lock for key /spicerack/locks/cookbooks/sre.hardware.upgrade-firmware: {'concurrency': 20, 'created': '2024-03-22 10:59:44.025830', 'owner': 'btullis@cumin1002 [2465965]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet

It seems like it might be related to the naming of this device type in Netbox: https://netbox.wikimedia.org/dcim/device-types/287/


It's not blocking me though, so I'm just leaving this here in case it is helpful.

@BTullis indeed, that's another new device type created with the wrong slug. I've updated the slug in Netbox to fix it.