cookbook sre.hardware.upgrade-firmware nic firmware comparison mismatch
Closed, Resolved, Public

Description

When running NIC firmware updates via the cookbook, I noticed it isn't comparing the current and target versions correctly. As a result, the script thinks something went wrong and attempts the upload twice; the firmware uploads correctly, but the comparison still fails.

Short comparison string example: ganeti5005 (NETWORK): Something went wrong, the current version (21.85.21.92) does not match the most target (85.21.92)

Full script run:

robh@cumin2002:~$ sudo cookbook sre.hardware.upgrade-firmware --new --component nic ganeti5005
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti5005']
Management Password: 
ganeti5005.eqsin.wmnet (Gen 14): starting
ganeti5005.eqsin.wmnet (NETWORK): update
poweredge-r440: picking DellDriverCategory.NETWORK update file
We have found multiple entries please pick from the list below:
0: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE
1: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_230WD_WN64_22.21.07.80_01.EXE
2: /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_DFF48_WN64_22.00.6.EXE
3: Download new file
==> Please select the entry you want
> 0
ganeti5005.eqsin.wmnet (NETWORK): target_version: 85.21.92, current_version: 22.0.6
==> ganeti5005.eqsin.wmnet NETWORK: About to upload /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> go
==> ganeti5005.eqsin.wmnet NETWORK: About to install Available-107815-21.85.21.92__NIC.Mezzanine.1-2-1, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> go
ganeti5005.eqsin.wmnet (NETWORK): has job ID - /redfish/v1/TaskService/Tasks/JID_704026221903
==> ganeti5005: About to reboot to apply update, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> go
Resetting chassis power status for ganeti5005 to ForceRestart
[IDRAC.2.5.RED002] Package successfully downloaded.
[1/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Pending, completed=None%
[IDRAC.2.5.JCP001] Task successfully scheduled.
[2/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Starting, completed=0%
[IDRAC.2.5.JCP001] Task successfully scheduled.
[3/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Starting, completed=0%
[IDRAC.2.5.PR20] Job in progress.
[4/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[5/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[6/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[7/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[8/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[9/30, retrying in 30.00s] Polling task: JID_704026221903 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR19] Job completed successfully.
ganeti5005 (NETWORK): now at version: 22.0.6
ganeti5005 (NETWORK): Something went wrong, the current version (22.0.6) does not match the most target (85.21.92)
ganeti5005.eqsin.wmnet (NETWORK): update
ganeti5005.eqsin.wmnet (NETWORK): target_version: 85.21.92, current_version: 21.85.21.92
==> ganeti5005.eqsin.wmnet NETWORK: About to upload /srv/firmware/poweredge-r440/NETWORK/Network_Firmware_RXP80_WN64_21.85.21.92.EXE, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> go
==> ganeti5005.eqsin.wmnet NETWORK: About to install Available-107815-21.85.21.92__NIC.Mezzanine.1-1-1, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> go
ganeti5005.eqsin.wmnet (NETWORK): has job ID - /redfish/v1/TaskService/Tasks/JID_704030142917
==> ganeti5005: About to reboot to apply update, please confirm
Type "go" to proceed or "abort" to interrupt the execution
> 
> go
Resetting chassis power status for ganeti5005 to ForceRestart
[IDRAC.2.5.RED110] Downloading the redfish_upload_file.EXE update package.
[1/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Pending, completed=None%
[IDRAC.2.5.JCP001] Task successfully scheduled.
[2/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Starting, completed=0%
[IDRAC.2.5.JCP001] Task successfully scheduled.
[3/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Starting, completed=0%
[IDRAC.2.5.PR20] Job in progress.
[4/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[5/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[6/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[7/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[8/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR20] Job in progress.
[9/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[9/30, retrying in 30.00s] Polling task: JID_704030142917 not completed yet: status=OK, state=Running, completed=1%
[IDRAC.2.5.PR19] Job completed successfully.
ganeti5005 (NETWORK): now at version: 21.85.21.92
ganeti5005 (NETWORK): Something went wrong, the current version (21.85.21.92) does not match the most target (85.21.92)
END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti5005']

Event Timeline

RobH triaged this task as Medium priority.

It is also a bit odd that the final lines report that something went wrong, but the very next line gives the run an END (PASS).

In reality, it does pass by updating the correct firmware. However, the script thinks it has a version mismatch, which I think should result in a fail not a pass.
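To illustrate the pass/fail point, here is a minimal sketch of propagating a mismatch into the exit status. It assumes the framework derives END (PASS) or END (FAIL) from an integer exit code, as the `exit_code=0` in the log suggests. The names `upgrade_host`, `run`, and `read_version` are hypothetical and are not taken from the actual cookbook.

```python
from typing import Callable, Iterable


def upgrade_host(host: str, target: str, read_version: Callable[[str], str]) -> bool:
    """Return True only if the host reports the target version after the update."""
    current = read_version(host)  # hypothetical callback that queries the host
    if current != target:
        print(f"{host}: Something went wrong, the current version ({current}) "
              f"does not match the target ({target})")
        return False
    return True


def run(hosts: Iterable[str], target: str, read_version: Callable[[str], str]) -> int:
    """Return 0 when every host ended on the target version, 1 otherwise."""
    failed = [h for h in hosts if not upgrade_host(h, target, read_version)]
    return 1 if failed else 0
```

With something like this, any "went wrong" line would be matched by a non-zero exit code instead of being logged and then ignored.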

So IMO two things should happen:

  • update script so any 'went wrong' lines result in an END (FAIL) or something other than an END (PASS)
  • update script so the version comparison doesn't accidentally cut off the leading component (in this case 21.). It seems like the first 21. or 22. is being dropped from the comparison string.
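One plausible way the truncation could happen, sketched below: a pattern that expects exactly three numeric components before `.EXE` silently drops the first part of a four-component version. Both regexes and the helper names are assumptions for illustration; the cookbook's real parsing code is not shown in this task.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: exactly three dotted numbers right before ".EXE".
THREE_PART = re.compile(r"(\d+\.\d+\.\d+)(?:_\d+)?\.EXE$")
# Hypothetical fix: accept three or four dotted numbers.
FLEXIBLE = re.compile(r"(\d+(?:\.\d+){2,3})(?:_\d+)?\.EXE$")


def extract(pattern: "re.Pattern[str]", filename: str) -> Optional[str]:
    """Return the version string captured from a firmware file name, if any."""
    match = pattern.search(filename)
    return match.group(1) if match else None


def as_tuple(version: str) -> Tuple[int, ...]:
    """Normalise '22.00.6' to (22, 0, 6) so components compare numerically."""
    return tuple(int(part) for part in version.split("."))


name = "Network_Firmware_RXP80_WN64_21.85.21.92.EXE"
print(extract(THREE_PART, name))  # 85.21.92
print(extract(FLEXIBLE, name))    # 21.85.21.92
```

Comparing the normalised tuples rather than the raw strings also keeps `22.00.6` and `22.0.6` equal.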

Change 867538 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/cookbooks@master] sre.hardware: support 4 diget version numbers for network drivers

https://gerrit.wikimedia.org/r/867538

Change 867544 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/cookbooks@master] sre.hardware.upgrade-firmware: return status of the cookbook

https://gerrit.wikimedia.org/r/867544

Change 867538 merged by jenkins-bot:

[operations/cookbooks@master] sre.hardware: support 4 digit version numbers for network drivers

https://gerrit.wikimedia.org/r/867538

ganeti4007.ulsfo.wmnet (NETWORK): target_version: 21.85.21.92, current_version: 22.0.6
ganeti4007.ulsfo.wmnet (NETWORK): current version 22.0.6 is ahead of target version 21.85.21.92, use force to downgrade
Resetting chassis power status for ganeti4007 to ForceOff
END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti4007']

Now it won't allow a downgrade because the target version is older, but that is something we have to do for every new Dell: downgrade to older NIC firmware that doesn't break PXE image loading.

Can this be adjusted to allow any application regardless of version comparison, and only use the comparison to avoid reinstalling the same version? If the target is older, perhaps a confirm dialog, but honestly we have to downgrade every single Dell 10/25G NIC that arrives.
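The behaviour being asked for here can be sketched as a small decision function: skip when the versions are equal, install when the target is newer, and require an explicit opt-in when the target is older. This is only an illustration; `Action`, `decide`, and the `force` parameter are hypothetical names, and the cookbook's real logic may differ.

```python
from enum import Enum
from typing import Tuple


class Action(Enum):
    SKIP = "skip"                # already on the target version
    INSTALL = "install"          # upgrade, or a downgrade that was explicitly forced
    NEEDS_FORCE = "needs_force"  # target is older and no force was given


def as_tuple(version: str) -> Tuple[int, ...]:
    """Turn '22.0.6' into (22, 0, 6) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def decide(current: str, target: str, force: bool = False) -> Action:
    """Decide what to do given the current and target firmware versions."""
    cur, tgt = as_tuple(current), as_tuple(target)
    if cur == tgt:
        return Action.SKIP
    if cur > tgt and not force:
        return Action.NEEDS_FORCE
    return Action.INSTALL


print(decide("22.0.6", "21.85.21.92"))              # Action.NEEDS_FORCE
print(decide("22.0.6", "21.85.21.92", force=True))  # Action.INSTALL
```

In this sketch an equal version is skipped even when forced; whether a forced reinstall of the same version should be allowed is a separate design choice.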

nm, "use force to downgrade", it's in there.

what's the full command you are using robh?

Full command for what exactly? The downgrade issue is already covered by the hint in the cookbook output ("use force to downgrade"); it's in there.

Also I've successfully used the script since its update so I think this can close.

jbond claimed this task.

I meant the full cookbook command that you ran. However, since you've successfully used the script since its update, no need now :)

Change 867544 merged by jenkins-bot:

[operations/cookbooks@master] sre.hardware.upgrade-firmware: return status of the cookbook

https://gerrit.wikimedia.org/r/867544