Broadcom BCM57412 10G NIC and Bullseye installer
Open, Medium, Public

Description

@fgiunchedi reimaged thanos-fe2001 to Debian Bullseye, but d-i didn't bring up the link with Linux 5.10:

	[   10.305584] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - recei 
	[   10.305590] bnxt_en 0000:af:00.0 enp175s0f0np0: FEC autoneg off encoding: None                                   
	[   12.105509] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down             
	[   13.356575] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - recei 
	[   13.356581] bnxt_en 0000:af:00.0 enp175s0f0np0: FEC autoneg off encoding: None                                   
	[   13.356635] IPv6: ADDRCONF(NETDEV_CHANGE): enp175s0f0np0: link becomes ready 
	[   13.745342] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down

The installation only worked after the NIC firmware was updated following https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#NICs

Given that we have over 400 servers with that type of NIC (full list at https://phabricator.wikimedia.org/P16824), opening this task for

  • better visibility
  • more debugging and to track if this applies to all such installations
  • if so, eventually automating the NIC firmware update
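To help check whether other hosts show the same symptom, here is a minimal sketch that counts Up-to-Down link transitions per interface from dmesg text. It assumes the `bnxt_en` message format shown in the excerpt in the description; the function names and the trimmed sample are illustrative only and are not part of any existing tooling.

```python
import re

# Matches bnxt_en link-state messages in the format seen in the excerpt above.
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bnxt_en\s+\S+\s+(?P<iface>\S+): "
    r"NIC Link is (?P<state>Up|Down)"
)


def link_events(dmesg_text):
    """Return (timestamp, interface, state) tuples in log order."""
    events = []
    for line in dmesg_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            events.append((float(match["ts"]), match["iface"], match["state"]))
    return events


def count_flaps(events):
    """Count Up -> Down transitions per interface."""
    last_state = {}
    flaps = {}
    for _ts, iface, state in events:
        if last_state.get(iface) == "Up" and state == "Down":
            flaps[iface] = flaps.get(iface, 0) + 1
        last_state[iface] = state
    return flaps


# Sample lines trimmed from the excerpt in the description.
SAMPLE = """\
[   10.305584] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Up, 10000 Mbps full duplex
[   12.105509] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down
[   13.356575] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Up, 10000 Mbps full duplex
[   13.745342] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down
"""

if __name__ == "__main__":
    print(count_flaps(link_events(SAMPLE)))
```

More than one flap within a few seconds of boot, as in the sample, would be a hint that a host is affected; it is not proof, since a link can also drop for unrelated reasons.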

Event Timeline

Legoktm triaged this task as Medium priority.Jul 26 2021, 11:16 PM

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:04:06Z] <godog> upgrade NIC firmware on thanos-fe100[12] - T286722

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:08:41Z] <godog> upgrade NIC firmware on thanos-fe100[34] - T286722

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:09:26Z] <godog> upgrade NIC firmware on thanos-be1* - T286722

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:43:48Z] <godog> upgrade NIC firmware on thanos-be2* / thanos-fe2* - T286722

This also affects backup hosts: at the very least backup2002, and most likely the others in backup[12]00[123].

Mentioned in SAL (#wikimedia-operations) [2022-04-28T11:35:44Z] <jynus> applying NIC firmware update onto backup2002 T286722

For logging, the exact firmware I am using for backup2002 is: NetXtreme-E Network Device Firmware 22.0 Version: 22.00.07.60, 22.00.07.60 File: Network_Firmware_NPNT5_WN64_22.00.07.60.EXE Date: 9 Mar 2022

The card was: Broadcom Adv. Dual 10Gb Ethernet (x2) (Broadcom NetXtreme Gigabit Ethernet)

It doesn't seem to have worked (but the network itself seems to work with the older kernel).

@Papaul @MoritzMuehlenhoff I am a bit lost now, as apparently the NIC firmware upgrade didn't fix my issue the way it did for Filippo. Should we try upgrading more firmware (e.g. BIOS)? Should I try the same, older version he used? Could it be some other issue? Any advice is welcome.

backup2002 is up right now with the old kernel.

@jcrespo we have seen some issues with version 22 on those cards on some of the cloud nodes. Maybe try downgrading the firmware to version 21.80.9 and let me know.

Thank you, @Papaul, that's exactly what I hoped to get: some insight from people who may have had more experience with similar issues, so I can try something that could fix the issue. Will do as suggested and update results here.

The firmware update worked, though I don't know which version did it (I am currently using 21.80.16.95). The problem was that, after it had first failed and I installed an older kernel, the network device changed name, and I didn't notice that I was now getting link in dmesg (the interface was just misconfigured). I am sure that *some* firmware update was needed, as I checked that I didn't get link at first, and the Debian installer also failed to DHCP, but maybe the original update (22.00.07.60) was enough.

Heads up to @MoritzMuehlenhoff as I believe he was bitten by the stable device name change for a NIC, too :-/.

I updated the firmware of the other affected backup hosts (backup2002, backup1001, backup2001) to 21.80.16.95. They all seem to work as expected and I was able to upgrade them to Bullseye.

Adding to this task in case it helps someone else; thanks to @fgiunchedi and @jcrespo for documenting the original findings.

We ran into the same issue (PXE boot works fine, then DHCP fails in d-i) on the cp hosts Bullseye upgrade (T286722). I discovered this ticket and noticed the outdated firmware, so I thought I would try to update it, and it worked!

We should perhaps come up with a better plan to upgrade the firmware everywhere, since we should be prepared to move to Bullseye soon (the cp hosts have already started, as an example), and I am happy to help with that. I am sure there is a better way than manually updating hosts via the HTTP management interface, so I am all ears.

Thanks for all the help.

@ssingh there is the sre.hardware.upgrade-firmware cookbook that is already being used by DCOps. It's still in its early stages but it works ;) Feel free to ping @jbond and @Papaul for more details on how to use it.

The upgrade-firmware cookbook seems to get unexpected data from logstash102[78]:

sudo cookbook sre.hardware.upgrade-firmware logstash1028 -c nic --new

logstash1028.eqiad.wmnet (Gen 14): starting
logstash1028.eqiad.wmnet (NETWORK): update
Exception raised while executing cookbook sre.hardware.upgrade-firmware:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 871, in run
    self.update_driver(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 819, in update_driver
    latest_version, job_id = self._update(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 584, in _update
    current_version = self.get_version(
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 310, in get_version
    return self._get_version_odata(redfish_host, driver_category, odata_id)
  File "/srv/deployment/spicerack/cookbooks/sre/hardware/upgrade-firmware.py", line 288, in _get_version_odata
    return version.parse(odata_version)
  File "/usr/lib/python3/dist-packages/packaging/version.py", line 57, in parse
    return Version(version)
  File "/usr/lib/python3/dist-packages/packaging/version.py", line 296, in __init__
    match = self._regex.search(version)
TypeError: expected string or bytes-like object

Upgrading the iDRAC software seems to fix the exception. Previously, the host was on iDRAC 4. Now: T324606
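The `TypeError` in the traceback above means `version.parse()` was handed something that is not a string; a missing version field in the Redfish data is one plausible cause, but that is an assumption. Below is a minimal sketch of guarding against this before parsing. It is not the cookbook's code: `parse_firmware_version` is a hypothetical helper, and it uses a plain tuple of integers rather than `packaging.version` so it has no dependencies.

```python
def parse_firmware_version(raw):
    """Turn '21.80.16.95' into (21, 80, 16, 95).

    Returns None when the value is not a usable dotted-numeric string,
    instead of raising, so the caller can report the host and move on.
    """
    if not isinstance(raw, str) or not raw.strip():
        return None
    parts = raw.strip().split(".")
    if not all(part.isdigit() for part in parts):
        return None
    return tuple(int(part) for part in parts)


if __name__ == "__main__":
    # Version strings taken from the comments in this task.
    old = parse_firmware_version("21.80.16.95")
    new = parse_firmware_version("22.00.07.60")
    print(old, new, new > old)

    # A missing value is reported as None rather than raising TypeError.
    print(parse_firmware_version(None))
```

With a guard like this, a caller could log which host returned no version and skip it, rather than aborting the whole run with a traceback.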