
Broadcom BCM57412 10G NIC and Bullseye installer
Open, Medium, Public

Description

@fgiunchedi reimaged thanos-fe2001 to Debian Bullseye, but d-i didn't bring up the link with Linux 5.10:

	[   10.305584] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - recei 
	[   10.305590] bnxt_en 0000:af:00.0 enp175s0f0np0: FEC autoneg off encoding: None                                   
	[   12.105509] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down             
	[   13.356575] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - recei 
	[   13.356581] bnxt_en 0000:af:00.0 enp175s0f0np0: FEC autoneg off encoding: None                                   
	[   13.356635] IPv6: ADDRCONF(NETDEV_CHANGE): enp175s0f0np0: link becomes ready 
	[   13.745342] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down
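
A minimal sketch for spotting this flapping pattern in a saved kernel log, for example when checking other hosts from the list. It assumes log lines shaped like the bnxt_en messages above; the function names `link_transitions` and `count_flaps` are made up for illustration and are not part of any existing tooling.

```python
import re

# Matches kernel log lines such as:
#   [   12.105509] bnxt_en 0000:af:00.0 enp175s0f0np0: NIC Link is Down
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+bnxt_en\s+\S+\s+(?P<iface>\S+):\s+"
    r"NIC Link is (?P<state>Up|Down)"
)


def link_transitions(log_text):
    """Return (timestamp, interface, state) tuples in log order."""
    events = []
    for line in log_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("iface"), match.group("state"))
            )
    return events


def count_flaps(events):
    """Count Up -> Down transitions per interface."""
    last_state = {}
    flaps = {}
    for _, iface, state in events:
        if last_state.get(iface) == "Up" and state == "Down":
            flaps[iface] = flaps.get(iface, 0) + 1
        last_state[iface] = state
    return flaps
```

Fed the excerpt above, `count_flaps(link_transitions(log))` reports two Up-to-Down transitions for enp175s0f0np0 within about four seconds of boot.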

The installation only worked after the NIC firmware was updated following https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#NICs

Given that we have over 400 servers with that type of NIC (full list at https://phabricator.wikimedia.org/P16824), I am opening this task for

  • better visibility
  • more debugging and to track if this applies to all such installations
  • if so, eventually automating the NIC firmware update

Event Timeline

Legoktm triaged this task as Medium priority.Jul 26 2021, 11:16 PM

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:04:06Z] <godog> upgrade NIC firmware on thanos-fe100[12] - T286722

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:08:41Z] <godog> upgrade NIC firmware on thanos-fe100[34] - T286722

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:09:26Z] <godog> upgrade NIC firmware on thanos-be1* - T286722

Mentioned in SAL (#wikimedia-operations) [2021-08-12T12:43:48Z] <godog> upgrade NIC firmware on thanos-be2* / thanos-fe2* - T286722

This also affects backup hosts: at the very least backup2002, and most likely the others in backup[12]00[123].

Mentioned in SAL (#wikimedia-operations) [2022-04-28T11:35:44Z] <jynus> applying NIC firmware update onto backup2002 T286722

For logging, the exact firmware I am using for backup2002 is: NetXtreme-E Network Device Firmware 22.0 Version: 22.00.07.60, 22.00.07.60 File: Network_Firmware_NPNT5_WN64_22.00.07.60.EXE Date: 9 Mar 2022

The card was: Broadcom Adv. Dual 10Gb Ethernet (x2) (Broadcom NetXtreme Gigabit Ethernet)

It doesn't seem to have worked (but the network itself seems to work with the older kernel).

@Papaul @MoritzMuehlenhoff I am a bit lost now, as apparently the NIC firmware upgrade didn't fix my issue the way it did for Filippo. Should we try upgrading other firmware (e.g. BIOS)? Or try the same, older, version he used? Could it be some other issue? Any advice is welcome.

backup2002 is up right now with the old kernel.

@jcrespo we have seen some issues with version 22 on those cards on some of the cloud nodes. Maybe try downgrading the firmware to version 21.80.9 and let me know.

Thank you, @Papaul, that's exactly what I hoped to get: some insight from people who may have had more experience with similar issues, so I can try something that could fix it. Will do as suggested and update results here.

The firmware update worked, I just don't know which version was needed (I am currently using 21.80.16.95). The problem was that, after the first attempt failed and I installed an older kernel, the network device changed name, and I didn't notice that dmesg was now showing link (the interface was misconfigured). I am sure that *some* firmware update was needed, as I checked that I didn't get link at first, and the Debian installer also failed to DHCP, but maybe the original update (22.00.07.60) was enough.
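
To avoid both pitfalls (unknown firmware version, renamed interface), one option is to record what `ethtool -i` reports and key it on the PCI address instead of the interface name. A rough sketch, assuming `ethtool` is installed on the host; the helper names `parse_ethtool_info` and `nic_info` are hypothetical:

```python
import subprocess


def parse_ethtool_info(text):
    """Parse the "key: value" lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address in "bus-info: 0000:af:00.0" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_info(iface):
    """Run `ethtool -i <iface>` and return driver, firmware and bus info."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        check=True,
        capture_output=True,
        text=True,
    )
    info = parse_ethtool_info(result.stdout)
    # Only the fields relevant here; any that ethtool omits come back as None.
    return {key: info.get(key) for key in ("driver", "firmware-version", "bus-info")}
```

Comparing `bus-info` before and after a kernel change shows whether two differently named interfaces are the same physical port, and `firmware-version` gives a value to log alongside each update.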

Heads up to @MoritzMuehlenhoff, as I believe he was bitten by the stable device name change for a NIC, too :-/.

I updated the firmware of the other affected backup hosts (backup2002, backup1001, backup2001) to 21.80.16.95. They all seem to work as expected, and I was able to upgrade them to Bullseye.