clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye
Closed, ResolvedPublic

Description

While doing a reimage for https://phabricator.wikimedia.org/T299480, the installer prompted me with the following message:

Some of your hardware needs non-free firmware files to operate. The firmware can be loaded from removable media, such as a USB stick or floppy.
The missing firmware files are: bnx2x/bnx2x-e2-7.13.21.0.fw bnx2x/bnx2x-e2-7.13.21.0.fw

Full output: https://phabricator.wikimedia.org/P24619

dmesg

(output is somewhat garbled, feel free to ssh in to clouddb.mgmt.eqiad.wmnet and choose "Execute a shell" and run dmesg yourself)

https://phabricator.wikimedia.org/P24621

Host information:

~ # uname -a
Linux (none) 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux

Other notes

I tried a ping, which failed, and also printed the network state (the output is garbled in this paste):

~ # ping 10.64.48.11
PING 10.64.48.11 (10.64.48.11): 56 data bytes
ping: sendto: Network is unreachable
~ # ip a sh
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
    link/ether b8:83:03:53:f0:48 brd ff:ff:ff:ff:ff:ff
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
    link/ether b8:83:03:53:f0:49 brd ff:ff:ff:ff:ff:ff
4: eno3: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000
    link/ether b8:83:03:53:f0:4a brd ff:ff:ff:ff:ff:ff
5: eno4: <BROADCAST,MULTICAST> mtu 1500 qdisc mq qlen 1000

Fortunately this host is only used at the start of the month to import data from mysql to hdfs, so we have until May 1 before this becomes an issue. That said, I have no idea how to resolve this, and it might require physical installation media as the prompt suggests.

We could also try reimaging this back to Debian 10 Buster for now.

Event Timeline

Possibly relevant links thanks to @jhathaway:

https://packages.debian.org/bullseye/firmware-bnx2x does not contain the requested version (bnx2x/bnx2x-e2-7.13.21.0.fw)

Bug with a different kernel version: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006500 (this is Linux 5.16 and the host has 5.10.106-1 now)

@Marostegui says I should tag @MoritzMuehlenhoff - hopefully we can all solve this together :)

razzi renamed this task from "clouddb1021 missing firmware: debian installer cannot connect to network" to "clouddb1021 missing firmware; debian installer cannot connect to network". Apr 14 2022, 6:50 AM

FYI this host is alerting in Netbox:

	clouddb1021 	Device is Active in Netbox but is missing from PuppetDB (should be ('inventory', 'offline', 'planned', 'decommissioning', 'failed'))

I'm setting it to "failed", please set it back to Active when able.

If this is going to take long, we should probably reimage that host back to Buster while we figure out next steps, so we don't get mysql replication behind for many days.

Once clouddb1021 is reimaged please run: maintain-views --database enwiki --table flaggedtemplates --replace as this is pending from T297189

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1021.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1021.eqiad.wmnet with OS buster completed:

  • clouddb1021 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204181932_razzi_2910212_clouddb1021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I reimaged the host back to Buster for now, which went smoothly. Replication lag is a few days behind but is catching up gradually: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=clouddb1021&var-port=13311&viewPanel=6&from=now-5m&to=now

I also ran maintain-views which succeeded with the following output.

razzi@clouddb1021:~$ sudo maintain-views --database enwiki --table flaggedtemplates --replace
2022-04-18 20:11:48,833 INFO Full views for enwiki:
2022-04-18 20:11:48,835 INFO [enwiki_p.flaggedtemplates]
2022-04-18 20:11:48,838 INFO Custom views for enwiki:

razzi renamed this task from "clouddb1021 missing firmware; debian installer cannot connect to network" to "clouddb1021 missing network firmware bnx2x/bnx2x-e2-7.13.21.0.fw in Debian 11 Bullseye". Apr 18 2022, 8:18 PM

https://bugzilla.kernel.org/show_bug.cgi?id=215627 is specifically for a bug in 5.15 and doesn't affect us.

The error message is a bit of a red herring: the tg3 driver tries to load firmware if available, but works fine without it (the firmware is only needed for some additional features). We already have a few servers running Debian Bullseye with that NIC, e.g. cloudnet1003:

[    8.206891] bnx2x 0000:04:00.0: firmware: failed to load bnx2x/bnx2x-e2-7.13.21.0.fw (-2)
[    8.260616] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[    8.307300] bnx2x 0000:04:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.21.0.fw failed with error -2
[    8.353016] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.15.0.fw
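
To check other hosts for the same pattern (a failed load of the newer firmware file followed by a successful load of an older one), something like the following sketch could be run against saved dmesg output. The function names and the `SAMPLE` constant are made up for illustration, and the regular expressions only cover the two message formats quoted above; other kernels may word these messages differently.

```python
import re

# Patterns for the two kernel message formats quoted in the log above.
FAILED = re.compile(r"firmware: failed to load (\S+) \((-?\d+)\)")
LOADED = re.compile(r"firmware: direct-loading firmware (\S+)")

# Sample input: the cloudnet1003 lines from this task.
SAMPLE = """\
[    8.206891] bnx2x 0000:04:00.0: firmware: failed to load bnx2x/bnx2x-e2-7.13.21.0.fw (-2)
[    8.260616] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[    8.307300] bnx2x 0000:04:00.0: Direct firmware load for bnx2x/bnx2x-e2-7.13.21.0.fw failed with error -2
[    8.353016] bnx2x 0000:04:00.0: firmware: direct-loading firmware bnx2x/bnx2x-e2-7.13.15.0.fw
"""


def summarize_firmware(dmesg_text):
    """Return (failed, loaded) sets of firmware file names found in dmesg text."""
    failed, loaded = set(), set()
    for line in dmesg_text.splitlines():
        match = FAILED.search(line)
        if match:
            failed.add(match.group(1))
            continue
        match = LOADED.search(line)
        if match:
            loaded.add(match.group(1))
    return failed, loaded


def fell_back(dmesg_text, prefix="bnx2x/"):
    """True if one file under `prefix` failed to load and a different one loaded."""
    failed, loaded = summarize_firmware(dmesg_text)
    has_failure = any(name.startswith(prefix) for name in failed)
    has_fallback = any(
        name.startswith(prefix) and name not in failed for name in loaded
    )
    return has_failure and has_fallback


if __name__ == "__main__":
    print(summarize_firmware(SAMPLE))
    print(fell_back(SAMPLE))
```

On a live host the input could come from `dmesg` piped to a file; that part is left out here to keep the sketch self-contained.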

The Debian installer in bullseye introduced some changes to better deal with the fallout of missing firmware: on some GPUs, people were only seeing a black screen, since the firmware in question was required even to present the info screen. To mitigate that, there is now code which actively detects such cases and shows a warning. This was added in hw-detect: https://tracker.debian.org/news/1245038/accepted-hw-detect-1145-source-into-unstable/ (hw-detect is a d-i component).

The case of the tg3 firmware is even mentioned there explicitly: https://www.debian.org/releases/bullseye/amd64/ch02s02 So it might be a case where the modalias information tells hw-detect that clouddb1021 requires the firmware, while in fact it doesn't. One puzzling thing is that we haven't seen this before. Or maybe this also happened as part of the cloudnet* install and was simply acknowledged interactively in d-i over the serial console (by replying "No" to the "Load missing firmware from removable media?" question). I'll do some spelunking in the hw-detect code to see if there's a debconf preseed option to skip that question entirely.

Relatedly (coincidentally posted today!), Debian has also bootstrapped the discussion to finally address the whole firmware handling more sensibly: https://blog.einval.com/2022/04/19

Change 784259 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Don't prompt for loading additional firmware in d-i

https://gerrit.wikimedia.org/r/784259
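
The content of the change is not quoted in this task. As an assumption only: hw-detect defines a boolean debconf question named `hw-detect/load_firmware` whose prompt text matches the "Load missing firmware from removable media?" question mentioned above, so a preseed line answering it with `false` is one plausible way to suppress the prompt. The sketch below checks a preseed file for such a line; the function name and the example fragment are hypothetical and not taken from the patch.

```python
# Assumption: the firmware prompt is controlled by hw-detect's boolean
# debconf question "hw-detect/load_firmware". This is not confirmed by
# the task text; verify against the actual Gerrit change before relying on it.
PRESEED_KEY = "hw-detect/load_firmware"

# Hypothetical preseed fragment used only to exercise the function.
EXAMPLE = """\
# do not ask to load firmware from removable media
d-i hw-detect/load_firmware boolean false
"""


def firmware_prompt_disabled(preseed_text):
    """Return True if the first answer for PRESEED_KEY in the text is false.

    Preseed lines have the form: <owner> <question> <type> <value>.
    Blank lines and lines starting with '#' are ignored.
    """
    for raw in preseed_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[1] == PRESEED_KEY and fields[2] == "boolean":
            return fields[3].strip().lower() == "false"
    return False


if __name__ == "__main__":
    print(firmware_prompt_disabled(EXAMPLE))
```

A check like this could be run against the rendered preseed file before a reimage, to confirm the installer will not stop at the interactive question.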

I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/784259/ should resolve this. Let me know when you want to re-attempt a bullseye install of clouddb1021, then I can deploy that change beforehand.

I did the reimage again just now and it worked fine selecting "No" when prompted to load missing firmware. @MoritzMuehlenhoff I misread your comment and didn't realize your change should have been submitted first, sorry!! Let me know if I can still be useful in testing that, but otherwise, this ticket can be closed. Thanks for your input everybody.

Ack, I feel confident that the patch does the right thing, so I'll go ahead and merge. If you have another clouddb to reimage with the same NIC, the installer should no longer show that interactive prompt.

Change 784259 merged by Muehlenhoff:

[operations/puppet@production] Don't prompt for loading additional firmware in d-i

https://gerrit.wikimedia.org/r/784259

razzi claimed this task.

If the device is no longer in a failed state, please update its Netbox status.

Updated netbox status to "Active".