cloudvirt1024.eqiad.wmnet DHCP problems
Closed, Resolved (Public)

Description

After the hack described in T303296, cloudvirt1024 DOES PXE boot and load the Debian installer.

The Debian installer fails early on with this message:

Network autoconfiguration failed

Your network is probably not using the DHCP protocol. Alternatively,
the DHCP server may be slow or some network hardware is not working
properly.

Event Timeline

This host is currently booted into the hdd install for debugging purposes. It's drained of VMs so can be rebooted at any time.

I had a look at that: the switch port sees the interface as down while the host is in D-I, including while D-I is trying DHCP.

So at this point I'd guess it's a driver/firmware issue where the 10G NIC doesn't work with Bullseye.

Maybe @MoritzMuehlenhoff has some ideas there?

FYI, the 10G NIC is a "540-BBZI : QLogic FastLinQ 41112 Dual Port 10GbE SFP+ Adapter, PCIe Low Profile"
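
The link-down observation above can also be cross-checked from the host side while it is booted into the hdd install. The following is a minimal sketch, assuming a Linux host with the standard sysfs layout under /sys/class/net; the interface name `eno1` and the helper name `link_report` are hypothetical placeholders, not taken from this task.

```python
import os
from pathlib import Path


def link_report(iface, sysfs_root="/sys/class/net"):
    """Return operstate, carrier and bound driver name for one interface.

    Any value that cannot be read is reported as None. Reading `carrier`
    fails with an OSError while the interface is administratively down,
    so that case is reported as None as well.
    """
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            return None

    # device/driver is a symlink to the kernel driver bound to the NIC.
    driver_link = base / "device" / "driver"
    driver = None
    if driver_link.exists():
        driver = os.path.basename(os.path.realpath(driver_link))

    return {
        "operstate": read("operstate"),
        "carrier": read("carrier"),
        "driver": driver,
    }


if __name__ == "__main__":
    # "eno1" is a hypothetical interface name; substitute the 10G port.
    print(link_report("eno1"))
```

A `carrier` of `0` (or `None`) together with an `operstate` of `down` would match what the switch reports; it does not by itself distinguish a driver problem from a firmware or cabling one.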

Let's upgrade the firmware here. We've seen similar upgrade/reimage failures before, where newer kernels were more sensitive to outdated firmware, e.g. https://phabricator.wikimedia.org/T296856

Aklapper removed a subscriber: ops-eqiad.

[Please add project tags under project tags instead of subscribers - thanks!]

Off topic, but should we have a bot to do that automatically? I see it happening quite regularly.

Updated the firmware for: iDRAC, BIOS, network cards.

After the firmware upgrades, the behavior is somewhat worse: PXE boot now fails entirely (although DHCP still seems to be working!)

Booting from BRCM MBA Slot AF00 v220.0.2.0

Broadcom UNDI PXE-2.1 v220.0.2.0
Copyright (C) 2000-2021 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: B0 26 28 29 5D F0  GUID: 4C4C4544-005A-5910-805A-C4C04F515032
CLIENT IP: 10.64.20.43  MASK: 255.255.255.0  DHCP IP: 208.80.154.32
GATEWAY IP: 10.64.20.1
      
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Failed to load ldlinux.c32
Boot failed: press a key to retry, or wait for reset...
..
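
The banner above already shows that the DHCP exchange completed: the client has an address, mask and gateway. As a rough illustration, the sketch below parses those banner lines and checks which of the reported addresses sit on the client's own subnet. The function names `parse_banner` and `summarize` are made up for this example, and the check says nothing about why ldlinux.c32 failed to load.

```python
import ipaddress
import re

# Banner lines copied from the console output in this task.
BANNER = """\
CLIENT MAC ADDR: B0 26 28 29 5D F0  GUID: 4C4C4544-005A-5910-805A-C4C04F515032
CLIENT IP: 10.64.20.43  MASK: 255.255.255.0  DHCP IP: 208.80.154.32
GATEWAY IP: 10.64.20.1
"""

# Map short field names to the labels printed by the PXE ROM.
LABELS = {
    "client": "CLIENT IP",
    "mask": "MASK",
    "dhcp": "DHCP IP",
    "gateway": "GATEWAY IP",
}


def parse_banner(text):
    """Extract the dotted-quad value that follows each known label."""
    fields = {}
    for key, label in LABELS.items():
        match = re.search(rf"{label}:\s*(\d+\.\d+\.\d+\.\d+)", text)
        if match:
            fields[key] = ipaddress.ip_address(match.group(1))
    return fields


def summarize(fields):
    """Report the client network and whether gateway/DHCP are on-link."""
    network = ipaddress.ip_network(
        f"{fields['client']}/{fields['mask']}", strict=False
    )
    return {
        "network": str(network),
        "gateway_on_link": fields["gateway"] in network,
        "dhcp_on_link": fields["dhcp"] in network,
    }


if __name__ == "__main__":
    print(summarize(parse_banner(BANNER)))
```

For this banner the gateway is on the client's /24 and the address reported as DHCP IP is not, so anything the client requests from that address has to be routed through the gateway.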

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Update: this still fails with the same message: "Failed to load ldlinux.c32"

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Last run:

CLIENT MAC ADDR: B0 26 28 29 5D F0  GUID: 4C4C4544-005A-5910-805A-C4C04F515032
CLIENT IP: 10.64.20.43  MASK: 255.255.255.0  DHCP IP: 208.80.154.32
GATEWAY IP: 10.64.20.1
      
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Failed to load ldlinux.c32
Boot failed: press a key to retry, or wait for reset...