
cloudvirt1025 and cloudvirt1026 fail to pxe boot
Closed, Resolved (Public)

Description

These hosts register the intent to pxe boot but ultimately fall through to HDD boot.

This is presumably not the same issue as T303296 since there's a hack in place for that issue.

It's definitely not the same issue as T303773, since in that case the host DOES pxe boot but fails DHCP during the Debian install.

It's also probably not the same as T293391 since that seems to be a misplaced cable.

Event Timeline

This host is currently booted into the hdd install for debugging purposes. It's drained of VMs so can be rebooted at any time.

"This host registers the intent to pxe boot but ultimately falls through to HDD boot." is exactly the kind of error that we saw in https://phabricator.wikimedia.org/T296856 (and similar tasks which required a firmware update to fix it)

Aklapper removed a subscriber: ops-eqiad.

[Please add project tags under project tags instead of subscribers - thanks!]

Hey @Cmjohnson and/or @RobH, can we get a firmware update on this host and also on cloudvirt1024, which is exhibiting a different out-of-date-firmware issue? Thanks!

Updated BIOS, RAID, and NIC firmware (both 1G and 10G).

Andrew renamed this task from "cloudvirt1025 fails to pxe boot" to "cloudvirt1025 and cloudvirt1026 fail to pxe boot". (Mar 15 2022, 10:01 PM)
Andrew updated the task description.

Firmware updates do not seem to have improved anything; same failure to pxe boot as before.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

I was able to pxe boot with 1024 but got

Failed to load ldlinux.c32

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1026 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Failed to load ldlinux.c32

At first sight this might be an occurrence of this issue: https://bugs.launchpad.net/ubuntu/+source/syslinux/+bug/1577554
The suggested solution there is to copy ldlinux.c32 to the root of the root-path indicated in the pxelinux settings in DHCP.
But that's not a viable option for us, as we have different OS versions to support.

Initial PXE boot sequence:
CLIENT MAC ADDR: B0 26 28 29 5D F0  GUID: 4C4C4544-005A-5910-805A-C4C04F515032
CLIENT IP: 10.64.20.43  MASK: 255.255.255.0  DHCP IP: 208.80.154.32
GATEWAY IP: 10.64.20.1
      
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Running atftpd manually with the log level raised to debug:
install1003:~$ sudo /usr/sbin/atftpd --port 69 --tftpd-timeout 300 --retry-timeout 5 --mcast-port 1758 --mcast-addr 239.239.239.0-255 --mcast-ttl 1 --maxthread 100 --verbose=7 /srv/tftpboot --daemon --no-fork

returns:

Mar 16 09:00:14 install1003 atftpd[31809]: socket may listen on any address, including broadcast
Mar 16 09:00:14 install1003 atftpd[31809]: Creating new socket: 208.80.154.32:32991
Mar 16 09:00:14 install1003 atftpd[31809]: Serving lpxelinux.0 to 10.64.20.43:2070
Mar 16 09:00:14 install1003 atftpd[31809]: tsize option -> 75607
Mar 16 09:00:14 install1003 atftpd[31809]: Aborting transfer
Mar 16 09:00:14 install1003 atftpd[31809]: Server thread exiting
Mar 16 09:00:14 install1003 atftpd[31809]: socket may listen on any address, including broadcast
Mar 16 09:00:14 install1003 atftpd[31809]: Creating new socket: 208.80.154.32:52909
Mar 16 09:00:14 install1003 atftpd[31809]: Serving lpxelinux.0 to 10.64.20.43:2071
Mar 16 09:00:14 install1003 atftpd[31809]: blksize option -> 1456
Mar 16 09:00:14 install1003 atftpd[31809]: End of transfer
Mar 16 09:00:14 install1003 atftpd[31809]: Server thread exiting

So the client (PXE) requests the file lpxelinux.0 over TFTP; the transfer fails on the first try but succeeds on the second, as confirmed with tcpdump:

[Screenshot from 2022-03-16 10-05-35.png]

and then it goes fine: data packet 52 (last), with its ack.
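
The two-request pattern in the atftpd log above can be summarized with a small parser. This is only a sketch: it assumes log lines for one transfer are not interleaved with another, and the `Transfer` class and function names are made up for illustration. A tsize-only request that is aborted and then repeated with blksize is a pattern often seen when PXE firmware probes the file size first, so the first "failure" may be benign; that was not verified on these hosts. The block-count helper shows why a 75607-byte file at a block size of 1456 ends on data packet 52.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Excerpt of the atftpd debug log quoted in this task.
SAMPLE_LOG = """\
Mar 16 09:00:14 install1003 atftpd[31809]: Serving lpxelinux.0 to 10.64.20.43:2070
Mar 16 09:00:14 install1003 atftpd[31809]: tsize option -> 75607
Mar 16 09:00:14 install1003 atftpd[31809]: Aborting transfer
Mar 16 09:00:14 install1003 atftpd[31809]: Server thread exiting
Mar 16 09:00:14 install1003 atftpd[31809]: Serving lpxelinux.0 to 10.64.20.43:2071
Mar 16 09:00:14 install1003 atftpd[31809]: blksize option -> 1456
Mar 16 09:00:14 install1003 atftpd[31809]: End of transfer
Mar 16 09:00:14 install1003 atftpd[31809]: Server thread exiting
"""

SERVING = re.compile(r"Serving (?P<file>\S+) to (?P<client>\S+)$")
OPTION = re.compile(r"(?P<name>\w+) option -> (?P<value>\d+)$")


@dataclass
class Transfer:
    """One TFTP request as seen in the log (hypothetical helper type)."""
    filename: str
    client: str
    options: Dict[str, int] = field(default_factory=dict)
    outcome: Optional[str] = None


def parse_transfers(lines: List[str]) -> List[Transfer]:
    """Group log lines into transfers, assuming they are not interleaved."""
    transfers: List[Transfer] = []
    current: Optional[Transfer] = None
    for line in lines:
        # Drop the syslog prefix ("... atftpd[pid]: ") and keep the message.
        msg = line.split("]: ", 1)[-1].strip()
        match = SERVING.search(msg)
        if match:
            current = Transfer(match["file"], match["client"])
            transfers.append(current)
            continue
        if current is None:
            continue
        match = OPTION.search(msg)
        if match:
            current.options[match["name"]] = int(match["value"])
        elif msg == "Aborting transfer":
            current.outcome = "aborted"
        elif msg == "End of transfer":
            current.outcome = "completed"
    return transfers


def tftp_block_count(size: int, blksize: int) -> int:
    """Number of DATA packets: the last block must be shorter than blksize."""
    return size // blksize + 1


if __name__ == "__main__":
    for t in parse_transfers(SAMPLE_LOG.splitlines()):
        print(t.filename, t.client, t.options, t.outcome)
    print(tftp_block_count(75607, 1456))
```
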

The DHCP config has: option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/{s.distro}-installer/";

But tcpdump on apt1001 doesn't show any traffic with host 10.64.20.43.
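
To make explicit which HTTP request would be expected on the apt server, here is a rough sketch of how the pathprefix expands. It assumes `{s.distro}` is a Python `str.format` placeholder and that lpxelinux fetches ldlinux.c32 relative to the path prefix; the `Settings` class is a hypothetical stand-in for whatever object the real template is rendered with.

```python
from dataclasses import dataclass
from urllib.parse import urljoin

# Template as quoted from the DHCP config in this task.
PATHPREFIX_TEMPLATE = "http://apt.wikimedia.org/tftpboot/{s.distro}-installer/"


@dataclass
class Settings:
    """Hypothetical stand-in for the object referenced as `s` in the template."""
    distro: str


def expected_url(distro: str, filename: str = "ldlinux.c32") -> str:
    """Build the URL a client would request if it honours the path prefix."""
    prefix = PATHPREFIX_TEMPLATE.format(s=Settings(distro))
    return urljoin(prefix, filename)


if __name__ == "__main__":
    print(expected_url("bullseye"))
```

If no request for that URL reaches the web server, the client is probably failing before it even attempts the HTTP fetch, which would be consistent with the empty tcpdump.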

After a long waiting time, PXE shows:

Failed to load ldlinux.c32
Boot failed: press a key to retry, or wait for reset...

Is it possible to upgrade PXE? The current version seems quite old: 20150819

Just a note that the task for cloudvirt1024 is T303773; this task is for 1025/1026. They are failing for different reasons AFAICT.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye

Great news, everybody! For a sanity check I just now tried to re-image cloudvirt1016, and it won't pxe-boot either. That suggests that the fix put in place for T303296 might have reverted and is causing at least some of the current install issues. @ayounsi can you investigate this angle?

(And, cloudvirt1016 is now out of service, so you're welcome to re-experiment there.)

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye

Latest update:

cloudvirt1016 is working again, thanks to papaul's firmware updates.

Cloudvirt1024 still fails on image load.

Cloudvirt1025 and 1026 still fail early with no error message.

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1026 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

So on the Failed to load ldlinux.c32:

I got cloudvirt1024 to boot the debian installer using:

install1003:~$ cat /etc/dhcp/automation/ttyS1-115200/cloudvirt1024.conf
host cloudvirt1024 {
    host-identifier option agent.circuit-id "asw2-b-eqiad:xe-2/0/25.0:cloud-hosts1-eqiad";
    fixed-address 10.64.20.43;
    #option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/bullseye-installer/";
    option pxelinux.pathprefix "bullseye-installer/";
    filename "bullseye-installer/debian-installer/amd64/pxelinux.0";
}

The commented-out line is the default config; the workaround replaces it with the filename line and the other pxelinux.pathprefix line.

So basically this falls back to pxelinux instead of lpxelinux and does everything over TFTP instead of HTTP.
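
As a sketch of what automating this workaround might look like, the function below renders the same TFTP-only host block for an arbitrary host. The function name and parameters are hypothetical and this is not how the reimage cookbook generates its DHCP snippets; it only reproduces the layout of the cloudvirt1024.conf file shown above.

```python
def render_tftp_fallback(hostname: str, circuit_id: str, ip: str, distro: str) -> str:
    """Render a dhcpd host block that boots pxelinux.0 entirely over TFTP.

    Mirrors the manual workaround: a relative pathprefix instead of the
    http:// one, plus an explicit filename pointing at pxelinux.0.
    """
    return (
        f"host {hostname} {{\n"
        f'    host-identifier option agent.circuit-id "{circuit_id}";\n'
        f"    fixed-address {ip};\n"
        f'    option pxelinux.pathprefix "{distro}-installer/";\n'
        f'    filename "{distro}-installer/debian-installer/amd64/pxelinux.0";\n'
        "}\n"
    )


if __name__ == "__main__":
    print(
        render_tftp_fallback(
            hostname="cloudvirt1024",
            circuit_id="asw2-b-eqiad:xe-2/0/25.0:cloud-hosts1-eqiad",
            ip="10.64.20.43",
            distro="bullseye",
        )
    )
```
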

It looks like we're hitting a NIC firmware bug, similar to what's documented in this rabbit hole: http://marcoguerri.github.io/2016/03/pxeboot-failures-chelsio
If that's the case, other hosts might end up hitting it as well if we upgrade their NIC firmware.

I suggest we:

  • add that workaround in the re-image cookbook (@Volans?)
  • check whether we have other hosts with the same NIC
  • and/or see if we can upgrade/downgrade the firmware (@Papaul?)

In case this is an additional data point: I just reimaged cloudnet1003 and cloudnet1004 without any pxe or image issues.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye

Downgrading the NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the "Failed to load ldlinux.c32" issue.
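
For checking whether other hosts might be exposed, a version comparison along these lines could help. It is a sketch under a loud assumption: that any NIC firmware newer than the known-good 21.60.22.11 is suspect, when only 22.00.07.60 was actually observed failing here. The inventory in the example is made up for illustration and does not describe real hosts.

```python
from typing import Dict, List, Tuple

KNOWN_GOOD = "21.60.22.11"  # version the hosts were downgraded to in this task
KNOWN_BAD = "22.00.07.60"   # version that showed the ldlinux.c32 failure


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version into a tuple so it compares numerically."""
    return tuple(int(part) for part in version.split("."))


def suspect_hosts(inventory: Dict[str, str]) -> List[str]:
    """Return hosts whose firmware is newer than the known-good version.

    Assumption: anything newer than KNOWN_GOOD may carry the regression.
    """
    threshold = parse_version(KNOWN_GOOD)
    return sorted(
        host for host, version in inventory.items()
        if parse_version(version) > threshold
    )


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    example = {"host-a": KNOWN_BAD, "host-b": KNOWN_GOOD, "host-c": "21.40.2.1"}
    print(suspect_hosts(example))
```
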

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
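
The cookbook summaries in this task differ mainly in how far each run got before failing. As a rough triage aid, a parser like the following extracts the host, the result, and the last step reached. The function name is hypothetical and the parsing relies only on the bullet layout visible in these comments, not on any documented cookbook output format.

```python
import re
from typing import List, Optional, Tuple

FAILURE_LINE = "The reimage failed, see the cookbook logs for the details"
HEADER = re.compile(r"(\S+) \((PASS|FAIL|WARN)\)")

# Summary copied from the cloudvirt1025 run in this task.
SAMPLE = """\
  • cloudvirt1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details
"""


def summarize(summary: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (host, status, last step reached before any failure line)."""
    host: Optional[str] = None
    status: Optional[str] = None
    steps: List[str] = []
    for raw in summary.splitlines():
        line = raw.strip()
        if not line.startswith("•"):
            continue
        text = line[1:].strip()
        header = HEADER.fullmatch(text)
        if header:
            host, status = header.groups()
        elif text != FAILURE_LINE:
            steps.append(text)
    return host, status, (steps[-1] if steps else None)


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A run whose last step is "Host rebooted via IPMI" never reached the installer, while one ending at "Host up (Debian installer)" got past PXE and failed later.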

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1026 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203222146_andrew_2746015_cloudvirt1026.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Downgrading the NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the "Failed to load ldlinux.c32" issue.

So this sounds like a regression in the NIC firmware? Can we report this to Dell via some channel or our rep?

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1025 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203222146_andrew_2746011_cloudvirt1025.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1047 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye

These hosts are now reimaged and running VMs. Thanks for all the attention everyone!

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1047 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1027.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1027 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203231335_andrew_2877650_cloudvirt1027.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB