PXE boot failures on cloudvirt-wdqs100[1-3]
Closed, Resolved (Public)

Description

These three hosts fail to PXE boot. @ayounsi has already applied the option-82 hack on the switch, which should be what DHCP needs to work.
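
For readers unfamiliar with it: DHCP option 82 is the Relay Agent Information option (RFC 3046), a container of code/length/value sub-options that a relay inserts into client requests, with sub-option 1 as the Circuit ID and sub-option 2 as the Remote ID. The Python sketch below only illustrates that layout. It is not the switch configuration or the hack referred to above, and the function name `parse_option82` and the sample bytes are made up for illustration.

```python
from typing import Dict


def parse_option82(payload: bytes) -> Dict[int, bytes]:
    """Split an option 82 payload into {sub-option code: raw value}.

    `payload` is the option's value only, i.e. without the leading
    option code (82) and option length bytes.
    """
    subopts: Dict[int, bytes] = {}
    i = 0
    while i < len(payload):
        # Each sub-option starts with a 1-byte code and a 1-byte length.
        if i + 2 > len(payload):
            raise ValueError("truncated sub-option header")
        code, length = payload[i], payload[i + 1]
        value = payload[i + 2 : i + 2 + length]
        if len(value) != length:
            raise ValueError("truncated sub-option value")
        subopts[code] = value
        i += 2 + length
    return subopts


# Hypothetical payload: Circuit ID "ge-0/0/1" (sub-option 1)
# followed by Remote ID "sw1" (sub-option 2).
sample = bytes([1, 8]) + b"ge-0/0/1" + bytes([2, 3]) + b"sw1"
parsed = parse_option82(sample)
```

A DHCP server can match on these sub-options to decide which lease or boot configuration to hand out, which is why a missing or unexpected option 82 can make PXE requests go unanswered.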

Event Timeline

Here is the last thing I see before the screen goes blank and GRUB loads:

[Screenshot: Screen Shot 2022-04-04 at 8.27.13 AM.png]

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt-wdqs1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye completed:

  • cloudvirt-wdqs1001 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204041544_pt1979_2932364_cloudvirt-wdqs1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

The bug was introduced with this change:
https://gerrit.wikimedia.org/r/c/operations/homer/public/+/775279/

The following one should fix it:
https://gerrit.wikimedia.org/r/c/operations/homer/public/+/776973/

I pushed it manually to the router for testing and re-imaged cloudvirt-wdqs1001, which worked.

ayounsi claimed this task.

Fix merged.