These three hosts fail to PXE boot. @ayounsi has already applied the option-82 hack on the switch, which should be what is needed to fix DHCP.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Restricted Task | | | |
| Resolved | | Andrew | T305828 upgrade cloud-vps openstack to Openstack version 'Yoga' |
| Resolved | | Andrew | T296561 upgrade cloud-vps openstack to Openstack version 'Xena' |
| Resolved | | rook | T281275 upgrade cloud-vps openstack to Openstack version 'Wallaby' |
| Resolved | | Andrew | T281276 Upgrade cloud-vps openstack hosts to Debian 'Bullseye' |
| Resolved | | Andrew | T304581 Upgrade cloudvirt-wdqs servers to Debian Bullseye |
| Resolved | | ayounsi | T305368 PXE boot failures on cloudvirt-wdqs100[1-3] |
Event Timeline
Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye executed with errors:
- cloudvirt-wdqs1001 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt-wdqs1001.eqiad.wmnet with OS bullseye completed:
- cloudvirt-wdqs1001 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204041544_pt1979_2932364_cloudvirt-wdqs1001.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
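For reference, here is a minimal sketch that compares the step lists from the two reimage runs above to show where the failed run stopped. The helper `first_missing_step` is hypothetical and is not part of the reimage cookbook; the step strings are copied from the cookbook summaries in this task, and the successful run is truncated to its first few steps.

```python
# Steps logged by the failed run (copied from the first cookbook summary).
failed_run = [
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "The reimage failed, see the cookbook logs for the details",
]

# First steps logged by the later, successful run (truncated for brevity).
successful_run = [
    "Removed from Puppet and PuppetDB if present",
    "Deleted any existing Puppet certificate",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Host up (new fresh bullseye OS)",
]


def first_missing_step(failed, succeeded):
    """Return the first step of the successful run that the failed run never logged."""
    seen = set(failed)
    for step in succeeded:
        if step not in seen:
            return step
    return None


if __name__ == "__main__":
    print(first_missing_step(failed_run, successful_run))
```

The failed run never reached the Debian installer after the IPMI reboot, which is consistent with a PXE/DHCP failure rather than a problem later in the install.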
The bug was introduced with this change:
https://gerrit.wikimedia.org/r/c/operations/homer/public/+/775279/
The following one should fix it:
https://gerrit.wikimedia.org/r/c/operations/homer/public/+/776973/
I pushed it manually to the router for testing and re-imaged cloudvirt-wdqs1001, which worked.
