During the reimage of the new cp hosts in eqiad (see T342159 and T349244) we noticed an odd behavior: on the first 3 hosts (cp110[0-2]) the reimage cookbook (e.g. sudo -i cookbook sre.hosts.reimage --os bullseye --new cp1100 -t T349244) waits until it times out after the first reboot:
[snip]
2023-10-31 16:32:04,251 [INFO] dhcp config test passed!
2023-10-31 16:32:06,333 [INFO] reloaded isc-dhcp-server
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Released lock for key /spicerack/locks/modules/spicerack.dhcp.DHCP:eqiad: {'concurrency': 1, 'created': '2023-10-31 16:32:02.959747', 'owner': 'fabfur@cumin1001 [94491]', 'ttl': 120}
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis bootparam set bootflag force_pxe options=reset
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Forced PXE for next reboot
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis power status
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis power cycle
Host rebooted via IPMI
[1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cp1104.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[2/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cp1104.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[snip]
Meanwhile, the host's console is stuck at:
Booting from BRCM MBA Slot 4B00 v218.0.219.1
Broadcom UNDI PXE-2.1 v218.0.219.1
Copyright (C) 2000-2020 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
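For reference, the retry counter in the cookbook log above ([1/240, retrying in 10.00s]) suggests the cookbook polls for roughly 40 minutes before giving up. Below is a minimal sketch of that kind of fixed-interval polling loop. It is not spicerack's actual implementation: the names wait_for_reboot, RebootNotFound and max_sleep_seconds are hypothetical, and only the 240 attempts and the 10 s delay are taken from the log.

```python
import time
from typing import Callable


class RebootNotFound(Exception):
    """Hypothetical error: the host has not come back from the reboot yet."""


def wait_for_reboot(
    check: Callable[[], None],
    tries: int = 240,      # attempt count seen in the log ([1/240, ...])
    delay: float = 10.0,   # delay seen in the log (retrying in 10.00s)
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it stops raising RebootNotFound.

    Returns the number of the attempt that succeeded and re-raises the
    last error once all attempts are exhausted.
    """
    for attempt in range(1, tries + 1):
        try:
            check()
            return attempt
        except RebootNotFound as exc:
            if attempt == tries:
                raise  # out of attempts: this is the "timeout"
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] {exc}")
            sleep(delay)
    raise ValueError("tries must be >= 1")


def max_sleep_seconds(tries: int = 240, delay: float = 10.0) -> float:
    """Total sleep time of the loop above when every attempt fails."""
    return (tries - 1) * delay
```

With the values from the log, max_sleep_seconds() gives 2390 seconds of sleep, i.e. just under 40 minutes plus the time taken by each check, which is how long a host stuck at the PXE prompt keeps the cookbook waiting in this sketch.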
If the cookbook is manually stopped and relaunched, the installation seems to run fine to the end.
This suggests the issue could be worked around with a "manual" reboot after this first step.
However, we're seeing a different behavior on cp1103-1104 (the hosts we're reinstalling right now): on those hosts, even with the "double reboot", the cookbook doesn't proceed with the installation as it did on the previous ones.
If you need some help on our side to debug this behavior don't hesitate to contact us!