
Better detection for "reboot into PXE failed" conditions in wmf-auto-reimage
Open, Medium, Public

Description

Today I ran into an issue where a host I was reimaging with wmf-auto-reimage-host failed to reboot into PXE. The cause wasn't immediately clear to me, because the user-facing output was only this:

13:24:07 | ms-be2057.codfw.wmnet | Still waiting for reboot after 20.0 minutes
13:29:10 | ms-be2057.codfw.wmnet | Still waiting for reboot after 25.0 minutes
13:34:12 | ms-be2057.codfw.wmnet | Still waiting for reboot after 30.0 minutes
13:39:18 | ms-be2057.codfw.wmnet | Still waiting for reboot after 35.0 minutes

Cumin's log file, on the other hand, showed that cat /proc/uptime had failed:

PASS |          |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████| 100% (1/1) [00:00<00:00, 17.81hosts/s]
100.0% (1/1) of nodes failed to execute command 'cat /proc/uptime': ms-be2057.codfw.wmnet
100.0% (1/1) of nodes failed to execute command 'cat /proc/uptime': ms-be2057.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.

Running cumin interactively worked, though. The difference is that the reimage script executes cumin with a debian-installer-specific key.

Event Timeline

Volans triaged this task as Medium priority. Oct 11 2021, 10:54 AM
Volans subscribed.

The reimage scripts have been converted to the sre.hosts.reimage cookbook. While this issue could still happen, the timeout after a reboot is now set to 20 minutes, so the failure is caught earlier.
What I can do is make the cookbook stop and ask the user for input, instead of failing, when the wait for a reboot reaches the timeout. Thoughts?
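
To make the proposal concrete, here is a minimal sketch of "wait for the reboot, and on timeout ask the user instead of failing". It is not the code of the sre.hosts.reimage cookbook and does not use its API: the function name wait_for_reboot, its parameters, and the 'retry'/'abort' answers are all hypothetical and chosen only for illustration. The check callable stands in for whatever the cookbook uses to detect the reboot (for example the cat /proc/uptime probe).

```python
import time
from typing import Callable


def wait_for_reboot(
    check: Callable[[], bool],
    ask: Callable[[str], str] = input,
    timeout: float = 20 * 60,
    interval: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it succeeds; on timeout, ask the user what to do.

    Hypothetical helper, not part of any existing cookbook.
    Returns True once check() reports the reboot, False if the user aborts.
    """
    prompt = (
        f"No reboot detected after {timeout / 60:.1f} minutes. "
        "Type 'retry' to keep waiting or 'abort' to stop: "
    )
    while True:
        # Poll until the deadline for this round expires.
        deadline = clock() + timeout
        while clock() < deadline:
            if check():
                return True
            sleep(interval)

        # Timeout reached: hand control to the user instead of failing.
        while True:
            answer = ask(prompt).strip().lower()
            if answer in ("retry", "abort"):
                break
        if answer == "abort":
            return False
        # On 'retry' the outer loop starts a fresh waiting round.
```

The pause gives the operator a chance to look at the console and fix the PXE boot by hand before letting the run continue. The ask, clock and sleep parameters are injectable only so that the behaviour can be exercised without waiting in real time.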