Better detection for "reboot into PXE failed" conditions in wmf-auto-reimage
Open, Needs Triage · Public

Description

Today I ran into an issue where a host I was reimaging with wmf-auto-reimage-host failed to reboot into PXE. The cause wasn't immediately obvious to me, because the user-facing output was only this:

13:24:07 | ms-be2057.codfw.wmnet | Still waiting for reboot after 20.0 minutes
13:29:10 | ms-be2057.codfw.wmnet | Still waiting for reboot after 25.0 minutes
13:34:12 | ms-be2057.codfw.wmnet | Still waiting for reboot after 30.0 minutes
13:39:18 | ms-be2057.codfw.wmnet | Still waiting for reboot after 35.0 minutes

Whereas cumin's log file showed that cat /proc/uptime had failed:

PASS |          |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████| 100% (1/1) [00:00<00:00, 17.81hosts/s]
100.0% (1/1) of nodes failed to execute command 'cat /proc/uptime': ms-be2057.codfw.wmnet
100.0% (1/1) of nodes failed to execute command 'cat /proc/uptime': ms-be2057.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.

Running cumin interactively against the host worked, though. The difference is that wmf-auto-reimage executes cumin with a debian-installer specific key, whereas the interactive run does not.
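
One possible shape for the better detection, as a rough sketch only: if the check fails with the debian-installer key but the same host answers with the regular key, the script could report a likely PXE failure instead of continuing to wait. Everything named below (`RebootState`, `classify_reboot`, and the two boolean inputs) is hypothetical and not taken from wmf-auto-reimage; how the two probes are actually run through cumin is deliberately left out.

```python
from enum import Enum


class RebootState(Enum):
    """Hypothetical outcomes of probing a host after the PXE reboot."""

    IN_INSTALLER = "host answered with the debian-installer key"
    BACK_IN_OLD_OS = "host answered only with the regular key; PXE boot likely failed"
    UNREACHABLE = "host answered with neither key; still rebooting or down"


def classify_reboot(installer_key_ok: bool, regular_key_ok: bool) -> RebootState:
    """Classify the host state from two probe results.

    Both arguments are assumptions of this sketch: each is True when a
    trivial command (for example 'cat /proc/uptime') succeeded using the
    corresponding key.
    """
    if installer_key_ok:
        # The installer environment is up, so the PXE boot worked.
        return RebootState.IN_INSTALLER
    if regular_key_ok:
        # The previously installed OS is answering, so the host most
        # likely never booted into the installer.
        return RebootState.BACK_IN_OLD_OS
    # No answer with either key: nothing can be concluded yet.
    return RebootState.UNREACHABLE


if __name__ == "__main__":
    # The situation described in this task: the installer key fails,
    # while an interactive run with the regular key succeeds.
    state = classify_reboot(installer_key_ok=False, regular_key_ok=True)
    print(f"ms-be2057.codfw.wmnet | {state.value}")
```

With such a check, the "Still waiting for reboot" loop could stop early and print the diagnosis when the host is back in its old OS, rather than polling until the timeout.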

Event Timeline

Restricted Application added a subscriber: Aklapper. · Thu, Sep 3, 1:58 PM