
Debian installer waits for input for network config during host reimage
Closed, Resolved · Public · BUG REPORT

Description

When reimaging hosts for T351074, the installer waits for input for network configuration. Because the installer is stuck and does not finish, the cookbook eventually times out and fails.

I changed the config for the affected hosts in this CR just before the reimage, so a mistake there could be the cause, but I don't see how.

Hosts known to be affected: mw138[68].eqiad.wmnet,mw139[0246].eqiad.wmnet,mw1408.eqiad.wmnet,mw2317.codfw.wmnet
(I did not try with the rest of the codfw hosts in my CR.)

Cookbook commands I ran:
sudo cookbook sre.hosts.reimage -t T351074 --os bullseye mw1386 (or with --new when retrying)

Current state (as of filing this task, 2024/02/05 21:41UTC): I re-ran and then ^C'd the cookbook on mw1386, so that one is no longer in the original state (and there's a stale cookbook lock). The rest of the hosts listed above are in the original state.

Installer screenshot and contents of log tab:

installer.png (715×1 px, 76 KB)

[           1- installer  2 shell  3 shell   (4*log)           ][ Feb 05 20:43 ]
M5720 2-port Gigabit Ethernet PCIe                                              
Feb  5 18:26:45 debconf: --> INPUT medium netcfg/use_autoconfig                 
Feb  5 18:26:45 debconf: <-- 30 question skipped                                
Feb  5 18:26:45 debconf: --> GO                                                 
Feb  5 18:26:45 debconf: <-- 0 ok                                               
Feb  5 18:26:45 debconf: --> GET netcfg/use_autoconfig                          
Feb  5 18:26:45 debconf: <-- 0 false                                            
Feb  5 18:26:45 debconf: --> METAGET netcfg/internal-none description           
Feb  5 18:26:45 debconf: <-- 0 <none>                                           
Feb  5 18:26:45 debconf: --> INPUT critical netcfg/get_ipaddress                
Feb  5 18:26:45 debconf: <-- 30 question skipped                                
Feb  5 18:26:45 debconf: --> GO                                                 
Feb  5 18:26:45 debconf: <-- 0 ok                                               
Feb  5 18:26:45 debconf: --> GET netcfg/get_ipaddress                           
Feb  5 18:26:45 debconf: <-- 0 10.64.0.192                                      
Feb  5 18:26:45 debconf: --> GET netcfg/get_netmask                             
Feb  5 18:26:45 debconf: <-- 0                                                  
Feb  5 18:26:45 debconf: --> SET netcfg/get_netmask 255.255.255.0               
Feb  5 18:26:45 debconf: <-- 0 value set                                        
Feb  5 18:26:45 debconf: --> INPUT critical netcfg/get_netmask                  
Feb  5 18:26:45 debconf: <-- 0 question will be asked                           
Feb  5 18:26:45 debconf: --> GO
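
To make the stuck state in a log like the one above easier to spot, here is a minimal sketch that scans debconf protocol lines and reports which questions are waiting on a GO that was never answered. The function name pending_questions and the regular expression are illustrative assumptions, not part of the reimage cookbook or of debconf; the sample text is copied from the log above.

```python
import re

# Matches debconf protocol lines in the format shown in the installer log.
LINE_RE = re.compile(r"debconf: (-->|<--) (.*\S)\s*$")


def pending_questions(log_text):
    """Return questions queued for display if the log ends on an unanswered GO.

    Hypothetical helper: it only understands INPUT and GO, which is enough
    for the log excerpt in this task.
    """
    pending = []          # questions debconf said "will be asked"
    last_cmd = None       # most recent command sent to debconf
    last_input = None     # question named by the most recent INPUT
    awaiting_go = False   # True while a GO has no reply yet
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        direction, payload = match.groups()
        if direction == "-->":
            parts = payload.split()
            last_cmd = parts[0]
            is_input = last_cmd == "INPUT" and len(parts) >= 3
            last_input = parts[2] if is_input else None
            awaiting_go = last_cmd == "GO"
        elif last_cmd == "INPUT" and last_input and "will be asked" in payload:
            pending.append(last_input)
        elif last_cmd == "GO":
            # GO returned, so the queued questions were answered.
            pending.clear()
            awaiting_go = False
    return pending if awaiting_go else []


SAMPLE = """\
Feb  5 18:26:45 debconf: --> INPUT critical netcfg/get_ipaddress
Feb  5 18:26:45 debconf: <-- 30 question skipped
Feb  5 18:26:45 debconf: --> GO
Feb  5 18:26:45 debconf: <-- 0 ok
Feb  5 18:26:45 debconf: --> INPUT critical netcfg/get_netmask
Feb  5 18:26:45 debconf: <-- 0 question will be asked
Feb  5 18:26:45 debconf: --> GO
"""

print(pending_questions(SAMPLE))
```

For the excerpt above this points at netcfg/get_netmask, which matches what the installer screen was showing.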

Event Timeline

Change 997804 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] installserver: fix typo in preseed

https://gerrit.wikimedia.org/r/997804

Change 997804 merged by Volans:

[operations/puppet@production] installserver: fix typo in preseed

https://gerrit.wikimedia.org/r/997804

Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1002 for host sretest1001.eqiad.wmnet with OS bullseye

Volans triaged this task as High priority.

OK, the sretest1001 reimage is going through; I'll leave the task open until it finishes. The issue was a typo (a doubled ||) in the partman hiera configuration.

@brouberol FYI, in case you were working on any form of additional sanity checks for that data.

mw1386 seems to be going fine now, so yes, we can close this. Sorry and thanks for finding the cause @Volans <3

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1002 for host sretest1001.eqiad.wmnet with OS bullseye completed:

  • sretest1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402061221_volans_2418035_sretest1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

For posterity, I'd also like to mention how misleading the error message was: the debian-installer UI looked like it was failing to get the proper netmask in the network configuration, and indeed on one of the failing hosts I found this in the ip address output:

inet 10.64.32.125/22 scope global eno1

instead of a normal:

inet 10.64.32.125/22 brd 10.64.35.255 scope global eno1
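
As a quick cross-check of the brd value in the normal line above, the broadcast address can be derived from the address and prefix length with Python's standard ipaddress module. This is only a sketch for verifying the numbers; the helper name expected_broadcast is made up for illustration.

```python
import ipaddress


def expected_broadcast(cidr):
    """Return the broadcast address for an interface address in CIDR form."""
    iface = ipaddress.ip_interface(cidr)
    return str(iface.network.broadcast_address)


iface = ipaddress.ip_interface("10.64.32.125/22")
print(iface.network)                          # 10.64.32.0/22
print(iface.netmask)                          # 255.255.252.0
print(expected_broadcast("10.64.32.125/22"))  # 10.64.35.255
```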

Thanks for flagging, @Volans. I'm not working on any additional sanity checks at the moment, but we can probably have a look at how we could have detected this in CI.
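
One possible shape for such a CI check, as a rough sketch only: assuming the partman data can be loaded as a mapping of keys to |-separated strings (an assumption; the real hiera layout is not shown in this task), a doubled || shows up as an empty field after splitting. The function names and the sample entries below are hypothetical.

```python
def empty_alternatives(pattern, sep="|"):
    """Return 0-based positions of empty fields in a sep-separated string."""
    fields = pattern.split(sep)
    return [i for i, field in enumerate(fields) if not field.strip()]


def check_entries(entries):
    """Map each offending key to the positions of its empty fields."""
    problems = {}
    for key, pattern in entries.items():
        bad = empty_alternatives(pattern)
        if bad:
            problems[key] = bad
    return problems


# Hypothetical data, NOT the real hiera content: one clean entry and one
# containing a doubled separator.
SAMPLE_ENTRIES = {
    "recipe-a": "mw1386|mw1388",
    "recipe-b": "mw1390||mw1392",
}

print(check_entries(SAMPLE_ENTRIES))
```

A CI job could fail whenever check_entries returns a non-empty result. The same split also flags a leading or trailing separator, which may or may not be desirable depending on the real format.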