ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify")
Closed, Resolved, Public

Description

I was attempting to create a brand new VM for a brand new service using the makevm cookbook:

$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 2 --disk 20 --cluster eqiad --group A -t T356710 --os bookworm -p 7 ncmonitor1001

The imaging portion is failing to start the first boot:

Set boot media to network for VM ncmonitor1001.eqiad.wmnet in cluster eqiad
Forced PXE for next reboot
Shutting down VM ncmonitor1001.eqiad.wmnet in cluster eqiad
----- OUTPUT of 'gnt-instance shu...1001.eqiad.wmnet' -----
Waiting for job 2397616 for ncmonitor1001.eqiad.wmnet ...
Mon Feb 12 23:15:53 2024  - WARNING: Ignoring offline instance check
================
PASS |██████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.10s/hosts]
FAIL |                                                              |   0% (0/1) [00:10<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance shu...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Starting VM ncmonitor1001.eqiad.wmnet in cluster eqiad
----- OUTPUT of 'gnt-instance sta...1001.eqiad.wmnet' -----
Waiting for job 2397617 for ncmonitor1001.eqiad.wmnet ...
================
PASS |██████████████████████████████████████████████████████| 100% (1/1) [00:05<00:00,  5.02s/hosts]
FAIL |                                                              |   0% (0/1) [00:05<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance sta...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Host rebooted via gnt-instance
[1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for ncmonitor1001.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[2/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for ncmonitor1001.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[3/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for ncmonitor1001.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[4/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for ncmonitor1001.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[5/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for ncmonitor1001.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
Found reboot since 2024-02-12 23:15:48.576663 for hosts ncmonitor1001.eqiad.wmnet
Host up (Debian installer)
Add puppet_version metadata to Debian installer
----- OUTPUT of 'gnt-instance mod...1001.eqiad.wmnet' -----
Modified instance ncmonitor1001.eqiad.wmnet
 - hv/boot_order -> disk
Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti; restarting from within the instance will not be enough.
Note that changing hypervisor parameters without performing a restart might lead to a crash while performing a live migration. This will be addressed in future Ganeti versions.
================
PASS |██████████████████████████████████████████████████████| 100% (1/1) [00:03<00:00,  3.12s/hosts]
FAIL |                                                              |   0% (0/1) [00:03<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance mod...1001.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Set boot media to disk for VM ncmonitor1001.eqiad.wmnet in cluster eqiad
Set boot media to disk
[1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for ncmonitor1001.eqiad.wmnet not found yet, keep polling for it: uptime 66.0 > threshold 4.49
[...]

I re-attempted with:

$ sudo cookbook sre.hosts.reimage -t T356710 --os bookworm ncmonitor1001 --new

But sadly the same issue occurs.

Event Timeline

One more data point: note that gnt-instance console FQDN is broken because of T309724 so we don't know the exact failure.

The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is completed.

Have you checked what was the status of the host during the installation?

I noticed that the partman configuration for the host was merged only today in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002674 . Is it possible d-i was just stuck there asking for the partition configuration?

> The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is completed.
>
> Have you checked what was the status of the host during the installation?

I guess the related question is: what other way is there to check that for a VM in that state when gnt-instance console is broken?

Thanks for the response, @Volans

> The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is completed.
>
> Have you checked what was the status of the host during the installation?
>
> I noticed that the partman configuration for the host was merged only today in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002674 . Is it possible d-i was just stuck there asking for the partition configuration?

I did merge that in only today and tried to run the reimage cookbook with the same results.

As @ssingh mentioned, it's hard to see the status of the installer what with gnt-instance console being nonfunctional. I did try to get some info with install-console but that brought me to a busybox prompt. Does that suggest it's not booting properly?

I think there is some confusion, let me clarify some things:

  1. BusyBox is the environment available during the Debian installer. That's totally normal. From a quick look at ps, the logs, and the files in it, it seems to me it's waiting for user input during partitioning.
  2. If you run sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet it's very easy to see the command that is actually run, and you can either run that from the master node or directly on the actual host where the instance is running and you get a working console. That confirmed that it's at the partitioning step waiting for user input (attaching to the console I might have gone one step forward/backward).
  3. I don't think the issue at hand has much to do with either the task title or the cookbook/spicerack automation. So far it seems just a normal d-i partman configuration issue with the complication of T309724.

Unless you feel the need to keep this task open for some specific reason I think it can be closed.

> 1. If you run sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet it's very easy to see the command that it's actually run and you can either run that from the master node or directly on the actual host where the instance is running and you get a working console.

While gnt-instance console --show-cmd does show a command I can't confirm the part that you get a working console with either method.

[ganeti1027:~] $ sudo gnt-instance console ncmonitor1001.eqiad.wmnet
..
Please contact your system administrator.
..
Offending DSA key in /var/lib/ganeti/known_hosts:2
[ganeti1027:~] $ ssh -oEscapeChar=none -oHashKnownHosts=no -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts -oUserKnownHostsFile=/dev/null -oCheckHostIp=no -oConnectTimeout=10 -oHostKeyAlias=ganeti01.svc.eqiad.wmnet -oBatchMode=yes -oStrictHostKeyChecking=yes -4 -t -t root@ganeti1035.eqiad.wmnet '/usr/lib/ganeti/tools/kvm-console-wrapper /usr/bin/socat ncmonitor1001.eqiad.wmnet /var/run/ganeti/kvm-hypervisor/ctrl/ncmonitor1001.eqiad.wmnet.monitor STDIO,raw,echo=0,escape=0x1d UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/ncmonitor1001.eqiad.wmnet.serial'
...
...
RSA host key for ganeti01.svc.eqiad.wmnet has changed and you have requested strict checking.
...

In T357449#9540286, @Volans wrote:
> just a normal d-i partman configuration issue

The code change from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1002674/3/modules/profile/data/profile/installserver/preseed.yaml still looks just fine to me, nothing special with that?

And it was merged a couple hours before 2 attempts of wmf-reimage by 2 separate people, afaict.

So it's still a mystery to me why it would hang at partman config unless anyone sees a typo there.

@Dzahn you can get a working console either by setting the known hosts files to /dev/null and the strict checking to no in the ssh command and running it from the master node, or by running the actual command directly on the ganeti host where the VM is running (the part from /usr/lib/ganeti/tools/kvm-console-wrapper ...).

As for why it's failing that needs to be debugged by the service owner. Given that the first installation was without the partman recipe I don't know what happened there, it's possible that maybe it left the disks in a state that partman is not able to recover from or something else. But when I got the console it was indeed at the partition steps.
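The first option described above (relaxing host key checking in the ssh command) could be sketched as a small filter over the command line that gnt-instance console --show-cmd prints. This is only a sketch: the function name relax_hostkey_opts is hypothetical, the sed patterns assume the exact option spellings pasted earlier in this task, and it assumes --show-cmd writes the command to standard output. It turns off host key verification, so it is meant as a temporary debugging aid while T309724 is open.

```shell
# Hypothetical helper: rewrite the ssh command printed by
# `gnt-instance console --show-cmd` so that a stale entry in
# /var/lib/ganeti/known_hosts no longer blocks the connection.
# Assumes the option spellings seen earlier in this task.
relax_hostkey_opts() {
    sed -e 's#-oGlobalKnownHostsFile=[^ ]*#-oGlobalKnownHostsFile=/dev/null#' \
        -e 's#-oStrictHostKeyChecking=yes#-oStrictHostKeyChecking=no#'
}

# Usage on the master node (shown as a comment, not executed here):
#   sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet | relax_hostkey_opts
# Review the printed command, then run it by hand.
```

The filter only prints the modified command rather than executing it, so the operator can still inspect what will be run.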

@Volans Ah, yes, I can get a console when running sudo /usr/lib/ganeti/tools/kvm-console-wrapper /usr/bin/socat ncmonitor1001.eqiad.wmnet /var/run/ganeti/kvm-hypervisor/ctrl/ncmonitor1001.eqiad.wmnet.monitor STDIO,raw,echo=0,escape=0x1d UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/ncmonitor1001.eqiad.wmnet.serial on ganeti1035 directly, without the ssh part.

That works, thanks for clarifying.
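For reference, the "without the ssh part" approach amounts to taking only the single-quoted remote command from the --show-cmd output and running it with sudo on the node that hosts the VM. A minimal sketch, assuming the command is printed on one line in the format pasted earlier in this task; the function name extract_remote_cmd is hypothetical:

```shell
# Hypothetical helper: print only the single-quoted remote command from the
# ssh line emitted by `gnt-instance console --show-cmd`.
# Lines without a trailing single-quoted argument produce no output.
extract_remote_cmd() {
    sed -n "s/^[^']*'\(.*\)'[[:space:]]*\$/\1/p"
}

# Usage (shown as a comment, not executed here):
#   sudo gnt-instance console --show-cmd ncmonitor1001.eqiad.wmnet | extract_remote_cmd
# Then run the printed command with sudo on the node hosting the VM
# (ganeti1035 in this task).
```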

@BCornwall Upon getting the console I hit "Yes" to confirm it is allowed to write the partition changes. Now the system is installing. Currently installing the base system on ncmonitor1001.

@BCornwall .. but then after installing the base system it fails at installing grub in /dev/sda.. which is not expected.

I did another wmf-reimage cookbook run on this host and the installation finished, including the grub install. I can't explain why it wouldn't work for Brett and Sukhbir earlier (after the partman change was already merged) but it worked now.

After the OS install was done I could see in the console that the Ganeti instance was shut down, and it is now sitting at the login prompt while the puppet part is starting. Whether that works on the first run we will see, but that is another story.

The install issues seem solved to me.

Dzahn renamed this task from Ganeti VM fails to reboot after "gnt-instance modify" to ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify"). Feb 14 2024, 12:03 AM
Dzahn changed the task status from Open to In Progress.

The host should be usable now:

[ncmonitor1001:~] $ uptime
 00:15:21 up 1 min,  1 user,  load average: 0.15, 0.04, 0.01
[ncmonitor1001:~] $ lsb_release -c
No LSB modules are available.
Codename:	bookworm
BCornwall claimed this task.

I'm still not sure where the problem lies and am concerned that this issue would rear its head again should a reimage be attempted. But as @Volans seems convinced this isn't an issue with cookbooks et al, I'll just close this.