cloud-init timeout too short on Bookworm
Closed, Resolved · Public

Description

Here's a discussion (from a decade ago) about how the cloud-init systemd timeout should be unlimited:

https://github.com/sdake/heat-jeos/issues/1

That seems to have been implemented, as in Bullseye the timeouts look like this:

root@util-abogott-bullseye:~# systemctl show cloud-init.service | grep Timeout
TimeoutStartUSec=infinity
TimeoutStopUSec=infinity
TimeoutAbortUSec=infinity
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
TimeoutCleanUSec=infinity
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none

On recent Bookworm builds, however, we have:

root@k8s-dev-bastion:~# systemctl show cloud-init.service | grep Timeout
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
TimeoutCleanUSec=infinity
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none

That's kind of a disaster. On a good day we can complete our initial puppet run in 90 seconds, but as time passes and latest package versions drift, things get slower and exceed that timeout.

We need to do some detective work and figure out why that timeout was added and see if we can have it switched back. Worst case we can maybe hack in our own timeout during base image build but that'll be messy.
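For reference, the "hack in our own timeout" fallback could look roughly like the sketch below: a systemd drop-in that sets TimeoutStartSec=infinity for the unit. The function name write_timeout_dropin, the file name timeout.conf, and the optional root-directory argument are all made up for illustration; this is not what the base image build actually does.

```shell
#!/bin/sh
# Sketch only: write a systemd drop-in that lifts the start timeout for a unit.
# $1 = unit name (e.g. cloud-init.service)
# $2 = optional root directory to write under (hypothetical parameter, so the
#      function can be tried against a scratch directory instead of /).
write_timeout_dropin() {
    unit="$1"
    root="${2:-}"
    dir="${root}/etc/systemd/system/${unit}.d"
    mkdir -p "$dir" || return 1
    # TimeoutStartSec=infinity disables the start timeout for this unit only.
    printf '[Service]\nTimeoutStartSec=infinity\n' > "${dir}/timeout.conf"
}
```

On a live system this would be called as write_timeout_dropin cloud-init.service and followed by systemctl daemon-reload; TimeoutStopSec could be overridden the same way if needed.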

Event Timeline

Apparently the Bookworm package no longer ships systemd unit files, so systemd is falling back to the SysV init scripts:

Buster
lucaswerkmeister@tools-sgebastion-10:~$ dpkg -L cloud-init | grep -e '^/etc/init.d/' -e '\.service$'
/etc/init.d/cloud-config
/etc/init.d/cloud-final
/etc/init.d/cloud-init
/etc/init.d/cloud-init-local
/lib/systemd/system/cloud-config.service
/lib/systemd/system/cloud-final.service
/lib/systemd/system/cloud-init-local.service
/lib/systemd/system/cloud-init.service
Bookworm
lucaswerkmeister@tools-bastion-12:~$ dpkg -L cloud-init | grep -e '^/etc/init.d/' -e '\.service$'
/etc/init.d/cloud-config
/etc/init.d/cloud-final
/etc/init.d/cloud-init
/etc/init.d/cloud-init-local

(This also manifests itself in some strange systemd output – for instance, systemctl status cloud-init reports “Loaded: not-found” and systemctl cat cloud-init says “No files found for cloud-init.service.”)

Given that init scripts can’t declare timeouts (in fact, systemd-sysv-generator(8) is wholly deprecated), it falls back to the default timeouts – you’ll actually get the same output with systemctl show nonexistent.service | grep Timeout.
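The comparison can be scripted if anyone wants to check a batch of hosts. A minimal sketch, assuming the manager-wide default start timeout has not been changed from the stock 90 seconds; classify_start_timeout is a hypothetical helper name, and it only parses the systemctl show output piped into it:

```shell
#!/bin/sh
# Reads `systemctl show <unit>` output on stdin and classifies the start timeout:
#   default  -> the stock 1min 30s fallback (assumed unchanged on this host)
#   infinity -> the unit explicitly disables the timeout
#   missing  -> no TimeoutStartUSec line was present in the input
# Any other value is printed as-is.
classify_start_timeout() {
    value=$(sed -n 's/^TimeoutStartUSec=//p' | head -n 1)
    case "$value" in
        '1min 30s') echo default ;;
        infinity)   echo infinity ;;
        '')         echo missing ;;
        *)          echo "$value" ;;
    esac
}
```

Usage would be systemctl show cloud-init.service | classify_start_timeout; a result of "default" on a host that should have the packaged unit suggests the unit file is not the one being loaded.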

Why is the Bookworm package missing the systemd unit files?

dpkg -L only shows files from installed packages, and for some reason cloud-init is showing as not installed on Bookworm instances:

taavi@tools-bastion-12:~ $ apt-cache policy cloud-init
cloud-init:
  Installed: (none)
taavi@tools-sgebastion-10:~ $ apt-cache policy cloud-init
cloud-init:
  Installed: 20.2-2~deb10u2

So something is uninstalling the cloud-init packages on new Bookworm instances at some point?

> Apparently the Bookworm package no longer ships systemd unit files, so systemd is falling back to the SysV init scripts:

The unit files do show up here: https://packages.debian.org/bookworm/all/cloud-init/filelist

So, as taavi says, the package just isn't installed on those instances; the unit files are still shipped in it.

I started with a VM from a raw Debian image, where cloud-init is installed and configured:

debian@nopuppetbookworm:~$ systemctl show cloud-init.service | grep Timeout
TimeoutStartUSec=infinity
TimeoutStopUSec=infinity
TimeoutAbortUSec=infinity
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
TimeoutCleanUSec=infinity 
JobTimeoutUSec=infinity 
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none

Then I installed puppet and did a standard puppet run. Afterwards:

root@nopuppetbookworm:~# systemctl show cloud-init.service | grep Timeout
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
TimeoutCleanUSec=infinity
JobTimeoutUSec=infinity
JobRunningTimeoutUSec=infinity
JobTimeoutAction=none

So puppet is breaking it. I hoped to see it being explicitly removed in the puppet log, but no such luck. It seems like puppet is doing some behind-the-scenes action like 'now remove every package that isn't explicitly in the catalog'. If that were true, I should be able to install cloud-init and then see it get removed again, right? Well...

root@nopuppetbookworm:~# run-puppet-agent 
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for nopuppetbookworm.testlabs.eqiad1.wikimedia.cloud
Info: Applying configuration version '(b3ca48bb86) Andrea Denisse Gómez-Martínez - ssl: Remove performance.discovery.wmnet.crt certificate'
Notice: /Stage[main]/Base::Standard_packages/Package[eject]/ensure: removed (corrective)
Notice: Applied catalog in 7.09 seconds
root@nopuppetbookworm:~# dpkg --list | grep cloud-init
rc  cloud-init                           22.4.2-1                             all          initialization system for infrastructure cloud instances
ii  cloud-initramfs-growroot             0.18.debian13                        all          automatically resize the root partition on first boot

So puppet is removing eject, and since cloud-init depends on it, cloud-init gets removed as well. That is not how I expect dependencies to work!
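A note on the "rc" in that dpkg output: it means the package was removed but its configuration files were left behind. That would fit the earlier dpkg -L result, since init scripts under /etc are normally conffiles and survive a plain remove, while the unit files under /lib do not. A small sketch for decoding the two states seen here; describe_dpkg_state is a hypothetical helper and only handles ii and rc explicitly:

```shell
#!/bin/sh
# Reads `dpkg --list` package lines on stdin and prints "<package> <state>".
# Only the two states that appear in this task are decoded; anything else is
# passed through as other(<flags>).
describe_dpkg_state() {
    while read -r flags pkg _rest; do
        case "$flags" in
            ii) echo "$pkg installed" ;;
            rc) echo "$pkg removed-config-files-remain" ;;
            *)  echo "$pkg other($flags)" ;;
        esac
    done
}
```

Usage would be dpkg --list | grep cloud-init | describe_dpkg_state.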

But, in any case, the offending puppet code is this:

# The hardware emulated by our Ganeti machine type includes a "CDROM".
# If d-i detects such a drive, it installs eject on the installed system
# (used by functionality which ejects the CDROM if installing from optical
# media). We don't need this, so uninstall it via Puppet.
if $facts['is_virtual'] {
    package {'eject': ensure => 'absent'}
}

Applied April 2nd, the day before yesterday!

@Moritz, as owner of 0a8d27fd16e724300cafe781581ee3bd5556e1a2 are you attached to removing 'eject'? Typically I would suggest just explicitly installing cloud-init in a cloud-specific manifest, but since removing eject seems to forcibly remove cloud-init I'm not sure that that will work.
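One way to spot this kind of collateral removal ahead of time is apt's simulate mode, which prints a "Remv" line for every package it would take out. A rough sketch; collateral_removals is a hypothetical helper that just filters that output, and whether puppet's package provider ends up doing the equivalent of apt-get remove here is an assumption on my part:

```shell
#!/bin/sh
# Reads `apt-get -s remove <target>` output on stdin and prints every package
# that would be removed other than the target itself.
# $1 = the package being removed on purpose.
collateral_removals() {
    target="$1"
    awk -v t="$target" '$1 == "Remv" && $2 != t { print $2 }'
}
```

Usage would be apt-get -s remove eject | collateral_removals eject; any output means the removal drags other packages along with it.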

Feel free to revert for now, I'll come up with a better fix in the next few days.

Change #1017150 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps instances: ensure cloud-init is forever installed

https://gerrit.wikimedia.org/r/1017150

Change #1017150 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps instances: ensure cloud-init is forever installed

https://gerrit.wikimedia.org/r/1017150

Change #1017155 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "Uninstall eject on VMs"

https://gerrit.wikimedia.org/r/1017155

Change #1017155 merged by Andrew Bogott:

[operations/puppet@production] Revert "Uninstall eject on VMs"

https://gerrit.wikimedia.org/r/1017155