
Cloud init and unattended upgrades while bootstrapping Trixie VMs
Closed, Resolved, Public

Description

Hey folks!

I've created elky-kfk-pgrd-kafka-test-03.kafka-infrastructure.eqiad1.wikimedia.cloud via Pontoon and ran into this problem:

Notice: Requesting catalog from elky-kfk-pgrd-puppet-01.kafka-infrastructure.eqiad1.wikimedia.cloud:8140 (172.16.20.168)
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Undefined variable '::hostname' (file: /etc/puppet/hiera.yaml, line: 4) on node elky-kfk-pgrd-kafka-test-03.kafka-infrastructure.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
elukey@elky-kfk-pgrd-kafka-test-03:~$ dpkg -l puppet-agent
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-======================================
ii  puppet-agent   8.10.0-5     all          configuration management system, agent

After a chat with Filippo, the issue seems to be related to a race condition between cloud-init and unattended upgrades:

2026-04-07T10:35:27.044388+00:00 elky-kfk-pgrd-kafka-test-03 cloud-init[627]: #033[1;31mError: /Stage[main]/Puppet::Agent/Package[puppet]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install puppet' returned 100: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 5015 (unattended-upgr)

The problem is fixed simply by installing puppet-agent, which downgrades the package.

Event Timeline

Change #1269056 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova vendordata: disable unattended upgrades in base image

https://gerrit.wikimedia.org/r/1269056

Andrew triaged this task as Medium priority. Apr 8 2026, 8:39 PM
Andrew subscribed.

Do you have any theory (you being @elukey and @fgiunchedi) about why that happened on this exact instance? I just checked and we have around 100 running Trixie VMs so presumably cloud-init works properly most of the time.

If you think the answer is 'we just got unlucky this once' then the attached patch will probably fix things in future base images, although we'll need to test a bit.

This is not an unattended-upgrades problem. It instead seems to be a problem with how the Puppet agent packages are installed - Trixie by default includes Puppet 8, but our codebase is not yet fully compatible with Puppet 8 (and particularly its removal of deprecated facts), so we need to pull in the Puppet 7 agent packages from a component on apt.wm.o.

The cloud-init file is clearly trying to do that (see line 299), but at the same time it doesn't seem to even define that component in the apt config so that'll clearly never work. The first Puppet run will fix it and downgrade the host to Puppet 7, but that is failing here as the host has a catalog that won't even compile with a Puppet 8 agent.

(I believe I've hit this a couple of times before, but mostly just used those as opportunities to get rid of said legacy facts as we need to get it done sooner or later anyway.)
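For anyone hunting down the remaining legacy facts, a minimal sketch of a search helper follows. The contents of the affected hiera.yaml are not shown in this task, so the pattern is an assumption: it only matches the top-scope `%{::name}` interpolation form, and only for three legacy facts (hostname, fqdn, domain), whose structured equivalents live under `facts.networking.*`. The function name `find_legacy_facts` is hypothetical.

```shell
#!/bin/sh
# Print (with line numbers) every line of a file that interpolates one of a
# few top-scope legacy facts, e.g. %{::hostname}.
# The fact list is deliberately short and NOT exhaustive (assumption).
# Under Puppet 8 these would be written as %{facts.networking.hostname},
# %{facts.networking.fqdn} and %{facts.networking.domain}.
find_legacy_facts() {
    grep -nE '%[{]::(hostname|fqdn|domain)[}]' "$1"
}
```

Running `find_legacy_facts /etc/puppet/hiera.yaml` on the puppet server would list candidate lines to convert; grep exits non-zero when nothing matches.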

Indeed, under normal circumstances cloud-init will try to bring puppet back to 7 after the first puppet run (from modules/openstack/templates/nova/vendordata.txt.erb):

# The following will be a no-op on most distros, but
#  as of 08-2025 Trixie comes packaged with puppet 8
#  and the puppet manifests themselves seek to downgrade
#  to 7.
apt install -y --allow-downgrades puppet-agent

And that's what failed in this case, from /var/log/cloud-init-output.log:

+ apt install -y --allow-downgrades puppet-agent

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 5015 (unattended-upgr)...
Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 5015 (unattended-upgr)...
Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 5015 (unattended-upgr)...
Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 5015 (unattended-upgr)...

It seems that this trixie image comes with puppet agent 8 already installed. I think we should make sure images have puppet 7 installed, since this is what we expect and what is supported, and let users opt in to puppet 8 as needed. I generally agree that fixing puppet to be 8-compatible is something we need to do though!

Alternatively we can ask unattended-upgrades to not do anything until cloud-init has finished, though I'd rather avoid fixing up the fix-up.
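For comparison, a different way to reduce the lock contention (not the approach proposed above, and not what was merged) is to have the install command itself wait longer for the dpkg lock through apt's `DPkg::Lock::Timeout` option. This is a minimal sketch: the function name `apt_install_waiting` and the `APT_GET` override are hypothetical, and the timeout value is arbitrary.

```shell
#!/bin/sh
# Install packages, waiting up to N seconds for the dpkg frontend lock
# instead of failing immediately when another process (such as
# unattended-upgrades) holds it.
# Usage: apt_install_waiting <timeout-seconds> <package>...
# APT_GET is a hypothetical override hook so the command can be dry-run
# (for example APT_GET=echo); it defaults to the real apt-get.
apt_install_waiting() {
    timeout_secs="$1"
    shift
    "${APT_GET:-apt-get}" \
        -o DPkg::Lock::Timeout="$timeout_secs" \
        -q -y --allow-downgrades install "$@"
}
```

For example, `apt_install_waiting 600 puppet-agent` would wait up to ten minutes for the lock. This only papers over the race; it does not change which puppet version the image ships with.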

The base image is based on a trixie VM with our puppet classes already applied (that happens at build time). So shouldn't /that/ have already downgraded puppet in the base image?

> The base image is based on a trixie VM with our puppet classes already applied (that happens at build time). So shouldn't /that/ have already downgraded puppet in the base image?

My understanding is that cloud-init will wipe the existing sources lists and replace them with its own, so I suspect the cloud-init module managing the Puppet install and initial setup will pull in the updated Puppet 8 agent from the main Debian repository.

We are attempting to only get the puppet package from the wikimedia repo (this is set by cloud-init at creation time)

- content: |
    Package: *
    Pin: release o=wikimedia
    Pin-Priority: 1001
  path: /etc/apt/preferences.d/puppet.pref

> We are attempting to only get the puppet package from the wikimedia repo (this is set by cloud-init at creation time)

Yes, but something needs to add component/puppet7 to the Apt sources config on Trixie, otherwise that pin will not help in this case at all. (Also, while that pin file is named for puppet, it doesn't seem to be puppet-specific at all, since it matches on Package: *.)
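To illustrate the second point, here is a sketch of what a puppet-specific version of that pin could look like when written from a shell script. It is not the patch that was merged: the function name `write_puppet_pin` is hypothetical, the package list (`puppet`, `puppet-agent`) is an assumption based on the package names seen in this task, and the pin only helps once the component is present in the Apt sources.

```shell
#!/bin/sh
# Write an apt preferences file that pins only the puppet packages to the
# wikimedia origin, instead of matching every package with "Package: *".
# Usage: write_puppet_pin [preferences-dir]
# The directory defaults to /etc/apt/preferences.d; passing another path
# makes it easy to inspect the result without touching the system config.
write_puppet_pin() {
    prefs_dir="${1:-/etc/apt/preferences.d}"
    cat > "$prefs_dir/puppet.pref" <<'EOF'
Package: puppet puppet-agent
Pin: release o=wikimedia
Pin-Priority: 1001
EOF
}
```

A priority above 1000 lets apt downgrade to the pinned version, which is the behaviour wanted here when Trixie's own archive offers a newer puppet-agent.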

So here is what should be happening, and is configured to happen:

  1. Base image is fully puppetized as a Trixie host. By the time puppet is done with it, we should have the correct puppet 7 install on the base image

< new VM created >

  2. apt config persists from step 1, thanks to the cloud-init setting apt_preserve_sources_list: True
  3. apt is updated (package_update: true)
  4. all packages upgraded (package_upgrade: true) according to the apt config (see step 2)

< so either the puppet package should not be touched at all, or possibly a point upgrade >

  5. puppet does what puppet does

So... where is the breakdown happening? Is apt/preferences.d getting clobbered despite apt_preserve_sources_list: True ?

Oh, one other point that might not be obvious: I can't just write straight to preferences.d from cloud-init because that's distro-dependent whereas cloud-init config seeks to be distro-neutral. Cloud-init does run a plain old shell script at the end, though, where we can do whatever.

Change #1270975 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps vendordata: update apt preferences

https://gerrit.wikimedia.org/r/1270975

Change #1271000 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps vendordata:

https://gerrit.wikimedia.org/r/1271000

Change #1270975 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps vendordata: update apt preferences

https://gerrit.wikimedia.org/r/1270975

Change #1271000 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps vendordata: force puppet install during image creation

https://gerrit.wikimedia.org/r/1271000

Change #1271002 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps vendordata: typo fix in apt line

https://gerrit.wikimedia.org/r/1271002

Change #1271002 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps vendordata: typo fix in apt line

https://gerrit.wikimedia.org/r/1271002

Assumption #1 is incorrect: puppet does not actually downgrade puppet to version 7. So, our base image had puppet 8 already installed.

I have corrected that with the attached patches. NOW we will see if assumption #2 (apt config persists) is also true or if cloud-init just immediately slaps puppet 8 on top of that when a guest VM is booted.

I think we're good -- I don't see new guest VMs trying to install puppet on startup. I'll roll out new base images shortly.
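A quick way to spot-check a freshly booted VM is to look at the major version of the installed agent. The following is a minimal sketch, assuming the package is named puppet-agent (as in the dpkg output in the description); the helper names `puppet_major` and `check_puppet7` are hypothetical.

```shell
#!/bin/sh
# Extract the major version from a Debian package version string,
# e.g. "8.10.0-5" -> "8". An optional epoch prefix ("1:...") is dropped first.
puppet_major() {
    version="${1#*:}"
    printf '%s\n' "${version%%.*}"
}

# Return 0 if the installed puppet-agent is Puppet 7, 1 if it is another
# major version, and 2 if the package is not installed at all.
check_puppet7() {
    installed="$(dpkg-query -W -f='${Version}' puppet-agent 2>/dev/null)" || return 2
    [ "$(puppet_major "$installed")" = "7" ]
}
```

On a new guest, `check_puppet7 && echo ok` should print `ok` if the base image fix held; the 8.10.0-5 version from the description would make it return 1.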

Change #1271029 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps vendordata: slightly more cleanup

https://gerrit.wikimedia.org/r/1271029

Change #1271029 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps vendordata: slightly more cleanup

https://gerrit.wikimedia.org/r/1271029

Change #1269056 abandoned by Andrew Bogott:

[operations/puppet@production] nova vendordata: disable unattended upgrades in base image

https://gerrit.wikimedia.org/r/1269056

Andrew claimed this task.