
cloud-init overrides /etc/apt/sources.list on instance boot
Closed, Resolved, Public

Description

When booting a Bullseye image, Puppet fails to run apt-get update because of an erroneous suite name for the security repository:

Err:11 http://security.debian.org bullseye/updates Release

This causes instance provisioning to complete only partially. On a project with a standalone Puppet master (such as integration), the project admins then have no sudo rules, which in turn means they cannot fix the instance themselves.

Starting with Bullseye, the security suite was renamed from bullseye/updates to bullseye-security.

At boot, the cloud-init modules:config step regenerates /etc/apt/sources.list from a template shipped in the Debian package, which still refers to the old /updates suffix:

/etc/cloud/templates/sources.list.debian.tmpl
...
deb {{security}} {{codename}}/updates main
...
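The rewrite that is needed is the one Puppet eventually performs in its diff further down: the security entry moves to the debian-security path and the bullseye-security suite. As a minimal sketch, assuming a hypothetical post-processing helper (it is not part of cloud-init or operations/puppet), the transformation could look like this. Unlike the Puppet-managed file, it keeps the original components instead of adding contrib and non-free.

```python
import re

# Hypothetical helper, not part of cloud-init or the Wikimedia Puppet tree.
# Matches an old-style security entry such as
#   deb http://security.debian.org/ bullseye/updates main
OLD_SECURITY = re.compile(
    r"^(?P<type>deb(?:-src)?)\s+"
    r"(?P<url>https?://security\.debian\.org)/?\s+"
    r"(?P<codename>[a-z]+)/updates\s+"
    r"(?P<components>.+)$"
)


def fix_security_line(line: str) -> str:
    """Rewrite a pre-Bullseye security line to the <codename>-security layout."""
    m = OLD_SECURITY.match(line.strip())
    if m is None:
        return line  # not an old-style security entry: leave it untouched
    return (
        f"{m['type']} {m['url']}/debian-security "
        f"{m['codename']}-security {m['components']}"
    )


def fix_sources(text: str) -> str:
    """Apply fix_security_line to every line of a sources.list body."""
    return "\n".join(fix_security_line(line) for line in text.splitlines())
```

Lines that do not point at security.debian.org with a /updates suite pass through unchanged, so the helper is safe to run on an already-correct file.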

Looking at /var/log/cloud-init-output.log on an instance I created yesterday (integration-agent-docker-1040.integration.eqiad1.wikimedia.cloud), I believe the sequence is:


The image is created (June 8th in this case)
cloud-init generates the faulty sources.list, leading to a fetch error:

Cloud-init v. 20.4.1 running 'modules:config' at Thu, 08 Jun 2023 15:22:32 +0000. Up 24.28 seconds.
Err:8 http://security.debian.org bullseye/updates Release
  404  Not Found [IP: 199.232.30.132 80]
Reading package lists...
E: The repository 'http://security.debian.org bullseye/updates Release' does not have a Release file.
2023-06-08 15:22:34,605 - util.py[WARNING]: Running module apt-configure (<module 'cloudinit.config.cc_apt_configure' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py'>) failed

Puppet then updates /etc/apt/sources.list:

Notice: /Stage[main]/Apt/File[/etc/apt/sources.list]/content: 
--- /etc/apt/sources.list       2023-06-08 15:22:32.689301142 +0000
+++ /tmp/puppet-file20230608-7966-18p6hco       2023-06-08 15:24:02.889444582 +0000
-deb http://security.debian.org/ bullseye/updates main
-deb-src http://security.debian.org/ bullseye/updates main
+deb http://security.debian.org/debian-security bullseye-security main contrib non-free
+deb-src http://security.debian.org/debian-security bullseye-security main contrib non-free
Info: Computing checksum on file /etc/apt/sources.list
Info: /Stage[main]/Apt/File[/etc/apt/sources.list]: Filebucketed /etc/apt/sources.list to puppet with sum fd1359818a6dfdf26ba4a431eb7af6cb

The resulting image is saved. I believe it can be inspected manually to check that the apt configuration is correct. As far as I can tell, Puppet at least ran and refreshed apt-get update successfully.

When an instance is booted from this image, cloud-init kicks in again:

Cloud-init v. 20.4.1 running 'modules:config' at Thu, 29 Jun 2023 18:36:26 +0000. Up 16.85 seconds.
...
Err:11 http://security.debian.org bullseye/updates Release
  404  Not Found [IP: 199.232.30.132 80]
...
E: The repository 'http://security.debian.org bullseye/updates Release' does not have a Release file.

The error is ignored. Then Puppet runs:

Info: /Stage[main]/Apt/Apt::Repository[wikimedia]/File[/etc/apt/sources.list.d/wikimedia.list]: Scheduling refresh of Exec[apt_repository_wikimedia]
Notice: /Stage[main]/Apt/Apt::Repository[wikimedia]/Exec[apt_repository_wikimedia]/returns: E: The repository 'http://security.debian.org bullseye/updates Release' does not have a Release file.
Error: /Stage[main]/Apt/Apt::Repository[wikimedia]/Exec[apt_repository_wikimedia]: Failed to call refresh: '/usr/bin/apt-get update' returned 100 instead of one of [0]

That failure breaks the instance provisioning.
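When triaging logs like the ones above, it can help to extract exactly which repositories apt-get update could not fetch. A minimal sketch, assuming the output is available as text; broken_repositories is a hypothetical helper that only recognises the "does not have a Release file" error seen in this task. Because it searches anywhere in a line, it matches both the raw apt output logged by cloud-init and the Puppet Exec notice.

```python
import re

# Hypothetical triage helper: it recognises only the "does not have a Release
# file" error seen in this task. re.search (not match) lets it find the error
# both in raw apt-get output and in Puppet's "Notice: ... returns: E: ..." lines.
MISSING_RELEASE = re.compile(
    r"E: The repository '(?P<repo>[^']+) Release' does not have a Release file\."
)


def broken_repositories(output: str) -> list[str]:
    """Return each unreachable repository once, in order of first appearance."""
    repos: list[str] = []
    for line in output.splitlines():
        match = MISSING_RELEASE.search(line)
        if match and match["repo"] not in repos:
            repos.append(match["repo"])
    return repos
```

Deduplicating matters here because the same failure is reported once by cloud-init and again by every Puppet Exec that refreshes apt-get update.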


I believe we should tell cloud-init not to manage /etc/apt/sources.list, since we manage it via Puppet. I think the parameter comes from:

modules/openstack/templates/nova/vendordata.txt.erb
# You'll see that we're setting apt_preserve_sources_list twice here.  That's
#  because there's a bug in cloud-init where it tries to reconcile the
#  two settings and if they're different the stage fails. That means that
#  if one of them is set differently from the default (True) then nothing
#  works.
apt_preserve_sources_list: False
apt:
    preserve_sources_list: False

This means that on instance boot, cloud-init generates sources.list from the faulty template, which in turn leads to Puppet failing. I think we should remove that parameter, since sources.list is managed by Puppet.
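The comment in vendordata.txt.erb explains why the setting appears twice: if the legacy top-level key and the nested apt key disagree, the stage fails. A small pre-flight check enforcing the same rule could look like this. It is a hypothetical sketch modelled on that comment, not cloud-init's own reconciliation code, and it takes the default of True from the comment.

```python
# Hypothetical pre-flight check modelled on the comment in vendordata.txt.erb;
# it is not cloud-init's own reconciliation logic.
DEFAULT_PRESERVE = True  # default stated in the vendordata comment


def effective_preserve_sources_list(vendordata: dict) -> bool:
    """Return the agreed value of the two keys, or raise if they conflict."""
    legacy = vendordata.get("apt_preserve_sources_list", DEFAULT_PRESERVE)
    nested = vendordata.get("apt", {}).get("preserve_sources_list", DEFAULT_PRESERVE)
    if legacy != nested:
        raise ValueError(
            f"apt_preserve_sources_list={legacy!r} conflicts with "
            f"apt.preserve_sources_list={nested!r}"
        )
    return legacy
```

Setting only one of the two keys to a non-default value fails the check, which mirrors the "nothing works" case described in the comment; setting both to the same value, whether False or True, passes.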

Note: @taavi proposed to reorder Puppet resources https://gerrit.wikimedia.org/r/c/operations/puppet/+/934409

Event Timeline

Is cloud-init updating the file on every boot (or other times) or just on the initial VM creation? I'm pretty sure we need things set on first boot, at least.

Confirmed, the apt cloud-init module is 'Module frequency: once-per-instance' so it only happens on initial startup.

We're using this module to inject the wikimedia repos, which we need in order to get the proper puppet package installed. I'm not clear on what's happening with the debian repos.
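A once-per-instance module runs again whenever cloud-init sees a new instance ID, which is why an image saved from an already-configured VM still re-runs the apt module when a fresh instance is booted from it (as in the June 29 log in the description). The toy model below illustrates that frequency with an in-memory record keyed by instance ID and module name. The class and its layout are assumptions for illustration; cloud-init tracks this with its own semaphore files on disk.

```python
from typing import Callable


class InstanceSemaphores:
    """Toy, in-memory model of 'once-per-instance' frequency (hypothetical)."""

    def __init__(self) -> None:
        self._done: set[tuple[str, str]] = set()

    def run(self, instance_id: str, module: str, action: Callable[[], None]) -> bool:
        """Run `action` unless `module` already ran for `instance_id`."""
        key = (instance_id, module)
        if key in self._done:
            return False  # later boots of the same instance: skipped
        action()
        self._done.add(key)
        return True
```

Rebooting the same instance skips the module, while a new instance created from the saved image has a different ID and therefore runs it once more.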

The preserve_sources_list option overrides all other config keys that would alter sources.list or sources.list.d, except for additional sources to be added to sources.list.d

That seems promising :)
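The quoted documentation describes two separate effects, which a toy model makes explicit: preserve_sources_list gates whether sources.list is regenerated from the template, while additional entries still land in sources.list.d, which is how the wikimedia repos keep getting injected. In the sketch below, the file dict, the shape of the sources mapping, and the example repository line are placeholders, not cloud-init's real data structures.

```python
# Toy model of the documented preserve_sources_list behaviour; not cloud-init code.
# `files` maps paths to contents; `cfg` loosely mimics the nested `apt:` settings.
def apply_apt_config(files: dict, cfg: dict, rendered_template: str) -> dict:
    result = dict(files)  # do not mutate the caller's view of the filesystem
    if not cfg["preserve_sources_list"]:
        # Only when preservation is off is sources.list regenerated from the template.
        result["/etc/apt/sources.list"] = rendered_template
    # Additional sources are written to sources.list.d either way.
    for name, entry in cfg.get("sources", {}).items():
        result[f"/etc/apt/sources.list.d/{name}.list"] = entry["source"]
    return result


# Usage (the repository line is a placeholder, not the real wikimedia entry):
before = {"/etc/apt/sources.list": "puppet-managed content"}
extra = {"wikimedia": {"source": "deb http://example.org/wikimedia bullseye main"}}
kept = apply_apt_config(before, {"preserve_sources_list": True, "sources": extra}, "template")
```

With preservation on, the Puppet-managed sources.list survives and the extra repository is still added; with it off, the template output replaces sources.list, which is the failure mode described in this task.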

Change 934550 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-init: apt_preserve_sources_list=True

https://gerrit.wikimedia.org/r/934550

Change 934550 merged by Andrew Bogott:

[operations/puppet@production] cloud-init: apt_preserve_sources_list=True

https://gerrit.wikimedia.org/r/934550

hashar claimed this task.

@Andrew wrote:

Is cloud-init updating the file on every boot (or other times) or just on the initial VM creation? I'm pretty sure we need things set on first boot, at least.
...
Confirmed, the apt cloud-init module is 'Module frequency: once-per-instance' so it only happens on initial startup.

Ah good to know. I also discovered /var/log/cloud-init.log which has debug logging and timestamps.

I created a new instance, integration-agent-docker-1041 (still based on the Bullseye image from June 8th). Now that we have apt_preserve_sources_list=True, cloud-init no longer overwrites /etc/apt/sources.list, and when it runs apt-get update, the output shows the repository defined by Puppet:

[   17.290658] cloud-init[712]: Get:7 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]

And from the cloud-init logs:

Cloud-init v. 20.4.1 running 'modules:config' at Tue, 04 Jul 2023 10:15:56 +0000. Up 14.18 seconds.
...
Get:7 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
...
Fetched 1489 kB in 3s (582 kB/s)
...
Cloud-init v. 20.4.1 running 'modules:final' at Tue, 04 Jul 2023 10:16:03 +0000. Up 21.24 seconds.
...
+ puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff
Info: Using configured environment 'production'
...
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for integration-agent-docker-1041.integration.eqiad1.wikimedia.cloud
Info: Applying configuration version '(da7d4fae69) David Caro - P:toolforge: mail: blackhole noreply@'
...
Notice: /Stage[main]/Apt/File[/etc/apt/sources.list]/content: 
--- /etc/apt/sources.list       2023-06-08 15:24:02.913444806 +0000
+++ /tmp/puppet-file20230704-2073-x61t3o        2023-07-04 10:17:55.958593204 +0000
@@ -2,10 +2,10 @@
 ## This file is managed by puppet.
 ## Any local changes will be swiftly overwritten
 ##
-## Most cloud-vps projects can make persistent changes to apt sources 
+## Most cloud-vps projects can make persistent changes to apt sources
 ## by adding a new .list file in /etc/apt/sources.list.d.
 ##
-## Some cloud-vps projects have 'cloud.yaml:profile::apt::purge_sources' 
+## Some cloud-vps projects have 'profile::apt::purge_sources'
 ## set to 'true', in which case apt sources can only be managed
 ## via puppet.
 ##
Info: Computing checksum on file /etc/apt/sources.list
Info: /Stage[main]/Apt/File[/etc/apt/sources.list]: Filebucketed /etc/apt/sources.list to puppet with sum cf7d931e683eee60f93a83e8c58f7cb5
Notice: /Stage[main]/Apt/File[/etc/apt/sources.list]/content: content changed '{md5}cf7d931e683eee60f93a83e8c58f7cb5' to '{md5}7a453aaac6c535c67868c78e995d8c84'
Info: /Stage[main]/Apt/File[/etc/apt/sources.list]: Scheduling refresh of Exec[apt-get update]

So it merely adjusted some comments in /etc/apt/sources.list but otherwise preserved the version provided by our base image.

Success!!

Thank you @Andrew !

Additional note: I filed this task because I lacked sudo rules when the instance booted. On the new instance the rules are present, so I can now fix up the Puppet cert for the standalone Puppet master.