[openstack] Upgrade codfw hosts to bookworm
Closed, Resolved · Public

Description

This is required before we can upgrade OpenStack from Zed to Antelope, as Antelope packages are only available for Bookworm.

Event Timeline

Even after the new patch, Puppet is still failing with:

Error while evaluating a Function Call:
node codename does not meet requirement `bookworm <= bullseye`
…in /etc/puppet/modules/debian/functions/codename/require.pp, line: 23, column: 9.

That seems to be a result of https://gerrit.wikimedia.org/r/c/operations/puppet/+/765536, which insists that Galera only be installed on Bullseye. Hopefully @aborrero will know if we still need that, or need packages built for bookworm, or what.

On Bullseye (e.g. cloudcontrol1007):

root@cloudcontrol1007:~# dpkg --list | grep galera
ii  galera-4                             26.4.11-bullseye                          amd64        Replication framework for transactional applications

The default bookworm policy will install 26.4.13-1, so we can probably move forward with the attached patch and use the default packages.
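For reference, one quick way to compare what each release would install (an illustrative example, not a command taken from this task; it assumes the package name stays galera-4 on both releases):

# On an existing bullseye host (e.g. cloudcontrol1007), the wikimedia-hosted build should win:
apt-cache policy galera-4
# On a freshly imaged bookworm host, the plain Debian 26.4.13-1 build should be the candidate:
apt-cache policy galera-4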

Change 955841 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Galera: allow installing debian-hosted packages for Bookworm or later

https://gerrit.wikimedia.org/r/955841

Change 955902 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] ceph::common: allow bookworm and later versions

https://gerrit.wikimedia.org/r/955902

Change 955902 merged by FNegri:

[operations/puppet@production] ceph::common: allow bookworm and later versions

https://gerrit.wikimedia.org/r/955902

Change 955841 merged by FNegri:

[operations/puppet@production] Galera: allow installing debian-hosted packages for Bookworm or later

https://gerrit.wikimedia.org/r/955841

The next remaining piece is 'prometheus-memcached-exporter' for bookworm. It might be a simple backport.

Mentioned in SAL (#wikimedia-operations) [2023-09-08T17:13:16Z] <taavi> reprepro copy bookworm-wikimedia bullseye-wikimedia prometheus-memcached-exporter # T345810
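For context, "reprepro copy <target> <source> <package>" publishes an existing build into another distribution of the same repository, so the command above copies the bullseye-wikimedia prometheus-memcached-exporter package into bookworm-wikimedia. A quick sanity check on a bookworm host (illustrative only):

apt-get update
apt-cache policy prometheus-memcached-exporter    # the candidate should now resolve from bookworm-wikimedia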

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm completed:

  • cloudcontrol2001-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309071032_fnegri_3525434_cloudcontrol2001-dev.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309071209_fnegri_3525434_cloudcontrol2001-dev.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309071611_fnegri_3525434_cloudcontrol2001-dev.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309071634_fnegri_3525434_cloudcontrol2001-dev.out, asking the operator what to do
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309071635_fnegri_3525434_cloudcontrol2001-dev.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309081553_fnegri_3525434_cloudcontrol2001-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Galera is refusing to cluster between the bullseye and bookworm versions. There's some suggestion in the docs and error messages that it /will/ cluster across the boundary if there's a 'graceful shutdown' beforehand.

So I'm going to roll 2001-dev back to Bullseye and try stopping MariaDB and then doing an in-place upgrade. If that works, there will be a clear (if tedious) path forward.
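For the record, the rough shape of that in-place upgrade is something like the following (a sketch only; exactly how the newer packages are made available to the bullseye host, e.g. via pinning or a backport, is left out):

# Stop MariaDB cleanly so Galera records a graceful shutdown
systemctl stop mariadb
# Upgrade to the mariadb-server/galera-4 versions the bookworm nodes will run
apt-get install mariadb-server galera-4
# Restart and let the node rejoin the cluster
systemctl start mariadb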

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bullseye completed:

  • cloudcontrol2001-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309090138_root_2488026_cloudcontrol2001-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

The in-place upgrade to mariadb-server 10.11.3-1 seems to have worked; it's running and shows wsrep_local_state_comment | Synced.
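One way to check the cluster state a bit more broadly (an illustrative query, not necessarily the exact one used here; run as root on any cluster member):

mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment');"
# expect wsrep_cluster_status = Primary, wsrep_local_state_comment = Synced, and a cluster size matching the member count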

So... the next steps are:

  • in-place upgrade of mariadb-server on cloudcontrol2004-dev and cloudcontrol2005-dev
  • confirm syncing is still working
  • fresh re-image of cloudcontrol2001-dev to Bookworm
  • confirm syncing
  • THEN we can upgrade the other two cloudcontrols to Bookworm as well.

I may or may not start that process tomorrow.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudcontrol2001-dev.codfw.wmnet with OS bookworm completed:

  • cloudcontrol2001-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309091544_root_2895661_cloudcontrol2001-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudcontrol2004-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudcontrol2004-dev.codfw.wmnet with OS bookworm completed:

  • cloudcontrol2004-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309091654_root_2932873_cloudcontrol2004-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1001 for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm

All codfw1dev cloudcontrol hosts (2001/2004/2005) are now running Bookworm and have Galera properly sync'd.

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1001 for host cloudcontrol2005-dev.codfw.wmnet with OS bookworm completed:

  • cloudcontrol2005-dev (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309091938_root_3011770_cloudcontrol2005-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Thanks @Andrew for fixing the cloudcontrol nodes! I tried to find out which other hosts must be reimaged, and I think the answer is "any node with the class openstack::serverpackages::zed::bullseye":

fnegri@cloudcumin1001:~$ sudo cumin 'P{C:openstack::serverpackages::zed::bullseye} and A:codfw'
7 hosts will be targeted:
cloudnet[2005-2006]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.codfw.wmnet,cloudvirt[2001-2003]-dev.codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudnet2005-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudnet2005-dev.codfw.wmnet with OS bookworm executed with errors:

  • cloudnet2005-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudnet2005-dev.codfw.wmnet with OS bookworm

Puppet is failing to run on cloudnet2005-dev after the reimage with the following error:

node codename does not meet requirement `bookworm == bullseye`

I'm struggling to find where that requirement is defined.
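One way to track that kind of requirement down is to grep a local operations/puppet checkout for callers of the function raising the error (illustrative only; the path in the error message maps to the debian::codename::require function):

# Find every place a codename requirement is enforced
grep -rn "codename::require" modules/
# Narrow to the openstack module, the likely culprit for this host role
grep -rn "codename::require" modules/openstack/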

Change 956832 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [openstack] bridge-utils workaround is bullseye-only

https://gerrit.wikimedia.org/r/956832

Change 956832 merged by FNegri:

[operations/puppet@production] [openstack] bridge-utils config in bookworm

https://gerrit.wikimedia.org/r/956832

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudnet2005-dev.codfw.wmnet with OS bookworm completed:

  • cloudnet2005-dev (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed and logged in /var/log/spicerack/sre/hosts/reimage/202309111632_fnegri_570806_cloudnet2005-dev.out, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309121610_fnegri_570806_cloudnet2005-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudnet2006-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudnet2006-dev.codfw.wmnet with OS bookworm completed:

  • cloudnet2006-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309131745_fnegri_3372155_cloudnet2006-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:25:19Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:25:49Z] <fnegri@cloudcumin1001> END (ERROR) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=97) (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:29:48Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:29:52Z] <wm-bot2> fran@wmf3169 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:51:31Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:51:40Z] <wm-bot2> fran@wmf3169 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T16:56:54Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T17:01:24Z] <wm-bot2> fran@wmf3169 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T17:06:18Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudvirt.drain (T345810)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-14T17:13:13Z] <wm-bot2> fran@wmf3169 admin END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T345810)

cloudcontrols, cloudnets and cloudvirts have all been reimaged.

We still need to reimage cloudbackups, cloudservices and cloudwebs.

We don't plan to reimage cloudcephmons, cloudcephosds and clouddbs yet, because they're not running any OpenStack components, as far as I know.

Recap of the current OS for cloud* hosts (excluding the "insetup" ones):

fnegri@cloudcumin1001:~$ sudo cumin 'P{cloud*} and A:codfw and not P{O:insetup::wmcs}' 'cat /etc/debian_version'

[...]

===== NODE GROUP =====
(11) cloudcontrol[2001,2004-2005]-dev.codfw.wmnet,cloudnet[2005-2006]-dev.codfw.wmnet,cloudvirt[2001-2006]-dev.codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
12.1
===== NODE GROUP =====
(7) cloudbackup[2001-2002].codfw.wmnet,clouddb2002-dev.codfw.wmnet,cloudgw[2002-2003]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.codfw.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
===== NODE GROUP =====
(7) cloudcephmon[2004-2006]-dev.codfw.wmnet,cloudcephosd[2001-2003]-dev.codfw.wmnet,cloudweb2002-dev.wikimedia.org
----- OUTPUT of 'cat /etc/debian_version' -----
10.13

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm completed:

  • cloudbackup1001-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181401_fnegri_3707187_cloudbackup1001-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudservices2004-dev.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudservices2004-dev.codfw.wmnet with OS bookworm executed with errors:

  • cloudservices2004-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181441_fnegri_3715108_cloudservices2004-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Please note backups are failing for cloudservices2004-dev.codfw.wmnet-Monthly-1st-Fri-productionEqiad-openldap: T339894#9181716

fnegri changed the task status from In Progress to Stalled. Sep 21 2023, 2:58 PM

@jcrespo thanks for spotting the backup issue. Unfortunately cloudservices2004-dev is currently broken and I cannot fix it before T346762: Package mcrouter for Debian Bookworm is resolved.

I wonder if we could move the backups to cloudservices2005-dev (which I haven't reimaged yet and is still on Bullseye), but I don't know how to do it.

As an alternative, I could reimage 2004 back to Bullseye until we are ready to move to Bookworm.

Third option: accept that backups will be broken for a few more days.

fnegri changed the task status from Stalled to In Progress. Sep 29 2023, 10:05 AM

The LDAP issues were fixed in T347555; now Puppet has only one failure on cloudservices200[45]:

E: Unable to locate package wmfbackups

I created T347740: wmfbackups packages for Debian Bookworm

fnegri changed the task status from In Progress to Stalled. Sep 29 2023, 4:41 PM
fnegri raised the priority of this task from Medium to High.
fnegri changed the status of subtask T347740: wmfbackups packages for Debian Bookworm from Open to In Progress.
fnegri changed the task status from Stalled to In Progress. Oct 2 2023, 1:16 PM

Until T347740 is resolved, I have manually installed wmfbackups on cloudservices200[45]; Puppet is now running successfully on those hosts.
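One plausible way to do that kind of stopgap install (hedged; this may not be exactly what was done, and it assumes the bullseye build of wmfbackups installs cleanly on bookworm):

# On a host that still has bullseye-wikimedia as an apt source, fetch the existing build
apt-get download wmfbackups
# Copy the .deb to the bookworm host, install it by hand, and pull in any missing dependencies
dpkg -i wmfbackups_*.deb
apt-get -f install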

We have now reimaged all the codfw hosts that include OpenStack packages:

fnegri@cumin1001:~$ sudo cumin -x 'A:codfw and P{cloud*} and not P{O:insetup::wmcs} and not A:bookworm' 'ls /etc/apt/sources.list.d/openstack*'
[...]
===== NODE GROUP =====
(16) cloudbackup[2001-2002].codfw.wmnet,cloudcephmon[2004-2006]-dev.codfw.wmnet,cloudcephosd[2001-2003]-dev.codfw.wmnet,cloudcumin2001.codfw.wmnet,clouddb2002-dev.codfw.wmnet,cloudgw[2002-2003]-dev.codfw.wmnet,cloudlb[2001-2003]-dev.codfw.wmnet,cloudweb2002-dev.wikimedia.org
----- OUTPUT of 'ls /etc/apt/sour...ist.d/openstack*' -----
ls: cannot access '/etc/apt/sources.list.d/openstack*': No such file or directory
fnegri@cumin1001:~$ sudo cumin -x 'A:codfw and P{cloud*} and not P{O:insetup::wmcs} and A:bookworm' 'ls /etc/apt/sources.list.d/openstack*'
[...]
===== NODE GROUP =====
(13) cloudcontrol[2001,2004-2005]-dev.codfw.wmnet,cloudnet[2005-2006]-dev.codfw.wmnet,cloudservices[2004-2005]-dev.codfw.wmnet,cloudvirt[2001-2006]-dev.codfw.wmnet
----- OUTPUT of 'ls /etc/apt/sour...ist.d/openstack*' -----
/etc/apt/sources.list.d/openstack-zed-bookworm-nochange.sources
/etc/apt/sources.list.d/openstack-zed-bookworm.sources
================

Before resolving this task, we should check that all the OpenStack components are working correctly after the Bookworm reimage.

This is the result of the cookbook wmcs.openstack.network.tests:

fnegri@cloudcumin1001:~$ sudo cookbook wmcs.openstack.network.tests
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[2023-10-02 17:22:04] INFO: --- cloudcontrol2004-dev Debian GNU/Linux 12 (bookworm) 6.1.0-11-amd64
[2023-10-02 17:22:04] INFO: ---
[2023-10-02 17:22:04] INFO: running: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[2023-10-02 17:22:04] INFO: running: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[2023-10-02 17:22:04] INFO: running: basic ping to neutron WAN from outside the cloud network
[2023-10-02 17:22:04] INFO: running: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[2023-10-02 17:22:07] INFO: running: basic ping to neutron VIRT gateway from within the cloud virtual network, with floating IP
[2023-10-02 17:22:10] INFO: running: VM (no floating IP) contacting the internet gets NAT'd using routing_source_ip
[2023-10-02 17:22:11] INFO: running: VM (no floating IP) contacting an address covered by dmz_cidr doesn't get NAT'd
[2023-10-02 17:22:12] INFO: running: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
[2023-10-02 17:22:15] INFO: running: VM (no floating IP) can contact auth DNS server
[2023-10-02 17:22:16] INFO: running: VM (no floating IP) can contact recursor DNS server
[2023-10-02 17:22:18] INFO: running: VM (using floating IP) can contact auth DNS server
[2023-10-02 17:22:19] INFO: running: VM (using floating IP) can contact recursor DNS server
[2023-10-02 17:22:21] INFO: running: VM (using floating IP) can contact LDAP server
[2023-10-02 17:22:22] INFO: running: VM (not using floating IP) can contact LDAP server
[2023-10-02 17:22:23] INFO: running: VM (using floating IP) can contact openstack API
[2023-10-02 17:22:25] INFO: running: VM (no floating IP) can contact openstack API
[2023-10-02 17:22:26] INFO: running: puppetmasters can sync git tree
[2023-10-02 17:22:40] WARNING: cmd '/usr/bin/ssh -i /etc/networktests/sshkeyfile -o User=srv-networktests -q -o ConnectTimeout=5 -o NumberOfPasswordPrompts=0 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -o Proxycommand="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -i /etc/networktests/sshkeyfile -W %h:%p srv-networktests@bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org" cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud 'sudo git-sync-upstream 2>&1 | grep -q Up-to-date'', expected return code '0', but got '1'
[2023-10-02 17:22:40] WARNING: failed test: puppetmasters can sync git tree
[2023-10-02 17:22:40] INFO: running: VM (using floating IP) can read dumps NFS
[2023-10-02 17:22:43] INFO: running: VM (no floating IP) can read dumps NFS
[2023-10-02 17:22:44] INFO: ---
[2023-10-02 17:22:44] INFO: --- passed tests: 18
[2023-10-02 17:22:44] INFO: --- failed tests: 1
[2023-10-02 17:22:44] INFO: --- total tests: 19
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Cloud VPS network tests: 1 failed tests detected!
END (FAIL) - Cookbook wmcs.openstack.network.tests (exit_code=1)

And this is the result of the Terraform tests (tf-infra-test, running in paws-dev-bastion:/home/rook/tf-infra-test):

openstack_containerinfra_clustertemplate_v1.template_123: Creating...
openstack_db_instance_v1.postgresql: Creating...
openstack_db_instance_v1.mariadb: Creating...
openstack_compute_instance_v2.vm: Creating...
openstack_networking_floatingip_v2.floating_ip: Creating...
openstack_blockstorage_volume_v3.volume: Creating...
openstack_db_instance_v1.mysql: Creating...
cloudvps_puppet_prefix.terraform: Creating...
cloudvps_puppet_prefix.terraform: Creation complete after 0s [name=terraform-]
openstack_containerinfra_clustertemplate_v1.template_123: Creation complete after 4s [id=3687faec-942a-436e-83db-976f3517daea]
openstack_containerinfra_cluster_v1.k8s_123: Creating...
openstack_networking_floatingip_v2.floating_ip: Creation complete after 7s [id=990e112d-2c1c-46a2-a031-20a63116eb4a]
openstack_db_instance_v1.postgresql: Still creating... [10s elapsed]
openstack_db_instance_v1.mariadb: Still creating... [10s elapsed]
openstack_compute_instance_v2.vm: Still creating... [10s elapsed]
openstack_blockstorage_volume_v3.volume: Still creating... [10s elapsed]
openstack_db_instance_v1.mysql: Still creating... [10s elapsed]
openstack_blockstorage_volume_v3.volume: Creation complete after 11s [id=b5b58484-4527-44be-9a79-d42f22e3b834]
openstack_containerinfra_cluster_v1.k8s_123: Still creating... [10s elapsed]
openstack_compute_instance_v2.vm: Creation complete after 15s [id=eb655973-808c-4397-87a0-b7747a7e0a9e]
openstack_compute_floatingip_associate_v2.floating_ip: Creating...
openstack_compute_volume_attach_v2.va_1: Creating...
cloudvps_web_proxy.web_proxy: Modifying...
cloudvps_web_proxy.web_proxy: Modifications complete after 1s
openstack_compute_floatingip_associate_v2.floating_ip: Creation complete after 4s [id=185.15.57.20/eb655973-808c-4397-87a0-b7747a7e0a9e/]
openstack_db_instance_v1.postgresql: Still creating... [20s elapsed]
openstack_db_instance_v1.mariadb: Still creating... [20s elapsed]
openstack_db_instance_v1.mysql: Still creating... [20s elapsed]
openstack_compute_volume_attach_v2.va_1: Creation complete after 7s [id=eb655973-808c-4397-87a0-b7747a7e0a9e/b5b58484-4527-44be-9a79-d42f22e3b834]

[...]

openstack_db_instance_v1.mariadb: Creation complete after 3m3s [id=9b55e7bd-f2be-43cf-947d-26b9d3b70871]
openstack_containerinfra_cluster_v1.k8s_123: Still creating... [3m0s elapsed]
openstack_db_instance_v1.mysql: Still creating... [3m10s elapsed]
openstack_db_instance_v1.mysql: Creation complete after 3m12s [id=33c6a9e2-1ad9-42c7-8855-149f6223a1b1]

[...]

openstack_containerinfra_cluster_v1.k8s_123: Creation complete after 11m56s [id=b2f93184-345a-44d1-9718-9ba1303871de]
╷
│ Error: Error waiting for openstack_db_instance_v1 f34fb8b0-da0b-45a0-955b-a852a3b31356 to become ready: unexpected state 'ERROR', wanted target 'ACTIVE, HEALTHY'. last error: %!s(<nil>)
│
│   with openstack_db_instance_v1.postgresql,
│   on trove.tf line 51, in resource "openstack_db_instance_v1" "postgresql":
│   51: resource "openstack_db_instance_v1" "postgresql" {
│

The only error in the tf-infra-test is the PostgreSQL one, which was already present before the reimage.

The only error in the network tests is the one already tracked in T347880: codfw1dev: git tree out of sync

fnegri moved this task from In progress to Done on the cloud-services-team (FY2023/2024-Q1-Q2) board.

I think this can be resolved and we can continue with T341285: Upgrade cloud-vps openstack to version 'Antelope'.

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm executed with errors:

  • cloudbackup1002-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and delete any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310180910_fnegri_2557740_cloudbackup1002-dev.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm executed with errors:

  • cloudbackup1002-dev (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and delete any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310181019_fnegri_2592452_cloudbackup1002-dev.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1001 for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1001 for host cloudbackup1002-dev.eqiad.wmnet with OS bookworm completed:

  • cloudbackup1002-dev (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310181114_fnegri_2621058_cloudbackup1002-dev.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I just reimaged cloudbackup1002-dev because I realized I had reimaged cloudbackup1001-dev but forgot about 1002.