
Upgrade codfw1dev to bullseye
Closed, ResolvedPublic

Description

In preparation for upgrading eqiad1 to the same, upgrade codfw1dev to bullseye.

Details

Related Changes in Gerrit:
Repo                                          Branch      Lines +/-
operations/puppet                             production  +1 -1
operations/puppet                             production  +1 -1
operations/puppet                             production  +35 -5
operations/puppet                             production  +18 -6
operations/debs/prometheus-pdns-rec-exporter  master      +27 -13
operations/puppet                             production  +13 -7
operations/puppet                             production  +5 -1
operations/puppet                             production  +5 -10
operations/puppet                             production  +19 -4
operations/puppet                             production  +4 -3
operations/puppet                             production  +504 -9
operations/puppet                             production  +38 -2
operations/puppet                             production  +1 -1
operations/puppet                             production  +4 -0
operations/puppet                             production  +0 -5
operations/puppet                             production  +14 -2
operations/puppet                             production  +23 -13
operations/puppet                             production  +22 -13
operations/puppet                             production  +216 -0
operations/puppet                             production  +0 -1
operations/puppet                             production  +1 -7
operations/puppet                             production  +7 -1
operations/puppet                             production  +3 -2
operations/puppet                             production  +4 -2

Event Timeline


Change 757702 had a related patch set uploaded (by Michael DiPietro; author: Michael DiPietro):

[operations/puppet@production] upgrade codfw1dev to bullseye

https://gerrit.wikimedia.org/r/757702

Change 757702 merged by Michael DiPietro:

[operations/puppet@production] upgrade codfw1dev to bullseye

https://gerrit.wikimedia.org/r/757702

Change 757729 had a related patch set uploaded (by Michael DiPietro; author: Michael DiPietro):

[operations/puppet@production] upgrade codfw1dev to bullseye

https://gerrit.wikimedia.org/r/757729

Change 757729 merged by Michael DiPietro:

[operations/puppet@production] upgrade codfw1dev to bullseye

https://gerrit.wikimedia.org/r/757729

Change 757739 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: remove few more python 2 packages

https://gerrit.wikimedia.org/r/757739

Change 757742 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] backy2: don't install python3-crypto in bullseye

https://gerrit.wikimedia.org/r/757742

Change 757742 merged by Andrew Bogott:

[operations/puppet@production] backy2: don't install python3-crypto in bullseye

https://gerrit.wikimedia.org/r/757742

Change 757745 had a related patch set uploaded (by Michael DiPietro; author: Michael DiPietro):

[operations/puppet@production] upgrade codfw1dev to bullseye

https://gerrit.wikimedia.org/r/757745

Change 757739 merged by Michael DiPietro:

[operations/puppet@production] openstack: remove few more python 2 packages

https://gerrit.wikimedia.org/r/757739

Change 757745 abandoned by Michael DiPietro:

[operations/puppet@production] upgrade codfw1dev to bullseye

Reason:

https://gerrit.wikimedia.org/r/757745

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS buster executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS buster

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS buster executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201272322_andrew_3448_cloudcontrol2001-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by mdipietro@cumin1001 for host cloudcontrol2001-dev.wikimedia.org with OS bullseye executed with errors:

  • cloudcontrol2001-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201271818_mdipietro_17188_cloudcontrol2001-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

The remaining Puppet failure seems to be caused by the Glance system user using an unusually high user ID:

taavi@cloudcontrol2001-dev ~ $ id glance
uid=64062(glance) gid=64062(glance) groups=64062(glance)

compared to, for example:

taavi@cloudcontrol2003-dev ~ $ id glance
uid=496(glance) gid=64062(glance) groups=64062(glance)

The current IDs seem to be hardcoded in the packaging:

# https://salsa.debian.org/openstack-team/debian/openstack-pkg-tools/-/blob/debian/victoria/pkgos_func#L855
	"glance")
		ADDGROUP_PARAM="--gid 64062"
		ADDUSER_PARAM="--uid 64062"
		;;

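Not sure what changed between buster and bullseye, or why nova and cinder (also hardcoded in the same place) are using IDs under 500. Note that UID 496 is already taken by a different user on cloudcontrol2001-dev, and that same user has UID 494 on cloudcontrol2003-dev: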

taavi@cloudcontrol2001-dev ~ $ id 496
uid=496(srv-networktests) gid=1001(srv-networktests) groups=1001(srv-networktests)
taavi@cloudcontrol2003-dev ~ $ id srv-networktests 
uid=494(srv-networktests) gid=1001(srv-networktests) groups=1001(srv-networktests)

We've had other clashes in the past between LDAP users and the users created by OpenStack packages; see T230003: openstack: cleanup neutron user and rOPUP4b793df105ab: openstack/buster/nova: Create 'nova' system user in puppet for a couple of examples.
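
A check like the one done by hand above can be scripted. The following is only a minimal sketch, not part of any existing tooling: it parses `id` output pasted from each host and reports accounts whose UID differs between hosts, plus UIDs that are claimed by more than one account name. The names `parse_id`, `uid_report` and `SAMPLES` are made up for illustration; the sample lines are the ones quoted in this comment.

```python
import re
from collections import defaultdict

# Matches the leading "uid=N(name)" part of id(1) output.
ID_RE = re.compile(r"uid=(\d+)\(([^)]+)\)")


def parse_id(line):
    """Return (name, uid) from one line of `id` output."""
    match = ID_RE.search(line)
    if match is None:
        raise ValueError(f"not id(1) output: {line!r}")
    return match.group(2), int(match.group(1))


def uid_report(outputs):
    """Compare `id` output across hosts.

    outputs maps host name -> list of `id` output lines.
    Returns (drift, clashes):
      drift   maps account name -> {host: uid} where the UID differs by host
      clashes maps uid -> sorted account names where one UID has several names
    """
    by_name = defaultdict(dict)
    by_uid = defaultdict(set)
    for host, lines in outputs.items():
        for line in lines:
            name, uid = parse_id(line)
            by_name[name][host] = uid
            by_uid[uid].add(name)
    drift = {n: h for n, h in by_name.items() if len(set(h.values())) > 1}
    clashes = {u: sorted(ns) for u, ns in by_uid.items() if len(ns) > 1}
    return drift, clashes


# Sample data copied from the `id` output quoted above.
SAMPLES = {
    "cloudcontrol2001-dev": [
        "uid=64062(glance) gid=64062(glance) groups=64062(glance)",
        "uid=496(srv-networktests) gid=1001(srv-networktests) groups=1001(srv-networktests)",
    ],
    "cloudcontrol2003-dev": [
        "uid=496(glance) gid=64062(glance) groups=64062(glance)",
        "uid=494(srv-networktests) gid=1001(srv-networktests) groups=1001(srv-networktests)",
    ],
}

if __name__ == "__main__":
    drift, clashes = uid_report(SAMPLES)
    for name, hosts in sorted(drift.items()):
        print(f"UID differs between hosts for {name}: {hosts}")
    for uid, names in sorted(clashes.items()):
        print(f"UID {uid} is used by more than one account: {names}")
```

With the sample data, both glance and srv-networktests are reported as differing between hosts, and UID 496 is reported as shared by the two accounts.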

Change 757899 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] admin: enforce-users-groups: Add new system user range

https://gerrit.wikimedia.org/r/757899

Change 757899 merged by Andrew Bogott:

[operations/puppet@production] admin: enforce-users-groups: Add new system user range

https://gerrit.wikimedia.org/r/757899

I spent most of my day yesterday fighting with Galera, trying to get it to sync over the boundary between mariadb 10.3 (buster) and mariadb 10.5 (bullseye).

It CAN work; Galera doesn't have a problem. What is a problem is the crashlog; mariadb got upset on the new host and tried to replay the crashlog, which is strictly versioned. This resulted in some angry messages about 'You can't upgrade and recover from a crash at the same time' and a failure to start.

I am pretty sure that the right solution for this is an in-place upgrade of all nodes before reimaging. Then we can follow this guide to get things syncing again:

https://fromdual.com/upgrading-from-mariadb-10.4-to-mariadb-10.5-galera-cluster

Once all the nodes are on 10.5 we should be able to reimage any one node without causing a version-mismatch freakout.
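
The precondition described above (every node already on 10.5 before any reimage) could be checked with something like the sketch below. It is only an illustration under stated assumptions: the version strings are hypothetical examples of what `SELECT VERSION()` might return, collecting them from the nodes is left out, and the function names are made up for this example.

```python
import re


def series(version):
    """Return the (major, minor) release series of a MariaDB version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def nodes_blocking_reimage(versions, target=(10, 5)):
    """Return the hosts that are not yet on the target release series.

    versions maps host name -> version string. An empty result means every
    node is on the target series, so reimaging one node should not lead to
    a cross-version sync.
    """
    return sorted(h for h, v in versions.items() if series(v) != target)


# Hypothetical version strings, for illustration only.
EXAMPLE_VERSIONS = {
    "cloudcontrol2001-dev": "10.5.12-MariaDB-0+deb11u1",
    "cloudcontrol2003-dev": "10.3.31-MariaDB-0+deb10u1",
    "cloudcontrol2004-dev": "10.3.31-MariaDB-0+deb10u1",
}

if __name__ == "__main__":
    blocking = nodes_blocking_reimage(EXAMPLE_VERSIONS)
    if blocking:
        print("Upgrade in place before reimaging:", ", ".join(blocking))
    else:
        print("All nodes on the target series; reimage one node at a time.")
```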

Change 757956 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Keystone: don't use apache/mod_wsgi on bullseye

https://gerrit.wikimedia.org/r/757956

Change 757956 merged by Andrew Bogott:

[operations/puppet@production] Keystone: don't use apache/mod_wsgi on bullseye

https://gerrit.wikimedia.org/r/757956

Change 757959 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Exclude keystone::apache profile on Bullseye hosts

https://gerrit.wikimedia.org/r/757959

Change 757959 merged by Andrew Bogott:

[operations/puppet@production] Exclude keystone::apache profile on Bullseye hosts

https://gerrit.wikimedia.org/r/757959

Change 757962 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Another piece of removing Apache from cloudcontrol/bullseye

https://gerrit.wikimedia.org/r/757962

Change 757962 merged by Andrew Bogott:

[operations/puppet@production] Another piece of removing Apache from cloudcontrol/bullseye

https://gerrit.wikimedia.org/r/757962

Change 757968 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::openstack::eqiad1::keystone::wsgi_server: 'keystone'

https://gerrit.wikimedia.org/r/757968

Change 757968 merged by Andrew Bogott:

[operations/puppet@production] profile::openstack::eqiad1::keystone::wsgi_server: 'keystone'

https://gerrit.wikimedia.org/r/757968

Change 757969 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] typo fix: apache2/keystone

https://gerrit.wikimedia.org/r/757969

Change 757969 merged by Andrew Bogott:

[operations/puppet@production] typo fix: apache2/keystone

https://gerrit.wikimedia.org/r/757969

Change 757982 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: keystone: set bind port for uwsgi process

https://gerrit.wikimedia.org/r/757982

Change 757982 merged by Andrew Bogott:

[operations/puppet@production] openstack: keystone: set bind port for uwsgi process

https://gerrit.wikimedia.org/r/757982

Change 758004 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Keystone/victoria/bullseye: brute-force replace init scripts

https://gerrit.wikimedia.org/r/758004

Change 758004 merged by Andrew Bogott:

[operations/puppet@production] Keystone/victoria/bullseye: brute-force replace init scripts

https://gerrit.wikimedia.org/r/758004

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcontrol2003-dev.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcontrol2003-dev.wikimedia.org with OS bullseye executed with errors:

  • cloudcontrol2003-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201290445_andrew_24112_cloudcontrol2003-dev.out
    • The reimage failed, see the cookbook logs for the details

Change 758036 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: force start barbican and trove services

https://gerrit.wikimedia.org/r/758036

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcontrol2004-dev.wikimedia.org with OS bullseye

Change 758036 merged by Andrew Bogott:

[operations/puppet@production] openstack: force start barbican and trove services

https://gerrit.wikimedia.org/r/758036

Change 758049 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: fix novaenv path

https://gerrit.wikimedia.org/r/758049

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcontrol2004-dev.wikimedia.org with OS bullseye completed:

  • cloudcontrol2004-dev (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201291656_andrew_26799_cloudcontrol2004-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudservices2003-dev.wikimedia.org with OS bullseye

Change 758052 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack::galera: fix monitoring process name on bullseye

https://gerrit.wikimedia.org/r/758052

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudservices2003-dev.wikimedia.org with OS bullseye executed with errors:

  • cloudservices2003-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201291804_andrew_5834_cloudservices2003-dev.out
    • The reimage failed, see the cookbook logs for the details

Change 758063 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] pdns: support bullseye

https://gerrit.wikimedia.org/r/758063

Change 758068 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/debs/prometheus-pdns-rec-exporter@master] Bare minimum port to Python 3 to support Debian Bullseye

https://gerrit.wikimedia.org/r/758068

Change 758510 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] O:openstack::services: don't use pdns prometheus exporters on bullseye

https://gerrit.wikimedia.org/r/758510

Change 758510 merged by Andrew Bogott:

[operations/puppet@production] O:openstack::services: don't use pdns prometheus exporters on bullseye

https://gerrit.wikimedia.org/r/758510

Change 758052 merged by Andrew Bogott:

[operations/puppet@production] P:openstack::galera: fix monitoring process name on bullseye

https://gerrit.wikimedia.org/r/758052

Change 758049 merged by Andrew Bogott:

[operations/puppet@production] P:openstack: fix novaenv path

https://gerrit.wikimedia.org/r/758049

Change 758063 merged by Ssingh:

[operations/puppet@production] pdns: support bullseye

https://gerrit.wikimedia.org/r/758063

Change 758068 abandoned by Majavah:

[operations/debs/prometheus-pdns-rec-exporter@master] Bare minimum port to Python 3 to support Debian Bullseye

Reason:

Using the internal prometheus metrics in powerdns seems indeed the best solution. Thanks!

https://gerrit.wikimedia.org/r/758068

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudnet2004-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201312212_andrew_18938_cloudnet2004-dev.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudnet2004-dev (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202010218_andrew_21035_cloudnet2004-dev.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudnet2004-dev (FAIL)
    • Downtimed on Icinga
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202010337_andrew_2994_cloudnet2004-dev.out
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudnet2004-dev.codfw.wmnet with OS bullseye completed:

  • cloudnet2004-dev (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202011726_andrew_25862_cloudnet2004-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudnet2002-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudnet2002-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudnet2002-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202012155_andrew_8304_cloudnet2002-dev.out
    • The reimage failed, see the cookbook logs for the details

@Andrew Hi! I have acked some alerts in Icinga; they now show up as "silenced". If possible, could you ack new ones once in a while during this task to reduce the noise on the Icinga page?

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudservices2002-dev.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudservices2002-dev.wikimedia.org with OS bullseye executed with errors:

  • cloudservices2002-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202041852_andrew_15502_cloudservices2002-dev.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt2001-dev.codfw.wmnet with OS bullseye executed with errors:

  • cloudvirt2001-dev (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt2001-dev.codfw.wmnet with OS bullseye completed:

  • cloudvirt2001-dev (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202050611_andrew_14326_cloudvirt2001-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt2001-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt2001-dev.codfw.wmnet with OS bullseye completed:

  • cloudvirt2001-dev (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202051753_andrew_16747_cloudvirt2001-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt2002-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt2002-dev.codfw.wmnet with OS bullseye completed:

  • cloudvirt2002-dev (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202051929_andrew_31252_cloudvirt2002-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt2003-dev.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt2003-dev.codfw.wmnet with OS bullseye completed:

  • cloudvirt2003-dev (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202052128_andrew_17772_cloudvirt2003-dev.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

This is done for all cloudcontrol, cloudservices, cloudnet, and cloudvirt hosts in codfw1dev. The puppet manifests have been updated so we should be able to do the same in eqiad.

I managed to totally break/lose the existing cloud puppetmaster while doing this but I don't think there's any real lesson to learn from that other than 'make sure cloudvirts are evacuated before rebuilding'.

One important caveat for doing this in eqiad1: Galera is happy with in-place upgrades but does not like to sync from an existing Buster node to a new Bullseye node. The solution is to upgrade all cloudcontrols in place first and then reimage them one by one.

Change 761313 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack pdns_auth: fix prometheus monitoring

https://gerrit.wikimedia.org/r/761313

Change 761315 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] dnsrecursor: add built-in webserver support

https://gerrit.wikimedia.org/r/761315

Change 761313 merged by Andrew Bogott:

[operations/puppet@production] openstack pdns_auth: fix prometheus monitoring

https://gerrit.wikimedia.org/r/761313

Change 761315 merged by Andrew Bogott:

[operations/puppet@production] dnsrecursor: add built-in webserver support

https://gerrit.wikimedia.org/r/761315

Change 763612 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] dnsrecursor: change webserver listening address

https://gerrit.wikimedia.org/r/763612

Change 763612 merged by Andrew Bogott:

[operations/puppet@production] dnsrecursor: change webserver listening address

https://gerrit.wikimedia.org/r/763612

Change 768770 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:prometheus::ops: fix powerdns-auth port

https://gerrit.wikimedia.org/r/768770

Change 768770 merged by Andrew Bogott:

[operations/puppet@production] P:prometheus::ops: fix powerdns-auth port

https://gerrit.wikimedia.org/r/768770

Andrew claimed this task.