Upgrade backup* hosts to bullseye
Closed, Resolved · Public

Description

Upgrade the following hosts, used for database backups, media backups and bacula storage to Debian Bullseye:

  • ms-backup1001
  • ms-backup1002
  • ms-backup2001
  • ms-backup2002
  • backup1001 (requires firmware update)
  • backup1002 (requires firmware update)
  • backup1003
  • backup1004
  • backup1005
  • backup1006
  • backup1007
  • backup1008
  • backup2001 (requires firmware update)
  • backup2002 (requires firmware update, it may need an additional clean reimage)
  • backup2003
  • backup2004
  • backup2005
  • backup2006
  • backup2007
  • backup2008

Event Timeline

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2007.codfw.wmnet with OS bullseye completed:

  • backup2007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204110859_jynus_1923350_backup2007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@MoritzMuehlenhoff I need your advice here. When reimaging backup hosts (while preserving their data dir), the debmonitor user takes over the UID of the minio-user. This requires changing the owner of the files, but that doesn't scale: it took over a week, even with multiple threads running in parallel, to update the hundreds of millions of backup files on backup2007 (I went through with it because I was already too deep in when I realized). During that maintenance, minio has to be down. Note that this is not exclusive to backup hosts, but it is worse there because we have higher consolidation than on swift, as backups are not focused on performance (slow disks with lots of data).
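
For reference, the kind of parallel ownership fixup this implies looks roughly like the following (a sketch only: the data path, parallelism values and service unit name are placeholders, not our actual layout):

  # Fix ownership of the preserved minio data tree after the UID collision.
  # minio has to stay stopped while this runs, hence the long maintenance window.
  sudo systemctl stop minio    # placeholder unit name
  # chown in parallel batches; even so, hundreds of millions of files take days.
  sudo find /srv/mediabackup -xdev -user debmonitor -print0 \
    | xargs -0 -P 8 -n 10000 sudo chown minio-user:minio-user
  sudo systemctl start minio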

The way I see it, there are a few alternatives:

  • After reimage, the debmonitor user's UID gets swapped with the minio-user's, and you help me fix debmonitor (which should be easier) rather than trying to fix minio permissions
  • Create a hack for the user in the installer post-inst script, like the swift hosts do
  • Perform an upgrade "in place" so basic config files don't get recreated, including /etc/passwd
  • Any other option, including some kind of support on the reimage process/puppet for stable UIDs

I would prefer #1 for now to unblock the upgrades, as I think it would be the easiest, but as the owner of debmonitor, I would like your thoughts on this. Ideally, a better solution would be available in the long run.

Ok, so for context: over the last two years John and I made various changes to how system users are handled, but some of those changes are only trickling in with reimages, so the state can be a little confusing depending on whether we're looking at an old system or a new one.

  • System users which are local to a specific host use a UID in the 100-499 range (servers installed before October 2019 lacked a config setting for adduser/systemd-sysuser which specified the intended range). The only exception is systemd-coredump, which uses 999 (it gets created early, before we apply the config setting via Puppet)
  • Human users start at 1000 (as configured in modules/admin/data/data.yaml)
  • There are a handful of UIDs which need to be consistent across more than one server. Typically that's because files are moved around via rsync or similar, but minio falls into that area as well since the data persists across a reimage. These fleet-wide UIDs are allocated in modules/admin/data/data.yaml as well and start at 901. You mentioned swift above, which currently still uses the install hook hack, but it will also switch to that mechanism; UID 902 is already claimed in data.yaml (a quick check is sketched after this list)
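
For what it's worth, a quick way to inspect how these allocations landed on a given host (plain shell, assuming only the user names discussed above):

  # Show the UIDs currently assigned to the two users in question
  getent passwd debmonitor minio-user
  # List local system users in the 100-499 range
  awk -F: '$3 >= 100 && $3 < 500 {print $1, $3}' /etc/passwd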

My recommendation would be to:

  • Go with #1 for the current round of reimages to unblock you. You can easily change the debmonitor UID around; the only two directories/files you need to fix up with chown are 1) /etc/debmonitor (which holds the certs) and 2) /var/log/debmonitor-client. When done, simply run "sudo debmonitor-client" and if it doesn't flag any errors, everything went fine (a minimal sketch follows after this list).
  • Mid-term, also move minio to a fixed system UID. For that you can already claim it in data.yaml (the next one would be 914; you can leave it commented out initially) and then, server by server, migrate the existing files to 914. When all hosts are complete, you can then amend the systemd::sysuser config in class mediabackup::storage to use "id => '914:914'" and all future reimages will handle it correctly.
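
A minimal sketch of the debmonitor part of that (the new UID is a placeholder; pick whatever is free in the local range):

  # Move debmonitor to a free local UID so minio-user can reclaim the old one
  sudo usermod -u 499 debmonitor    # 499 is a placeholder value
  # Only these two locations hold debmonitor-owned files
  sudo chown -R debmonitor:debmonitor /etc/debmonitor /var/log/debmonitor-client
  # Verify: if this reports no errors, everything went fine
  sudo debmonitor-client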

Thanks, Moritz, that's super useful. I will ping the rest of data persistence about this, as I am not sure it is widely known (I didn't know about it), and it will be useful for database backups and backup storage (the mysql and backups users), which ideally will also persist their UIDs within the fleet.

FWIW, I'm planning to also document our UID handling properly on wikitech, sometime in the next weeks.

Change 784633 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] admin: Add placeholder to reserve uid and gid 914 for minio-user

https://gerrit.wikimedia.org/r/784633

Thanks @MoritzMuehlenhoff. Kindly suggesting you unsubscribe from the ticket now, as reimage tickets usually carry *lots* of noise that is probably irrelevant to you due to automation.

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye executed with errors:

  • backup1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye completed:

  • backup1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204201556_jynus_854386_backup1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2903969/

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye executed with errors:

  • backup1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204201556_jynus_854386_backup1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2903969/
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1006.eqiad.wmnet with OS bullseye completed:

  • backup1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210557_jynus_1195941_backup1006.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host backup2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2006.codfw.wmnet with OS bullseye completed:

  • backup2006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210633_jynus_3624578_backup2006.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1005.eqiad.wmnet with OS bullseye

Change 784633 merged by Jcrespo:

[operations/puppet@production] admin: Add placeholder to reserve uid and gid 914 for minio-user

https://gerrit.wikimedia.org/r/784633

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1005.eqiad.wmnet with OS bullseye completed:

  • backup1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210811_jynus_1217038_backup1005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host backup2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1004.eqiad.wmnet with OS bullseye completed:

  • backup1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210855_jynus_1243423_backup1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2005.codfw.wmnet with OS bullseye completed:

  • backup2005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210853_jynus_3642224_backup2005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host backup2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • New OS is buster but bullseye was requested
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2004.codfw.wmnet with OS bullseye completed:

  • backup2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204211032_jynus_3654892_backup2004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 785156 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Preconfigure mc client config on worker nodes

https://gerrit.wikimedia.org/r/785156

Change 785156 merged by Jcrespo:

[operations/puppet@production] mediabackup: Preconfigure mc client config on worker nodes

https://gerrit.wikimedia.org/r/785156

Change 785161 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Fix formatting and syntax error on mc config template

https://gerrit.wikimedia.org/r/785161

Change 785161 merged by Jcrespo:

[operations/puppet@production] mediabackups: Fix formatting and syntax error on mc config template

https://gerrit.wikimedia.org/r/785161

Change 785166 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Hide diffs from mc config file

https://gerrit.wikimedia.org/r/785166

Change 785166 merged by Jcrespo:

[operations/puppet@production] mediabackup: Hide diffs from mc config file

https://gerrit.wikimedia.org/r/785166

Change 786285 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Clone locally the mediawiki-config repo

https://gerrit.wikimedia.org/r/786285

Change 786285 merged by Jcrespo:

[operations/puppet@production] mediabackup: Clone locally the mediawiki-config repo

https://gerrit.wikimedia.org/r/786285

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host ms-backup2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host ms-backup2002.codfw.wmnet with OS bullseye completed:

  • ms-backup2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261014_jynus_291962_ms-backup2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host ms-backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host ms-backup1002.eqiad.wmnet with OS bullseye completed:

  • ms-backup1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261045_jynus_3035850_ms-backup1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host ms-backup2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host ms-backup1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host ms-backup2001.codfw.wmnet with OS bullseye completed:

  • ms-backup2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261113_jynus_300548_ms-backup2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host ms-backup1001.eqiad.wmnet with OS bullseye completed:

  • ms-backup1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261133_jynus_3073070_ms-backup1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-04-29T15:29:06Z] <jynus> update NIC firmware for backup1002 T286722 T305446

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Despite all the issues with backup1002, the good news is that upgrading bacula to the new minor version should cause no problems. Not only should there be no compatibility issues, it will also satisfy the FD <= (DIR == SD) version requirements.

backup1002 is currently sending its data to backup1008, and hopefully by tomorrow ( https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=backup1008&var-datasource=thanos&var-cluster=misc&viewPanel=28&from=1651478528336&to=1651543259092 ) I will be able to have it back up. If not, I will have to temporarily pool backup1008 as the new ES db backup producer.

Change 788706 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Wipe backup1002 completely

https://gerrit.wikimedia.org/r/788706

Change 788706 merged by Jcrespo:

[operations/puppet@production] install_server: Wipe backup1002 completely

https://gerrit.wikimedia.org/r/788706

Change 788707 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Update backup-format recipe to install on sdb/sdc

https://gerrit.wikimedia.org/r/788707

Change 788707 merged by Jcrespo:

[operations/puppet@production] install_server: Update backup-format recipe to install on sdb/sdc

https://gerrit.wikimedia.org/r/788707

backup1001 just got upgraded. Still monitoring to check that nothing broke in the process.

backup2001 is done as well. I tested a backup run and recovery to 2 hosts: one bullseye and one buster. All worked as expected. Dashboards look OK: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data&from=1652181147225&to=1652353947225 Scheduling works.

Considering this as resolved.