Upgrade backup* hosts to bullseye
Closed, Resolved · Public

Description

Upgrade the following hosts, used for database backups, media backups and bacula storage to Debian Bullseye:

  • ms-backup1001
  • ms-backup1002
  • ms-backup2001
  • ms-backup2002
  • backup1001 (requires firmware update)
  • backup1002 (requires firmware update)
  • backup1003
  • backup1004
  • backup1005
  • backup1006
  • backup1007
  • backup1008
  • backup2001 (requires firmware update)
  • backup2002 (requires firmware update, it may need an additional clean reimage)
  • backup2003
  • backup2004
  • backup2005
  • backup2006
  • backup2007
  • backup2008

Event Timeline

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2007.codfw.wmnet with OS bullseye completed:

  • backup2007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204110859_jynus_1923350_backup2007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@MoritzMuehlenhoff I need your advice here. When reimaging backup hosts (while preserving their data dir), the debmonitor user takes over the UID of the minio-user. This requires changing the owner of the files, but that doesn't scale: it took over a week, even with multiple threads running in parallel, to update the hundreds of millions of backup files on backup2007 (I went through with it because I was already too deep in when I realized). During that maintenance, minio has to be down. Note that this is not exclusive to backup hosts, but it is worse there because we have higher consolidation than on swift, as backups are not focused on performance (slow disks with lots of data).
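
For reference, the kind of parallel ownership fixup this implies looks roughly like the following (a sketch only: the data path, parallelism values and service unit name are placeholders, not our actual layout):

  # Fix ownership of the preserved minio data tree after the UID collision.
  # minio has to stay stopped while this runs, hence the long maintenance window.
  sudo systemctl stop minio    # placeholder unit name
  # chown in parallel batches; even so, hundreds of millions of files take days.
  sudo find /srv/mediabackup -xdev -user debmonitor -print0 \
    | xargs -0 -P 8 -n 10000 sudo chown minio-user:minio-user
  sudo systemctl start minio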

The way I see it, there are a few alternatives:

  • After reimage, the debmonitor user's UID gets swapped with the minio-user's, and you help me fix debmonitor (which should be easier) rather than trying to fix minio permissions
  • Create a hack for the user in the installer post-inst script, like the swift hosts do
  • Perform an upgrade "in place" so basic config files don't get recreated, including /etc/passwd
  • Any other option, including some kind of support on the reimage process/puppet for stable UIDs

I would prefer #1 for now to unblock the upgrades, as I think it would be the easiest, but as the owner of debmonitor, I would like your thoughts on this. Ideally, a better solution would be available in the long run.

Ok, so for context: over the last two years John and I made various changes to how system users are handled, but some of those changes are only trickling in with reimages, so the state can be a little confusing depending on whether we're looking at an old system or a new one.

  • System users which are local to a specific host use a UID in the 100-499 range (servers installed before October 2019 lacked a config setting for adduser/systemd-sysuser which specified the intended range). The only exception is systemd-coredump, which uses 999 (it gets created early, before we apply the config setting via Puppet)
  • Human users start at 1000 (as configured in modules/admin/data/data.yaml)
  • There are a handful of UIDs which need to be consistent across more than one server. Typically that's because files are moved around via rsync or similar, but minio falls into that area as well since the data persists across a reimage. These fleet-wide UIDs are allocated in modules/admin/data/data.yaml as well and start at 901. You mentioned swift above, which currently still uses the install hook hack, but it will also switch to that mechanism; UID 902 is already claimed in data.yaml (a quick check is sketched after this list)
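
For what it's worth, a quick way to inspect how these allocations landed on a given host (plain shell, assuming only the user names discussed above):

  # Show the UIDs currently assigned to the two users in question
  getent passwd debmonitor minio-user
  # List local system users in the 100-499 range
  awk -F: '$3 >= 100 && $3 < 500 {print $1, $3}' /etc/passwd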

My recommendation would be to:

  • Go with #1 for the current round of reimages to unblock you. You can easily change the debmonitor UID around; the only two directories/files you need to fix up with chown are 1) /etc/debmonitor (which holds the certs) and 2) /var/log/debmonitor-client. When done, simply run "sudo debmonitor-client" and if it doesn't flag any errors, everything went fine (a minimal sketch follows after this list).
  • Mid-term, also move minio to a fixed system UID. For that you can already claim it in data.yaml (the next one would be 914; you can leave it commented out initially) and then, server by server, migrate the existing files to 914. When all hosts are complete, you can then amend the systemd::sysuser config in class mediabackup::storage to use "id => '914:914'" and all future reimages will handle it correctly.
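
A minimal sketch of the debmonitor part of that (the new UID is a placeholder; pick whatever is free in the local range):

  # Move debmonitor to a free local UID so minio-user can reclaim the old one
  sudo usermod -u 499 debmonitor    # 499 is a placeholder value
  # Only these two locations hold debmonitor-owned files
  sudo chown -R debmonitor:debmonitor /etc/debmonitor /var/log/debmonitor-client
  # Verify: if this reports no errors, everything went fine
  sudo debmonitor-client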

Thanks, Moritz, that's super useful. I will ping the rest of data persistence about this, as I am not sure it is widely known (I didn't know about it), and it will be useful for database backups and backup storage (the mysql and backups users), which ideally will also persist their UIDs within the fleet.

FWIW, I'm planning to also document our UID handling properly on wikitech, sometime in the next weeks.

Change 784633 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] admin: Add placeholder to reserve uid and gid 914 for minio-user

https://gerrit.wikimedia.org/r/784633

Thanks @MoritzMuehlenhoff. Kindly suggesting you unsubscribe from the ticket now, as reimage tickets usually carry *lots* of noise that is probably irrelevant to you due to automation.

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye executed with errors:

  • backup1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye completed:

  • backup1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204201556_jynus_854386_backup1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2903969/

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1007.eqiad.wmnet with OS bullseye executed with errors:

  • backup1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204201556_jynus_854386_backup1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2903969/
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1006.eqiad.wmnet with OS bullseye completed:

  • backup1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210557_jynus_1195941_backup1006.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host backup2006.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2006.codfw.wmnet with OS bullseye completed:

  • backup2006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210633_jynus_3624578_backup2006.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1005.eqiad.wmnet with OS bullseye

Change 784633 merged by Jcrespo:

[operations/puppet@production] admin: Add placeholder to reserve uid and gid 914 for minio-user

https://gerrit.wikimedia.org/r/784633

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1005.eqiad.wmnet with OS bullseye completed:

  • backup1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210811_jynus_1217038_backup1005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host backup2005.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1004.eqiad.wmnet with OS bullseye completed:

  • backup1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210855_jynus_1243423_backup1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2005.codfw.wmnet with OS bullseye completed:

  • backup2005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204210853_jynus_3642224_backup2005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host backup2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • New OS is buster but bullseye was requested
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host backup2004.codfw.wmnet with OS bullseye completed:

  • backup2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204211032_jynus_3654892_backup2004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 785156 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Preconfigure mc client config on worker nodes

https://gerrit.wikimedia.org/r/785156

Change 785156 merged by Jcrespo:

[operations/puppet@production] mediabackup: Preconfigure mc client config on worker nodes

https://gerrit.wikimedia.org/r/785156

Change 785161 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Fix formatting and syntax error on mc config template

https://gerrit.wikimedia.org/r/785161

Change 785161 merged by Jcrespo:

[operations/puppet@production] mediabackups: Fix formatting and syntax error on mc config template

https://gerrit.wikimedia.org/r/785161

Change 785166 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Hide diffs from mc config file

https://gerrit.wikimedia.org/r/785166

Change 785166 merged by Jcrespo:

[operations/puppet@production] mediabackup: Hide diffs from mc config file

https://gerrit.wikimedia.org/r/785166

Change 786285 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Clone locally the mediawiki-config repo

https://gerrit.wikimedia.org/r/786285

Change 786285 merged by Jcrespo:

[operations/puppet@production] mediabackup: Clone locally the mediawiki-config repo

https://gerrit.wikimedia.org/r/786285

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host ms-backup2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host ms-backup2002.codfw.wmnet with OS bullseye completed:

  • ms-backup2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261014_jynus_291962_ms-backup2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host ms-backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host ms-backup1002.eqiad.wmnet with OS bullseye completed:

  • ms-backup1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261045_jynus_3035850_ms-backup1002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host ms-backup2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host ms-backup1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host ms-backup2001.codfw.wmnet with OS bullseye completed:

  • ms-backup2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261113_jynus_300548_ms-backup2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host ms-backup1001.eqiad.wmnet with OS bullseye completed:

  • ms-backup1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261133_jynus_3073070_ms-backup1001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-04-29T15:29:06Z] <jynus> update NIC firmware for backup1002 T286722 T305446

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS buster executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin1001 for host backup1002.eqiad.wmnet with OS bullseye executed with errors:

  • backup1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Despite all the issues with backup1002, the good news is that upgrading bacula to the new minor version should cause no problems. Not only should there be no compatibility issues, it will also satisfy the FD <= (DIR == SD) version requirements.

backup1002 is currently sending its data to backup1008, and hopefully by tomorrow ( https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=backup1008&var-datasource=thanos&var-cluster=misc&viewPanel=28&from=1651478528336&to=1651543259092 ) I will be able to have it back up. If not, I will have to temporarily pool backup1008 as the new ES db backup producer.

Change 788706 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Wipe backup1002 completely

https://gerrit.wikimedia.org/r/788706

Change 788706 merged by Jcrespo:

[operations/puppet@production] install_server: Wipe backup1002 completely

https://gerrit.wikimedia.org/r/788706

Change 788707 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] install_server: Update backup-format recipe to install on sdb/sdc

https://gerrit.wikimedia.org/r/788707

Change 788707 merged by Jcrespo:

[operations/puppet@production] install_server: Update backup-format recipe to install on sdb/sdc

https://gerrit.wikimedia.org/r/788707

backup1001 just got upgraded. Still monitoring to check that nothing broke in the process.

backup2001 is done as well. I tested a backup run and recovery to 2 hosts: one bullseye and one buster. All worked as expected. Dashboards look OK: https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=gerrit1001.wikimedia.org-Hourly-Sun-production-gerrit-repo-data&from=1652181147225&to=1652353947225 Scheduling works.

Considering this as resolved.