T311686 Upgrade ganeti/codfw to Bullseye

	Subject	Repo	Branch	Lines +/-
	Disable Ganeti cluster rebalances temporarily	operations/puppet	production	+1 -1
	Enable component/ganeti3 for codfw	operations/puppet	production	+1 -0

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:36:46Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:37:13Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686

MoritzMuehlenhoff updated the task description. Jul 21 2022, 6:47 AM

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:47:12Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:47:27Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to plain disks, T311686

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS bullseye completed:

ganeti2009 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207210617_jmm_2371094_ganeti2009.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Jul 21 2022, 6:55 AM

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bullseye

MoritzMuehlenhoff updated the task description. Jul 21 2022, 7:17 AM

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bullseye completed:

ganeti2026 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207210657_jmm_2380394_ganeti2026.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-07-21T09:56:27Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T09:56:43Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

MoritzMuehlenhoff updated the task description. Jul 21 2022, 10:04 AM

MoritzMuehlenhoff updated the task description. Jul 21 2022, 10:13 AM

MoritzMuehlenhoff updated the task description. Jul 21 2022, 10:43 AM

Mentioned in SAL (#wikimedia-operations) [2022-07-21T10:46:00Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on kubetcd2006.codfw.wmnet with reason: Switch to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T10:46:16Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubetcd2006.codfw.wmnet with reason: Switch to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T11:14:10Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T11:14:14Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T15:11:24Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2014.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T15:11:39Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2014.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-22T05:16:47Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2021.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-22T05:17:14Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2021.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bullseye completed:

ganeti2021 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207220519_jmm_2594346_ganeti2021.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye completed:

ganeti2014 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207220557_jmm_2603171_ganeti2014.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Jul 22 2022, 10:59 AM

MoritzMuehlenhoff updated the task description. Jul 22 2022, 11:59 AM

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:18:04Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:18:31Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:33:18Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:33:23Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T12:59:45Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:00:23Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS bullseye completed:

ganeti2013 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208021302_jmm_950165_ganeti2013.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-03T07:00:27Z] <moritzm> draining ganeti2011 T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T07:05:32Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T07:05:47Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T09:32:34Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T09:33:01Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye

MoritzMuehlenhoff updated the task description. Aug 3 2022, 9:48 AM

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye completed:

ganeti2011 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208030934_jmm_1136677_ganeti2011.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Aug 3 2022, 11:12 AM

Mentioned in SAL (#wikimedia-operations) [2022-08-04T05:16:56Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T05:17:12Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye completed:

ganeti2030 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208040522_jmm_112037_ganeti2030.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Aug 4 2022, 7:42 AM

MoritzMuehlenhoff updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-08-04T08:04:11Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T08:04:26Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T08:48:27Z] <moritzm> draining ganeti2017 T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T10:27:16Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T10:27:32Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye completed:

ganeti2017 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208041030_jmm_167180_ganeti2017.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Aug 4 2022, 12:31 PM

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:38:23Z] <moritzm> draining ganeti2019 for reimage T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-23T06:41:58Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-23T06:42:14Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bullseye

Change 825678 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Disable Ganeti cluster rebalances temporarily

https://gerrit.wikimedia.org/r/825678

gerritbot added a project: Patch-For-Review. Aug 23 2022, 7:10 AM

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bullseye completed:

ganeti2019 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208230649_jmm_1639683_ganeti2019.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change 825678 merged by Muehlenhoff:

[operations/puppet@production] Disable Ganeti cluster rebalances temporarily

https://gerrit.wikimedia.org/r/825678

Maintenance_bot removed a project: Patch-For-Review. Aug 23 2022, 8:30 AM

MoritzMuehlenhoff updated the task description. Aug 24 2022, 8:15 AM

Mentioned in SAL (#wikimedia-operations) [2022-08-25T14:35:40Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-25T14:35:55Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye completed:

ganeti2025 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208260721_jmm_2362782_ganeti2025.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Aug 26 2022, 10:42 AM

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:53:14Z] <moritzm> failover Ganeti master in codfw to ganeti2020 T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T09:31:42Z] <moritzm> draining ganeti2022 for eventual reimage T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T09:51:09Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T09:51:24Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T10:15:49Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T10:15:54Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T15:17:32Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T15:17:47Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bullseye completed:

ganeti2022 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208301632_jmm_3386722_ganeti2022.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Aug 31 2022, 8:45 AM

Mentioned in SAL (#wikimedia-operations) [2022-08-31T11:22:23Z] <moritzm> draining ganeti2015 for eventual reimage T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-31T15:54:25Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-31T15:54:40Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS bullseye completed:

ganeti2015 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202209010650_jmm_3765619_ganeti2015.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

MoritzMuehlenhoff updated the task description. Sep 1 2022, 8:32 AM

Mentioned in SAL (#wikimedia-operations) [2022-09-01T11:59:48Z] <moritzm> rebalance row B after completed Bullseye updates T311686

Mentioned in SAL (#wikimedia-operations) [2022-09-27T10:03:50Z] <moritzm> rebalance ganeti/codfw row D after completed Bullseye update T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bullseye completed:

ganeti2023 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310060712_jmm_3237070_ganeti2023.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

This is complete.

Closed, Resolved (Public)

	MoritzMuehlenhoff
	Jun 30 2022, 7:25 AM