Page MenuHomePhabricator

Upgrade ganeti/codfw to Bullseye
Closed, ResolvedPublic

Description

Upgrade the codfw Ganeti cluster to Bullseye:

  • Add component/ganeti3 to all nodes
  • Upgrade all nodes to the new Ganeti 3
  • sudo gnt-cluster upgrade --to 3.0

Empty every node of primary/secondary instances. Remove it with gnt-node remove and reimage it to Bullseye, then re-add:

Row A:

  • ganeti2023.codfw.wmnet
  • ganeti2024.codfw.wmnet
  • ganeti2027.codfw.wmnet
  • ganeti2028.codfw.wmnet
  • ganeti2029.codfw.wmnet
  • ganeti2030.codfw.wmnet

Row B:

  • ganeti2019.codfw.wmnet
  • ganeti2020.codfw.wmnet
  • ganeti2021.codfw.wmnet
  • ganeti2022.codfw.wmnet

Row C:

  • ganeti2009.codfw.wmnet
  • ganeti2010.codfw.wmnet
  • ganeti2011.codfw.wmnet
  • ganeti2012.codfw.wmnet
  • ganeti2013.codfw.wmnet
  • ganeti2014.codfw.wmnet

Row D:

  • ganeti2015.codfw.wmnet
  • ganeti2016.codfw.wmnet
  • ganeti2017.codfw.wmnet
  • ganeti2018.codfw.wmnet
  • ganeti2025.codfw.wmnet
  • ganeti2026.codfw.wmnet

VMs temporarily running on DRBD, need to be switched back:

  • currently none

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:36:46Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:37:13Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:47:12Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:47:27Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to plain disks, T311686

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS bullseye completed:

  • ganeti2009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207210617_jmm_2371094_ganeti2009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS bullseye completed:

  • ganeti2026 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207210657_jmm_2380394_ganeti2026.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-07-21T09:56:27Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T09:56:43Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2001.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T10:46:00Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on kubetcd2006.codfw.wmnet with reason: Switch to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T10:46:16Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kubetcd2006.codfw.wmnet with reason: Switch to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T11:14:10Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2006.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T11:14:14Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2006.codfw.wmnet with reason: Switch instance to plain disk storage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T15:11:24Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2014.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-21T15:11:39Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2014.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-22T05:16:47Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2021.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-07-22T05:17:14Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2021.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bullseye completed:

  • ganeti2021 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207220519_jmm_2594346_ganeti2021.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye completed:

  • ganeti2014 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207220557_jmm_2603171_ganeti2014.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:18:04Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:18:31Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:33:18Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T07:33:23Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T12:59:45Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-02T13:00:23Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2013.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS bullseye completed:

  • ganeti2013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208021302_jmm_950165_ganeti2013.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-03T07:05:32Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T07:05:47Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T09:32:34Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-03T09:33:01Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2011.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS bullseye completed:

  • ganeti2011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208030934_jmm_1136677_ganeti2011.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-04T05:16:56Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T05:17:12Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye completed:

  • ganeti2030 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208040522_jmm_112037_ganeti2030.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-04T08:04:11Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T08:04:26Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T10:27:16Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-04T10:27:32Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye completed:

  • ganeti2017 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208041030_jmm_167180_ganeti2017.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:38:23Z] <moritzm> draining ganeti2019 for reimage T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-23T06:41:58Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-23T06:42:14Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2019.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bullseye

Change 825678 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Disable Ganeti cluster rebalances temporarily

https://gerrit.wikimedia.org/r/825678

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2019.codfw.wmnet with OS bullseye completed:

  • ganeti2019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208230649_jmm_1639683_ganeti2019.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 825678 merged by Muehlenhoff:

[operations/puppet@production] Disable Ganeti cluster rebalances temporarily

https://gerrit.wikimedia.org/r/825678

Mentioned in SAL (#wikimedia-operations) [2022-08-25T14:35:40Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-25T14:35:55Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2025.codfw.wmnet with OS bullseye completed:

  • ganeti2025 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208260721_jmm_2362782_ganeti2025.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:53:14Z] <moritzm> failover Ganeti master in codfw to ganeti2020 T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T09:31:42Z] <moritzm> draining ganeti2022 for eventual reimage T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T09:51:09Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T09:51:24Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to DRBD, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T10:15:49Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T10:15:54Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch instance to plain disks, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T15:17:32Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-30T15:17:47Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2022.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS bullseye completed:

  • ganeti2022 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208301632_jmm_3386722_ganeti2022.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-08-31T11:22:23Z] <moritzm> draining ganeti2015 for eventual reimage T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-31T15:54:25Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Mentioned in SAL (#wikimedia-operations) [2022-08-31T15:54:40Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2015.codfw.wmnet with reason: Remove node for eventual reimage, T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS bullseye completed:

  • ganeti2015 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202209010650_jmm_3765619_ganeti2015.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-09-01T11:59:48Z] <moritzm> rebalance row B after completed Bullseye updates T311686

Mentioned in SAL (#wikimedia-operations) [2022-09-27T10:03:50Z] <moritzm> rebalance ganeti/codfw row D after completed Bullseye update T311686

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2023.codfw.wmnet with OS bullseye completed:

  • ganeti2023 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202310060712_jmm_3237070_ganeti2023.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
MoritzMuehlenhoff updated the task description. (Show Details)

This is complete