Page MenuHomePhabricator

Migrate codfw Ganeti cluster to Buster
Closed, ResolvedPublic

Description

  • Add component/ganeti216 to all nodes
  • Upgrade all nodes using "cumin A:ganeti-codfw 'apt-get install -y ganeti'"
  • cumin A:ganeti-codfw 'chown gnt-metad /var/log/ganeti/meta-daemon.log'
  • sudo gnt-cluster renew-crypto --new-cluster-certificate --new-rapi-certificate --new-spice-certificate
  • sudo gnt-cluster upgrade --to 2.16
  • sudo gnt-cluster renew-crypto --new-node-certificates (might not be needed for future updates if the hashing algo stays the same)

Following that empty/reimage/readd the following virtualisation servers:

  • ganeti2007.codfw.wmnet
  • ganeti2008.codfw.wmnet
  • ganeti2009.codfw.wmnet
  • ganeti2010.codfw.wmnet
  • ganeti2011.codfw.wmnet
  • ganeti2012.codfw.wmnet
  • ganeti2013.codfw.wmnet
  • ganeti2014.codfw.wmnet
  • ganeti2015.codfw.wmnet
  • ganeti2016.codfw.wmnet
  • ganeti2017.codfw.wmnet
  • ganeti2018.codfw.wmnet
  • ganeti2019.codfw.wmnet
  • ganeti2020.codfw.wmnet
  • ganeti2021.codfw.wmnet
  • ganeti2022.codfw.wmnet
  • ganeti2023.codfw.wmnet
  • ganeti2024.codfw.wmnet

The following instances were temporarily converted from plain disk storage to DRBD:

  • ml-etcd2001
  • ml-etcd2003
  • kubetcd2004.codfw.wmnet (and reverted back to plain)
  • kubetcd2005.codfw.wmnet (and reverted back to plain)
  • kubetcd2006.codfw.wmnet (and reverted back to plain)

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Mentioned in SAL (#wikimedia-operations) [2021-12-01T08:56:10Z] <moritzm> draining primary/secondary instance off ganeti2010 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster executed with errors:

  • ganeti2010 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster executed with errors:

  • ganeti2010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster executed with errors:

  • ganeti2010 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster completed:

  • ganeti2010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112020802_jmm_3158750_ganeti2010.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-02T09:52:33Z] <moritzm> draining primary/secondary instances off ganeti2009 T296622

Mentioned in SAL (#wikimedia-operations) [2021-12-02T11:21:35Z] <moritzm> draining primary/secondary instances off ganeti2022 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster executed with errors:

  • ganeti2009 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2009.codfw.wmnet with OS buster completed:

  • ganeti2009 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112030829_jmm_3340607_ganeti2009.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-03T09:38:10Z] <moritzm> draining primary/secondary instances off ganeti2011 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster executed with errors:

  • ganeti2011 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2022.codfw.wmnet with OS buster completed:

  • ganeti2022 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112031130_jmm_3364086_ganeti2022.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-03T12:37:15Z] <moritzm> draining primary/secondary instances off ganeti2007 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-12-06T10:23:32Z] <moritzm> draining primary/secondary instances off ganeti2015 T296622

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2011.codfw.wmnet with OS buster completed:

  • ganeti2011 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112060958_jmm_3884662_ganeti2011.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-06T11:12:37Z] <moritzm> draining primary/secondary instances off ganeti2016 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-12-06T14:19:17Z] <moritzm> draining primary/secondary instances off ganeti2012 T296622

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster executed with errors:

  • ganeti2016 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-12-07T08:55:50Z] <moritzm> draining primary/secondary instances off ganeti2013 T296622

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2016.codfw.wmnet with OS buster completed:

  • ganeti2016 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112070846_jmm_4046836_ganeti2016.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2012.codfw.wmnet with OS buster completed:

  • ganeti2012 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112070929_jmm_4052927_ganeti2012.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-07T11:51:51Z] <moritzm> draining primary/secondary instances off ganeti2014 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2013.codfw.wmnet with OS buster completed:

  • ganeti2013 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112080834_jmm_23826_ganeti2013.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS buster completed:

  • ganeti2014 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112080922_jmm_31220_ganeti2014.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-08T13:56:33Z] <moritzm> drain primary/secondary instance off ganeti2015 T296622

Mentioned in SAL (#wikimedia-operations) [2021-12-08T14:17:01Z] <moritzm> drain primary/secondary instance off ganeti2020 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS buster completed:

  • ganeti2020 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112090819_jmm_199207_ganeti2020.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-10T08:13:11Z] <moritzm> drain primary/secondary instance off ganeti2008 T296622

Mentioned in SAL (#wikimedia-operations) [2021-12-10T14:55:09Z] <moritzm> drain primary/secondary instances off ganeti2017 T296622

Mentioned in SAL (#wikimedia-operations) [2021-12-13T07:46:25Z] <moritzm> drain primary/secondary instances off ganeti2021 T296622

MoritzMuehlenhoff changed the task status from Open to In Progress.Dec 13 2021, 3:20 PM

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2008.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2008.codfw.wmnet with OS buster completed:

  • ganeti2008 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112140821_jmm_1050014_ganeti2008.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-14T08:54:47Z] <moritzm> failover Ganeti master to ganeti2016 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS buster completed:

  • ganeti2017 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112140857_jmm_1055424_ganeti2017.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS buster completed:

  • ganeti2021 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112140946_jmm_1063069_ganeti2021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-14T12:19:56Z] <moritzm> drain primary/secondary instances off ganeti2018 T296622

Mentioned in SAL (#wikimedia-operations) [2021-12-14T15:50:58Z] <moritzm> drain primary/secondary instances off ganeti2023 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2018.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2018.codfw.wmnet with OS buster completed:

  • ganeti2018 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112150844_jmm_1224861_ganeti2018.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-15T12:50:12Z] <moritzm> drain primary/secondary instances off ganeti2024 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS buster

Mentioned in SAL (#wikimedia-operations) [2021-12-16T08:55:59Z] <moritzm> drain primary/secondary instances off ganeti2015 T296622

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2024.codfw.wmnet with OS buster completed:

  • ganeti2024 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112160819_jmm_1397524_ganeti2024.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-12-16T10:09:27Z] <moritzm> drain primary/secondary instances off ganeti2007 T296622

Mentioned in SAL (#wikimedia-operations) [2021-12-16T14:44:32Z] <moritzm> drain primary/secondary instances off ganeti2007 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2015.codfw.wmnet with OS buster completed:

  • ganeti2015 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112200814_jmm_2066526_ganeti2015.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Along with the reimages 2012 (row C) and 2016 got auto-promoted as master candidates, but the VIP used for RAPI access by cookbooks needs to be from row B. I've manually ran "sudo gnt-node modify --master-candidate=yes " for 2020/2021/2022 and removed it for 2012/2016. As a followup when the whole cluster update is done, I'll run "gnt-node modify --master-capable=no" for the nodes in row A/C/D, so that this is taken into account automatically in the future.

Mentioned in SAL (#wikimedia-operations) [2021-12-20T13:33:29Z] <moritzm> fail over master in codfw to ganeti2021 T296622

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2007.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2007.codfw.wmnet with OS buster completed:

  • ganeti2007 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112210907_jmm_2250627_ganeti2007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
MoritzMuehlenhoff claimed this task.

This is complete, the entire Ganeti cluster is codfw is on Buster.