Upgrade ganeti/eqiad to Bullseye
Closed, Resolved, Public

Description

Upgrade the eqiad Ganeti cluster to Bullseye:

  • Add component/ganeti3 to all nodes
  • Upgrade all nodes to the new Ganeti 3
  • sudo gnt-cluster upgrade --to 3.0

Then, for each node: empty it of all primary and secondary instances, remove it from the cluster with gnt-node remove, reimage it to Bullseye, and re-add it to the cluster.
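The per-node procedure above can be sketched as a dry-run plan. This is only an illustration, not the exact commands used for this task: the helper name node_reimage_plan, the use of the hail allocator for evacuating secondaries, the short-hostname argument to the reimage cookbook, and the optional node group on re-add are all assumptions. It prints commands and executes nothing.

```python
import shlex


def node_reimage_plan(node, task="T311687", os_name="bullseye", group=None):
    """Return the ordered shell commands for one node as strings (dry run only).

    Hypothetical helper: the flags below are assumptions about a typical
    drain/remove/reimage/re-add cycle, not taken from this task.
    """
    fqdn = node if "." in node else f"{node}.eqiad.wmnet"
    short = fqdn.split(".")[0]

    add_cmd = ["sudo", "gnt-node", "add"]
    if group is not None:
        # Assumed: re-add into a specific node group (one group per row).
        add_cmd += ["-g", group]
    add_cmd.append(fqdn)

    steps = [
        # Move instances whose primary is on this node to their secondary.
        ["sudo", "gnt-node", "migrate", "-f", fqdn],
        # Move secondary replicas elsewhere; the hail allocator is an assumption.
        ["sudo", "gnt-node", "evacuate", "-s", "-I", "hail", fqdn],
        # Take the now-empty node out of the cluster.
        ["sudo", "gnt-node", "remove", fqdn],
        # Reimage; the cookbook invocation shape is an assumption.
        ["sudo", "cookbook", "sre.hosts.reimage", "--os", os_name, "-t", task, short],
        # Re-add the freshly installed node.
        add_cmd,
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for command in node_reimage_plan("ganeti1017"):
        print(command)
```

Generating the plan as strings keeps the sketch safe to run anywhere and makes the order of the steps reviewable before anything touches a node.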

Row A:

  • ganeti1005.eqiad.wmnet (replaced by ganeti1029 using bullseye)
  • ganeti1006.eqiad.wmnet (replaced by ganeti1030 using bullseye)
  • ganeti1007.eqiad.wmnet (replaced by ganeti1031 using bullseye)
  • ganeti1008.eqiad.wmnet (replaced by ganeti1032 using bullseye)
  • ganeti1023.eqiad.wmnet
  • ganeti1025.eqiad.wmnet
  • ganeti1026.eqiad.wmnet

Row B:

  • ganeti1013.eqiad.wmnet
  • ganeti1014.eqiad.wmnet
  • ganeti1015.eqiad.wmnet
  • ganeti1016.eqiad.wmnet
  • ganeti1017.eqiad.wmnet
  • ganeti1018.eqiad.wmnet

Row C:

  • ganeti1009.eqiad.wmnet
  • ganeti1010.eqiad.wmnet
  • ganeti1011.eqiad.wmnet
  • ganeti1012.eqiad.wmnet
  • ganeti1024.eqiad.wmnet
  • ganeti1027.eqiad.wmnet
  • ganeti1028.eqiad.wmnet

Row D:

  • ganeti1019.eqiad.wmnet
  • ganeti1020.eqiad.wmnet
  • ganeti1021.eqiad.wmnet
  • ganeti1022.eqiad.wmnet

Event Timeline

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1017.eqiad.wmnet with OS bullseye completed:

  • ganeti1017 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210130816_jmm_166015_ganeti1017.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-10-25T08:02:09Z] <moritzm> drain ganeti1023 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-10-25T12:39:39Z] <moritzm> drain ganeti1015 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS bullseye completed:

  • ganeti1023 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210251238_jmm_3030967_ganeti1023.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1015.eqiad.wmnet with OS bullseye completed:

  • ganeti1015 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210260852_jmm_3237357_ganeti1015.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-10-26T12:02:58Z] <moritzm> draining ganeti1009 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS bullseye completed:

  • ganeti1009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202210270654_jmm_3476730_ganeti1009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-01T08:32:46Z] <moritzm> draining ganeti1028 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-11-01T13:59:49Z] <moritzm> draining ganeti1016 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1028.eqiad.wmnet with OS bullseye completed:

  • ganeti1028 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211011357_jmm_547889_ganeti1028.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1016.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-11-02T08:36:40Z] <moritzm> draining ganeti1020 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1016.eqiad.wmnet with OS bullseye completed:

  • ganeti1016 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211020815_jmm_732965_ganeti1016.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-02T12:59:21Z] <moritzm> draining ganeti1025 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1025 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye executed with errors:

  • ganeti1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bullseye completed:

  • ganeti1025 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211031139_jmm_1008304_ganeti1025.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-07T09:18:05Z] <moritzm> draining ganeti1010 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1010.eqiad.wmnet with OS bullseye completed:

  • ganeti1010 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211071347_jmm_1973941_ganeti1010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-08T09:17:25Z] <moritzm> drain ganeti1018 for eventual reimage to bullseye T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-08T11:00:36Z] <moritzm> drain ganeti1024 for eventual reimage to bullseye T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-08T16:24:09Z] <moritzm> drain ganeti1024 for eventual reimage to bullseye T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1018.eqiad.wmnet with OS bullseye completed:

  • ganeti1018 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211090806_jmm_2397124_ganeti1018.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS bullseye completed:

  • ganeti1024 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211090840_jmm_2403818_ganeti1024.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-09T13:24:01Z] <moritzm> drain ganeti1013 for eventual reimage to bullseye T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1013.eqiad.wmnet with OS bullseye completed:

  • ganeti1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211100846_jmm_2650687_ganeti1013.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-10T14:00:55Z] <moritzm> drain ganeti1020 for eventual reimage to bullseye T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1020.eqiad.wmnet with OS bullseye completed:

  • ganeti1020 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211110814_jmm_2914438_ganeti1020.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-14T07:50:57Z] <moritzm> draining ganeti1021 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1021.eqiad.wmnet with OS bullseye completed:

  • ganeti1021 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211141450_jmm_3706235_ganeti1021.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-15T09:36:10Z] <moritzm> draining ganeti1022 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1022.eqiad.wmnet with OS bullseye completed:

  • ganeti1022 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211151545_jmm_3961375_ganeti1022.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Hi, Moritz,

I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye upgrade (in particular, of a ganeti upgrade), as I don't see them on the not-yet-upgraded hosts, and they start exactly on the same day the hosts were upgraded. FYI:

Cron <root@ganeti1021> [ -x /usr/sbin/gnt-cluster ] && /usr/sbin/gnt-cluster upgrade --resume:
Cluster configuration incomplete: 'Can't read ssconf file /var/lib/ganeti/ssconf_master_node: [Errno 2] No such file or directory: '/var/lib/ganeti/ssconf_master_node'

and

ganeti1021 ganeti[924]: Missing configuration file /var/lib/ganeti/server.pem

Yes, these are log spam artefacts related to the update. We'll be done soon (all servers except one by next week), and then they will be gone.

Ah, so you mean they are temporary during the maintenance, and won't happen once all migrations are done? Then please keep up the good work :-P

Indeed, those are temporary until the migration is complete
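Both messages in this exchange name files under /var/lib/ganeti that are missing on the affected host. A minimal sketch of a check for those two files follows, assuming the messages appear because the reimaged node has not yet been re-added to the cluster (the thread only says they are temporary). The function names are hypothetical and this is not part of Ganeti itself.

```python
from pathlib import Path

# Directory and file names are taken from the two error messages above.
GANETI_DATA_DIR = Path("/var/lib/ganeti")
REQUIRED_FILES = ("ssconf_master_node", "server.pem")


def missing_cluster_files(data_dir=GANETI_DATA_DIR):
    """Return the names of the expected files that are absent from data_dir."""
    return [name for name in REQUIRED_FILES if not (Path(data_dir) / name).is_file()]


def looks_joined(data_dir=GANETI_DATA_DIR):
    """Heuristic (assumption): the node has cluster state if both files exist."""
    return not missing_cluster_files(data_dir)


if __name__ == "__main__":
    missing = missing_cluster_files()
    if missing:
        print("Missing cluster files:", ", ".join(missing))
    else:
        print("Both files are present")
```

A check like this could tell expected noise on a not-yet-re-added node apart from the same errors on a node that should already be a cluster member.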

Mentioned in SAL (#wikimedia-operations) [2022-11-16T14:39:44Z] <moritzm> draining ganeti1019 for eventual reimage T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-17T08:50:15Z] <moritzm> draining ganeti1019 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye completed:

  • ganeti1019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211171507_jmm_274410_ganeti1019.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-22T09:25:16Z] <moritzm> failover Ganeti master in eqiad to ganeti1028 T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-22T16:02:18Z] <moritzm> drain ganeti1027 for eventual reimage to Bullseye T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1027.eqiad.wmnet with OS bullseye completed:

  • ganeti1027 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202211230811_jmm_264055_ganeti1027.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-11-23T14:41:32Z] <moritzm> rebalance Ganeti group B/eqiad T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-24T14:58:19Z] <moritzm> rebalance Ganeti group C/eqiad T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-28T09:12:32Z] <moritzm> rebalance Ganeti group A/eqiad T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-28T13:51:34Z] <moritzm> rebalance Ganeti group C/eqiad T311687

Mentioned in SAL (#wikimedia-operations) [2022-11-28T14:35:22Z] <moritzm> rebalance Ganeti group D/eqiad T311687

Mentioned in SAL (#wikimedia-operations) [2023-03-07T13:04:48Z] <moritzm> drain ganeti1011 for eventual reimage to Bullseye T311687

Mentioned in SAL (#wikimedia-operations) [2023-03-08T09:30:30Z] <moritzm> drain ganeti1011 for eventual reimage to Bullseye T311687

Icinga downtime and Alertmanager silence (ID=488c31ea-afbd-425c-93db-bb4f4daa8146) set by jmm@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: remove from cluster for reimage

ganeti1011.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-09T08:33:17Z] <moritzm> remove ganeti1011 for eventual reimage T311687

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS bullseye completed:

  • ganeti1011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303090851_jmm_3583065_ganeti1011.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2023-03-09T12:22:58Z] <moritzm> rebalancing ganeti eqiad/C after completion of bullseye updates T311687

Icinga downtime and Alertmanager silence (ID=dcc641f3-257f-4a0d-875d-85c9d542b7f8) set by jmm@cumin2002 for 3 days, 0:00:00 on 1 host(s) and their services with reason: Some tests with pybal/Bullseye

pybal-test2003.codfw.wmnet