
Upgrade ganeti-test to Bullseye
Closed, Resolved, Public

Description

Test the Ganeti Bullseye update in ganeti-test. Tentative steps:

  • Add component/ganeti3 to all nodes
  • Upgrade all nodes to the new Ganeti 3
  • Investigate whether we need to renew cluster/rapi/spice/node certificates (not needed)
  • sudo gnt-cluster upgrade --to 3.0

Empty each node of primary/secondary instances, remove it with gnt-node remove, reimage it to Bullseye, then re-add it with gnt-node add:

  • ganeti-test2001
  • ganeti-test2002
  • ganeti-test2003

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-04-21T09:41:28Z] <moritzm> upgrading the Ganeti test cluster to 3.0 T306499

Mentioned in SAL (#wikimedia-operations) [2022-04-25T11:20:17Z] <moritzm> failover Ganeti master in codfw-test to ganeti-test2003 T306499

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2001.codfw.wmnet with OS bullseye completed:

  • ganeti-test2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204260945_jmm_287028_ganeti-test2001.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-04-27T13:46:58Z] <moritzm> rebalance ganeti-test after adding new bullseye node T306499

Mentioned in SAL (#wikimedia-operations) [2022-04-28T09:32:43Z] <moritzm> uploaded ganeti 3.0.1-2+deb11u0 to apt.wikimedia.org (backport of Py2->Py3 regression) T306499

Mentioned in SAL (#wikimedia-operations) [2022-04-28T10:56:40Z] <moritzm> failover Ganeti master in ganeti-test to ganeti-test1001 (bullseye node) T306499

Mentioned in SAL (#wikimedia-operations) [2022-04-28T10:56:51Z] <moritzm> failover Ganeti master in ganeti-test to ganeti-test2001 (bullseye node) T306499

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS bullseye completed:

  • ganeti-test2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204281415_jmm_682113_ganeti-test2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS bullseye completed:

  • ganeti-test2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204291213_jmm_848997_ganeti-test2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The following upgrade steps were done in the Ganeti test cluster for the 3.0 update:

We'll be keeping the "kvm:machine_version=pc-i440fx-2.8" KVM machine type, which was applied as part of the buster update for the eqiad/codfw/test clusters. QEMU keeps the i440fx base type fully backwards compatible and no feature changes are made to it (the main idea of this type is that it provides aged "hardware" which is well supported even by old OSes).

The Ganeti installs in esams/ulsfo/eqsin/drmrs still need to be migrated to a fixed machine type (but they can stick with "kvm:machine_version=pc-i440fx-3.1").

The upgrade procedure is:

  • sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-3.1, followed by a restart of all VMs via gnt-instance (only for esams/ulsfo/eqsin/drmrs)
  • Add component/ganeti3 for all nodes of a cluster via Hiera
  • apt-get install -y -o 'DPkg::Options::=--force-confold' ganeti=3.0.1-1~bpo10+1 (this installs Ganeti 3 in parallel, but doesn't change the currently running version)
  • sudo gnt-cluster verify
  • sudo gnt-cluster upgrade --to 3.0
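The command steps above can be sketched as a small shell helper. This is only an illustration, not the tooling used for the actual upgrade: the function names (run, pin_machine_type, install_ganeti3, upgrade_cluster) and the DRY_RUN switch are hypothetical, while the commands themselves are the ones listed in the procedure. By default nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Hypothetical wrapper around the upgrade commands listed in this task.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Only for esams/ulsfo/eqsin/drmrs: pin the KVM machine type.
# All VMs need a restart afterwards for the change to apply.
pin_machine_type() {
    run sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-3.1
}

# Run on every node: installs Ganeti 3 in parallel to the running version.
install_ganeti3() {
    run sudo apt-get install -y -o 'DPkg::Options::=--force-confold' 'ganeti=3.0.1-1~bpo10+1'
}

# Run once on the master, after all nodes have the new package installed.
upgrade_cluster() {
    run sudo gnt-cluster verify &&
    run sudo gnt-cluster upgrade --to 3.0
}
```

Sourcing the file and calling upgrade_cluster prints the two cluster-wide commands in order; setting DRY_RUN=0 would execute them instead, stopping if gnt-cluster verify fails.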

Following that, various smoke tests were done on the 3.0/buster cluster:

  • The master was failed over
  • A node was drained and rebooted
  • A cluster rebalance worked fine
  • A new instance testvm2004 was created and found to be working fine

Next, one node was emptied of primary/secondary instances:

  • sudo gnt-node migrate -f ganeti-test2001.codfw.wmnet
  • sudo gnt-node evacuate -s ganeti-test2001.codfw.wmnet
  • remove the node during the reimage: sudo gnt-node remove ganeti-test2001.codfw.wmnet
  • copy /etc/network/interfaces and cluster SSH keys
  • ganeti-test2001 was reimaged; after the reimage, /etc/network/interfaces was restored (the interface name changed between stretch/buster, though, which needs to be fixed up) and ssh_host_rsa_key/ssh_host_rsa_key.pub were synced
  • the ganeti VG was created and the node rebooted.
  • finally the node was re-added with the sre.ganeti.addnode cookbook
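The drain and removal commands from the list above could be wrapped as follows. This is a minimal sketch: drain_node, remove_node, run and DRY_RUN are hypothetical names chosen for illustration, and the steps without a command in this task (copying the network config and SSH keys, creating the VG, the addnode cookbook) are deliberately left out. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helpers for emptying a Ganeti node before a reimage.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Move primary instances away first, then the secondaries.
drain_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: drain_node <node-fqdn>" >&2
        return 2
    fi
    run sudo gnt-node migrate -f "$node" &&
    run sudo gnt-node evacuate -s "$node"
}

# Take the (now empty) node out of the cluster before reimaging it.
remove_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: remove_node <node-fqdn>" >&2
        return 2
    fi
    run sudo gnt-node remove "$node"
}
```

For example, drain_node ganeti-test2001.codfw.wmnet followed by remove_node ganeti-test2001.codfw.wmnet prints the three gnt-node commands in the order used here.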

That worked fine, but "gnt-node add" printed a message:

The certificate differs after being reencoded. Please renew the certificates cluster-wide to prevent future inconsistencies.

After some digging, this turned out to be a regression in the Python2 -> Python3 conversion that landed in Ganeti 3. I prepared a ganeti 3.0.1-2+deb11u0 update with the patch and will also work towards getting that fix into Bullseye 11.4. The message will still appear until the Ganeti master is updated to Bullseye, but we can ignore it for now; it's a short-lived issue.

Next, I ran hbal to rebalance the cluster, which moved one of the instances to the freshly reimaged node. That worked fine: the instance remained accessible and the console was also reachable.
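The exact hbal invocation is not recorded in this task. A plausible sketch, assuming the standard htools options -L (read cluster state from the master daemon via Luxi) and -X (execute the computed moves), looks like this; the helper names and the DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers around hbal; the flags are an assumption,
# the task itself only says that hbal was run.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Show the moves hbal would make, without executing them.
preview_rebalance() {
    run sudo hbal -L
}

# Compute the moves and execute them on the cluster.
rebalance_cluster() {
    run sudo hbal -L -X
}
```

Running the preview first makes it easy to see which instances would land on the reimaged node before committing to the moves.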

Then the master was failed over to the reimaged Bullseye host.
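The failover command is not spelled out in this task either. As a sketch, assuming the usual gnt-cluster getmaster and gnt-cluster master-failover subcommands, it would be run on the node that should become the new master; failover_master_here, run and DRY_RUN are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper; must be run on the node that is to become master.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

failover_master_here() {
    # Show the current master first, then take over the master role.
    run sudo gnt-cluster getmaster &&
    run sudo gnt-cluster master-failover
}
```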

A new instance testvm2005 was created and installed, which worked fine. It used a Buster node (ganeti-test2002) as the primary and the other Buster node (ganeti-test2003) as the secondary. The secondary was then moved to the Bullseye host (ganeti-test2001), which also worked fine, as did moving the primary.

Next, ganeti-test2002 was drained, reimaged to Bullseye and re-added to the cluster, which all worked fine. The same was done for ganeti-test2003. Finally, a last cluster rebalance was done.

So the update procedure is working correctly and other Ganeti clusters (initially the edges) can be upgraded next \o/