
Upgrade eqiad/codfw Ganeti clusters to Buster
Closed, ResolvedPublic

Description

The edge Ganeti clusters were set up with Buster from the start, but eqiad/codfw still need to be upgraded:

  • Import the last backport of ganeti 2.16.0-1~bpo9+1 to a repo component ganeti/216 (stretch)
  • Add the repo component to the Ganeti hosts
  • Upgrade a single node to 2.16 and test (verify, verify-disks, create new VM, start/stop, migrate VM back & forth)
  • Upgrade remaining stretch nodes to 2.16
  • Run gnt-cluster renew-crypto (for SHA256 compat) https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907569
  • Run gnt-cluster upgrade --to 2.16
  • Run gnt-cluster verify, verify-disks, migrations back and forth, stop/start VMs
  • For each stretch host (see the command sketch below):
    • Empty the node of VMs with gnt-node migrate / gnt-node evacuate -s
    • Reimage to buster (which also brings a newer qemu than before)
    • Re-add the node and test
    • Move the VMs back
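A rough sketch of the per-host drain/re-add commands (the node name is a placeholder; the reimage itself runs via the usual reimage cookbook):

# move primary instances off the node, then evacuate DRBD secondaries
sudo gnt-node migrate -f $NODE
sudo gnt-node evacuate -s $NODE
# after the buster reimage: re-add the node and sanity-check the cluster
sudo gnt-node add --readd $NODE
sudo gnt-cluster verify
sudo gnt-cluster verify-disks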

Event Timeline

Volans triaged this task as Medium priority. Jun 14 2021, 7:14 AM

Change 703699 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add component for Ganeti 2.16 backport for Stretch

https://gerrit.wikimedia.org/r/703699

Change 703699 merged by Muehlenhoff:

[operations/puppet@production] Add component for Ganeti 2.16 backport for Stretch

https://gerrit.wikimedia.org/r/703699

Mentioned in SAL (#wikimedia-operations) [2021-07-08T08:42:40Z] <moritzm> imported ganeti 2.16.0 for stretch-security/component/ganeti216 T284811

Mentioned in SAL (#wikimedia-operations) [2021-09-06T13:42:58Z] <moritzm> updated thirdparty/gitlab component to 14.0.10 T284811

akosiaris subscribed.

One word of warning. Make sure that the new version of qemu on buster is compatible with the old one (e.g. regarding migrations). It should be, but in the past there have been some pain points.

Change 719476 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add Hiera option to enable Ganeti 2.16 backport

https://gerrit.wikimedia.org/r/719476

Change 719476 merged by Muehlenhoff:

[operations/puppet@production] Add Hiera option to enable Ganeti 2.16 backport

https://gerrit.wikimedia.org/r/719476

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2002.codfw.wmnet

  • testvm2002.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtme it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

The tests happened with the ad hoc test cluster set up a few months ago (new hardware for a proper three-node test cluster was ordered, but the 10G NICs had a lead time of almost three months), so two servers which had been bought for capex were used for the tests. Having just two nodes doesn't allow testing some of the live migration setups, but the hardware for the actual test cluster has arrived in the meantime.

After the setup of the test cluster with two nodes running Stretch and Ganeti 2.15 the following steps were validated:

  • Installation of two test instances, tested to be working fine
  • Tests with node migration and instance migration were successful
  • Cluster rebalance was tested

Next the non-master node was updated:

  • Added the ganeti216 component to the non-master node
  • Ganeti was upgraded to 2.16 and the 2.15 packages were removed
  • "gnt-cluster verify" and "gnt-cluster verify-disks" on the master node worked fine
  • The instance running on the upgraded node continued to work fine
  • The upgraded node was emptied with gnt-node migrate -f and the instance which formerly ran there continued to work fine
  • A new instance was created on the upgraded node. This unveiled an issue when starting instances: the upgrade leaves dangling symlinks behind in /etc/ganeti. Once those were removed and recreated to point to Ganeti 2.16, the d-i installation worked fine:
sudo rm /etc/ganeti/lib  /etc/ganeti/share
sudo ln -s /usr/lib/ganeti/2.16 /etc/ganeti/lib
sudo ln -s /usr/share/ganeti/2.16 /etc/ganeti/share
  • A "gnt-instance reboot" of the newly created instance went fine
  • The node running 2.16 was emptied and all primary instances were moved to the 2.15 master. All instances continued to work fine.
  • The node running 2.15 was emptied and all primary instances were moved to the 2.16 node. All instances continued to work fine.
  • With the instances migrated to the 2.16 node, "gnt-cluster verify" threw a few errors caused by the version mismatch:
Fri Oct  1 14:47:50 2021 * Verifying hypervisor parameters
Fri Oct  1 14:47:50 2021   - ERROR: cluster: hypervisor cluster parameters syntax check (source kvm): Unknown parameter 'kvm_pci_reservations'
Fri Oct  1 14:47:50 2021   - ERROR: cluster: hypervisor instance testvm2002.codfw.wmnet parameters syntax check (source kvm): Unknown parameter 'use_guest_agent'
Fri Oct  1 14:47:50 2021   - ERROR: cluster: hypervisor instance testvm2003.codfw.wmnet parameters syntax check (source kvm): Unknown parameter 'use_guest_agent'
Fri Oct  1 14:47:50 2021   - ERROR: cluster: hypervisor instance testvm2001.codfw.wmnet parameters syntax check (source kvm): Unknown parameter 'use_guest_agent'

There are no functional issues, though.

I also tried a master failover to the 2.16 node, but that caused a number of issues (which were easy to roll back), so we have to keep in mind that no master failovers can be done while the updates are in flight:

Fri Oct  1 14:45:04 2021 * Gathering information about nodes (2 nodes)
Fri Oct  1 14:45:05 2021 * Gathering disk information (2 nodes)
Fri Oct  1 14:45:06 2021 * Verifying configuration file consistency
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2025.codfw.wmnet: Node did not return file checksum data
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2026.codfw.wmnet: Node did not return file checksum data
Fri Oct  1 14:45:06 2021 * Verifying node status
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2025.codfw.wmnet: while contacting node: Error while executing backend function: need more than 3 values to unpack
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2026.codfw.wmnet: while contacting node: Error while executing backend function: need more than 3 values to unpack
Fri Oct  1 14:45:06 2021 * Verifying instance status
Fri Oct  1 14:45:06 2021   - ERROR: instance testvm2002.codfw.wmnet: instance not running on its primary node ganeti2026.codfw.wmnet
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2026.codfw.wmnet: instance testvm2002.codfw.wmnet, connection to primary node failed
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2025.codfw.wmnet: instance testvm2002.codfw.wmnet, connection to secondary node failed
Fri Oct  1 14:45:06 2021   - ERROR: instance testvm2003.codfw.wmnet: instance not running on its primary node ganeti2026.codfw.wmnet
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2026.codfw.wmnet: instance testvm2003.codfw.wmnet, connection to primary node failed
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2025.codfw.wmnet: instance testvm2003.codfw.wmnet, connection to secondary node failed
Fri Oct  1 14:45:06 2021   - ERROR: instance testvm2001.codfw.wmnet: instance not running on its primary node ganeti2026.codfw.wmnet
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2026.codfw.wmnet: instance testvm2001.codfw.wmnet, connection to primary node failed
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2025.codfw.wmnet: instance testvm2001.codfw.wmnet, connection to secondary node failed
Fri Oct  1 14:45:06 2021 * Verifying orphan volumes
Fri Oct  1 14:45:06 2021 * Verifying N+1 Memory redundancy
Fri Oct  1 14:45:06 2021   - ERROR: node ganeti2025.codfw.wmnet: not enough memory to accomodate instance failovers should node ganeti2026.codfw.wmnet fail (9216MiB needed, 0MiB available)
Fri Oct  1 14:45:06 2021 * Other Notes
Fri Oct  1 14:45:06 2021 * Hooks Results
Next the cluster certificates were renewed as per the checklist (for SHA256 compatibility):

$ sudo gnt-cluster renew-crypto --new-cluster-certificate
Updating certificates now. Running "gnt-cluster verify"  is recommended after this operation.
This requires all daemons on all nodes to be restarted and may take
some time. Continue?
y/[n]/?: y
Gathering cluster information
Blocking watcher
Stopping master daemons
Stopping daemons on ganeti2025.codfw.wmnet
Starting daemon 'ganeti-noded' on ganeti2025.codfw.wmnet
Starting daemon 'ganeti-wconfd' on ganeti2025.codfw.wmnet
Stopping daemons on ganeti2026.codfw.wmnet
Updating the cluster SSL certificate.
Copying /var/lib/ganeti/server.pem to ganeti2026.codfw.wmnet:22
Updating client SSL certificates.
Copying /var/lib/ganeti/ssconf_master_candidates_certs to ganeti2026.codfw.wmnet:22
Starting daemons on ganeti2026.codfw.wmnet
Stopping daemon 'ganeti-wconfd' on ganeti2025.codfw.wmnet
Stopping daemon 'ganeti-noded' on ganeti2025.codfw.wmnet
Starting daemons on ganeti2025.codfw.wmnet
Mon Oct  4 09:57:30 2021 Renewing Node SSL certificates
All requested certificates and keys have been replaced. Running "gnt-cluster verify" now is recommended.

Next gnt-cluster upgrade --to 2.16 was run on the master node.

  • All three instances continued to be accessible
  • Following that "gnt-cluster verify" and "gnt-cluster verify-disks" worked fine again
  • A cluster rebalance was tested and worked fine
  • A master failover worked fine
  • The non-master node was emptied of instances and eventually rebooted which worked fine
  • A VM was migrated to the rebooted node
  • testvm2003.codfw was removed using the decommission cookbook, which worked fine
  • A new testvm2004 was created using the makevm cookbook. This unveiled an unusual error: on instance creation it failed to communicate with ganeti-metad:
Wed Oct  6 12:26:51 2021  - INFO: Waiting for instance testvm2004.codfw.wmnet to sync disks
Wed Oct  6 12:26:51 2021  - INFO: Instance testvm2004.codfw.wmnet's disks are in sync
Wed Oct  6 12:26:51 2021 Could not update metadata for instance 'testvm2004.codfw.wmnet': Error while executing backend function: Failed to start metadata daemon

And in fact, after the upgrade ganeti-metad isn't running (we don't have Icinga monitoring for this Ganeti daemon since it's of little importance). This turned out to be related to the ownership of its log file; on Ganeti 2.15 /var/log/ganeti/meta-daemon.log is owned by root:

-rw-r----- 1 root gnt-daemons 121 Oct  3 06:25 meta-daemon.log

But comparing this with a Ganeti server in esams (which was directly installed with 2.16 on buster), the file is owned by gnt-metad:

-rw-r----- 1 gnt-metad gnt-metad 121 Oct  3 00:00 meta-daemon.log

The gnt-metad user is also present on the 2.16 host after an update, so this seems to be missing in the upgrade scripts. As such, I ran

sudo cumin A:ganeti-test 'chown gnt-metad /var/log/ganeti/meta-daemon.log'
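To double-check the ownership change across the test nodes before retrying an instance creation, something like the following can be used (just a sanity check, not part of the procedure itself):

sudo cumin A:ganeti-test 'ls -l /var/log/ganeti/meta-daemon.log'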

To confirm the fix I created a new testvm2005, which worked as expected:

Thu Oct  7 09:17:18 2021  - INFO: - device disk/0: 100.00% done, 0s remaining (estimated)
Thu Oct  7 09:17:19 2021  - INFO: Instance testvm2005.codfw.wmnet's disks are in sync
Thu Oct  7 09:17:19 2021  - INFO: Waiting for instance testvm2005.codfw.wmnet to sync disks
Thu Oct  7 09:17:19 2021  - INFO: Instance testvm2005.codfw.wmnet's disks are in sync

The next step to test was the reimage to Buster (which also ships Ganeti 2.16, but on a fresher OS stack, notably with a more recent KVM/qemu). Since the test cluster comprised only two nodes, no full evacuation of a node could be done (there are no separate nodes to evacuate the primary _and_ secondary instances to).

As a workaround, the instances were first migrated off ganeti2025 (onto their secondary node) with gnt-node migrate -f ganeti2025.codfw.wmnet

Subsequently DRBD was disabled by converting the instances to a plain disk layout using:

sudo gnt-instance stop testvm2001.codfw.wmnet
sudo gnt-instance modify -t plain testvm2001.codfw.wmnet
sudo gnt-instance start testvm2001.codfw.wmnet

(and similar for the other instances)

Since we have a few instances running in plain mode (etcd nodes), we'll need to apply a similar scheme when reimaging hosts: they'll need to be temporarily switched to DRBD so that we can migrate them away from the node being reimaged, and then be switched back to "plain".

This caused some issues: after changing the type I couldn't connect to the instances (gnt-instance console got stuck, with a defunct kvm process lingering around). But since the DRBD->plain->DRBD step was a detour only needed for the ad hoc cluster, I ignored this at first. The bug was debugged and fixed later, see below.

Next ganeti2025 (which was empty of instances) was reimaged. I didn't gnt-node remove the server before reimaging; in hindsight I should have done that, but it didn't make a difference since there's a --readd flag for "gnt-node add".

The reimage hung. When debugging I discovered a bug in the puppetisation which didn't restrict the ganeti216 component to stretch (Buster ships ganeti 2.16 natively), which caused Ganeti not to be installed correctly on the host. After merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/730154 the Puppet run completed and I could re-add the node. (Following the reimage, the ganeti VG needs to be re-created, the network bridges set up and ssh_host_rsa_key/ssh_host_rsa_key.pub/known_hosts synced.)

When running gnt-cluster verify now, the only remaining warnings are about the DRBD version (which is fine since the protocol version is identical) and the three instances being in plain state:

$ sudo gnt-cluster verify
Submitted jobs 21827, 21828
Waiting for job 21827 ...
Tue Oct 12 12:28:53 2021 * Verifying cluster config
Tue Oct 12 12:28:53 2021 * Verifying cluster certificate files
Tue Oct 12 12:28:53 2021 * Verifying hypervisor parameters
Tue Oct 12 12:28:53 2021 * Verifying all nodes belong to an existing group
Waiting for job 21828 ...
Tue Oct 12 12:28:54 2021 * Verifying group 'row_D'
Tue Oct 12 12:28:54 2021 * Gathering data (2 nodes)
Tue Oct 12 12:28:54 2021 * Gathering information about nodes (2 nodes)
Tue Oct 12 12:28:57 2021 * Gathering disk information (2 nodes)
Tue Oct 12 12:28:57 2021 * Verifying configuration file consistency
Tue Oct 12 12:28:57 2021 * Verifying node status
Tue Oct 12 12:28:57 2021   - WARNING: node b1ee3d81-4218-442e-a488-3748b7e2cd70: DRBD version mismatch: 8.4.7 (api:1/proto:86-101)
Tue Oct 12 12:28:57 2021   - WARNING: node efca263b-9242-4d15-b0eb-49c0d9c4425f: DRBD version mismatch: 8.4.10 (api:1/proto:86-101)
Tue Oct 12 12:28:57 2021 * Verifying instance status
Tue Oct 12 12:28:57 2021 * Verifying orphan volumes
Tue Oct 12 12:28:57 2021 * Verifying N+1 Memory redundancy
Tue Oct 12 12:28:57 2021 * Other Notes
Tue Oct 12 12:28:57 2021   - NOTICE: 3 non-redundant instance(s) found.
Tue Oct 12 12:28:57 2021 * Hooks Results

Next I created a new VM testvm2006 with the makevm cookbook, which went fine. The console still couldn't be accessed (e.g. for the installer) via gnt-instance console, though (but see below).

Following that, the master was failed over to the host which had been reimaged to buster. I tried to live-migrate testvm2006 to ganeti2026 (which was still on stretch), but that failed:

$sudo gnt-instance list
Instance               Hypervisor OS                  Primary_node           Status     Memory
testvm2001.codfw.wmnet kvm        debootstrap+default ganeti2026.codfw.wmnet ADMIN_down      -
testvm2002.codfw.wmnet kvm        debootstrap+default ganeti2026.codfw.wmnet running      1.0G
testvm2005.codfw.wmnet kvm        debootstrap+default ganeti2026.codfw.wmnet running      1.0G
testvm2006.codfw.wmnet kvm        debootstrap+default ganeti2025.codfw.wmnet running      2.0G
root@ganeti2025:/home/jmm# sudo gnt-instance migrate -f  testvm2006.codfw.wmnet
Wed Oct 13 13:12:57 2021 Migrating instance testvm2006.codfw.wmnet
Wed Oct 13 13:12:57 2021 * warning: hypervisor version mismatch between source ([3, 1, 0]) and target ([2, 8, 1]) node
Wed Oct 13 13:12:57 2021   migrating from hypervisor version [3, 1, 0] to [2, 8, 1] should be safe
Wed Oct 13 13:12:57 2021 * checking disk consistency between source and target
Wed Oct 13 13:12:57 2021 * closing instance disks on node ganeti2026.codfw.wmnet
Wed Oct 13 13:12:57 2021 * changing into standalone mode
Wed Oct 13 13:12:58 2021 * changing disks into dual-master mode
Wed Oct 13 13:12:59 2021 * wait until resync is done
Wed Oct 13 13:12:59 2021 * opening instance disks on node ganeti2025.codfw.wmnet in shared mode
Wed Oct 13 13:12:59 2021 * opening instance disks on node ganeti2026.codfw.wmnet in shared mode
Wed Oct 13 13:12:59 2021 * preparing ganeti2026.codfw.wmnet to accept the instance
Wed Oct 13 13:13:00 2021 Pre-migration failed, aborting
Wed Oct 13 13:13:00 2021 * closing instance disks on node ganeti2026.codfw.wmnet
Wed Oct 13 13:13:00 2021 * changing into standalone mode
Wed Oct 13 13:13:00 2021 * changing disks into single-master mode
Wed Oct 13 13:13:01 2021 * wait until resync is done
Failure: command execution error:
Could not pre-migrate instance testvm2006.codfw.wmnet: Failed to accept instance: Failed to start instance testvm2006.codfw.wmnet: exited with exit code 1 (qemu-system-x86_64: -machine pc-i440fx-3.1,accel=kvm: unsupported machine type
Use -machine help to list supported machines
)

My first assumption was that the Ganeti stretch backport we used (the last version available in stretch-backports in Debian: https://tracker.debian.org/news/989520/accepted-ganeti-2160-1bpo91-source-all-amd64-into-stretch-backports-stretch-backports/ ) lacked Apollon's compat patches for QEMU 3.1 (the version in Debian buster): https://github.com/ganeti/ganeti/pull/1342. These only landed in Debian in 2.16.0-5, so they were missing in the stretch-backports package. I added the patches to a local ganeti 2.16 stretch-wikimedia backport, but that didn't make a difference.

Comparing supported machine types with "qemu-system-x86_64 -machine help" shows that qemu 2.8 supports:

pc                   Standard PC (i440FX + PIIX, 1996) (alias of pc-i440fx-2.8)
pc-i440fx-2.8        Standard PC (i440FX + PIIX, 1996) (default)
pc-i440fx-2.7        Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.6        Standard PC (i440FX + PIIX, 1996)
(and more...)

While 3.1 supports and defaults to:

pc                   Standard PC (i440FX + PIIX, 1996) (alias of pc-i440fx-3.1)
pc-i440fx-3.1        Standard PC (i440FX + PIIX, 1996) (default)
pc-i440fx-3.0        Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.9        Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.8        Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.7        Standard PC (i440FX + PIIX, 1996)
pc-i440fx-2.6        Standard PC (i440FX + PIIX, 1996)
(and more...)

Alex pointed out that this is configurable with gnt-cluster and in fact that did the trick:

sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8

This will require a restart of all instances (via gnt-instance reboot FOO, not from within the OS) to effect the change, so there's a window after the update where we'll need to reboot all affected instances before live migrations work again.

After a "gnt-instance reboot testvm2006.codfw.wmnet" the migration was in fact successful:

$ sudo gnt-instance migrate -f  testvm2006.codfw.wmnet
Thu Oct 14 09:57:24 2021 Migrating instance testvm2006.codfw.wmnet
Thu Oct 14 09:57:24 2021 * warning: hypervisor version mismatch between source ([3, 1, 0]) and target ([2, 8, 1]) node
Thu Oct 14 09:57:24 2021   migrating from hypervisor version [3, 1, 0] to [2, 8, 1] should be safe
Thu Oct 14 09:57:24 2021 * checking disk consistency between source and target
Thu Oct 14 09:57:25 2021 * closing instance disks on node ganeti2026.codfw.wmnet
Thu Oct 14 09:57:25 2021 * changing into standalone mode
Thu Oct 14 09:57:25 2021 * changing disks into dual-master mode
Thu Oct 14 09:57:26 2021 * wait until resync is done
Thu Oct 14 09:57:26 2021 * opening instance disks on node ganeti2025.codfw.wmnet in shared mode
Thu Oct 14 09:57:27 2021 * opening instance disks on node ganeti2026.codfw.wmnet in shared mode
Thu Oct 14 09:57:27 2021 * preparing ganeti2026.codfw.wmnet to accept the instance
Thu Oct 14 09:57:27 2021 * migrating instance to ganeti2026.codfw.wmnet
Thu Oct 14 09:57:27 2021 * starting memory transfer
Thu Oct 14 09:57:28 2021 * memory transfer complete
Thu Oct 14 09:57:29 2021 * closing instance disks on node ganeti2025.codfw.wmnet
Thu Oct 14 09:57:29 2021 * wait until resync is done
Thu Oct 14 09:57:29 2021 * changing into standalone mode
Thu Oct 14 09:57:29 2021 * changing disks into single-master mode
Thu Oct 14 09:57:31 2021 * wait until resync is done
Thu Oct 14 09:57:31 2021 * done
  • The other three instances (testvm2001, testvm2002, testvm2005) were also rebooted.
  • Next the second node still running Stretch was emptied of instances (two instances had a "plain" layout, so they remained on the host):
sudo gnt-node migrate -f ganeti2026.codfw.wmnet
  • Next testvm2005 was reverted back to "plain" and restarted. Then I deleted all testvm instances except testvm2005 (since there was no point in also shifting them over to the buster node for this test).
  • With ganeti2026 now being empty, the next step was to reimage it to buster as well. First the node was removed with gnt-node remove ganeti2026.codfw.wmnet and eventually it was reimaged with the reimage cookbook.

(Following the reimage, the ganeti VG needs to be re-created and the network bridges set up (along with a reboot). Then ssh_host_rsa_key/ssh_host_rsa_key.pub/known_hosts were synced from the master and the node was added with the sre.ganeti.addnode cookbook.)

  • But even now that both nodes were running stock Ganeti 2.16 from Buster and the corresponding QEMU/KVM, gnt-instance console was still stuck for testvm2005. Next I tried to add a new instance, which worked fine, but again gnt-instance console froze, even though the whole cluster was now completely on Buster (and we have working Buster setups already).
  • That was very odd, since a transitional issue between Stretch and Buster could be ruled out at this point.
  • I compared a Buster node which was working fine (ganeti3001) with ganeti2025/2026. One difference was that netcat on the working node was provided by netcat-openbsd, while on 2025/2026 it was provided by netcat-traditional. But switching 2025/2026 to netcat-openbsd didn't make a difference.
  • Another difference was that ganeti/esams was running Linux 4.19.171, while 2025/2026, being freshly installed, were running 4.19.208 (so this could have been a kernel regression which wasn't noticed on Buster, since we weren't running the latest kernel on Ganeti/esams yet). As such, I reverted 2025/2026 to a 4.19.171 kernel, but that didn't make a difference either.
  • Under the hood, gnt-instance console invokes /usr/lib/ganeti/tools/kvm-console-wrapper via SSH from the master node; the full command can be displayed with:
gnt-instance console --show-cmd $INSTANCE

To rule out any remote invocation shenanigans I ran the kvm-console-wrapper command locally, but it also got stuck. Looking closer, kvm-console-wrapper is just a small wrapper which sorts out concurrent access to the "serial" console and eventually invokes socat, e.g.:

/usr/bin/socat STDIO,raw,echo=0,escape=0x1d UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/testvm2005.codfw.wmnet.serial

socat is a proven low-level tool in which no issues were expected, but after fruitlessly debugging other angles I eventually returned to looking into a potential socat issue, since I had noticed some ioctl failures in gdb. To rule out this source of error, I built a backport of socat 1.7.4.1 from bullseye. This was mostly a shot in the dark, but updating socat to 1.7.4.1 immediately reinstated console access! This is quite strange, since we're not seeing this error on the existing buster Ganeti installs.

After some more digging I also found https://bugs.launchpad.net/ubuntu/+source/socat/+bug/1883957, which describes a similar bug (but the version in Ubuntu focal is 1.7.3.3, while Buster has 1.7.3.2, and http://www.dest-unreach.org/socat/doc/CHANGES claims a 1.7.3.3-specific bug). At this point I didn't poke further and will upload a backport to the ganeti216 component.

Now that the hardware for the actual three-node Ganeti test cluster has arrived, I'll take down ganeti2025/ganeti2026 and install the proper test cluster. Then I'll retest the upgrades on the three-node cluster with the final update procedure/packages, and after that codfw will be the first DC to upgrade.

Wow, that's a very detailed writeup. Thanks! Couple of comments inline:

This will require a restart of all instances (via gnt-instance reboot FOO, not from within the OS) to effect the change, so there's a window after the update where we'll need to reboot all affected instances before live migrations work again.

We don't have to wait for the update. We can do sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8 on the current clusters, restart all the VMs (probably in some kernel upgrade window) and stick with that version for a while regardless of the underlying ganeti/kernel/OS version. That should decouple the machine_version migration (which should anyway be done) and allow it to happen later in time (probably again piggybacking on some kernel upgrade window).
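Concretely, something along these lines could be run ahead of the upgrade (a sketch only; $INSTANCE stands for each VM on the cluster, rebooted during a suitable maintenance window):

# pin the machine type cluster-wide (pc-i440fx-2.8 is the qemu 2.8 default on stretch)
sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8
# then restart each instance on the Ganeti level so the new machine type takes effect
sudo gnt-instance reboot $INSTANCE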

A new instance was created on the upgraded node. This unveiled an issue when starting instances; there's a dangling symlink left behind. Once those were removed and recreated to point to ganeti 2.16 the d-i installation worked fine:

Ouch. I did not expect that. Should we report it upstream to get it fixed? Also, are we going to address it via some script or via some docs for our upgrade? Same question goes for /var/log/ganeti/meta-daemon.log.

Since we have a few instances running in plain mode (etcd nodes), we'll need to apply a similar scheme when reimaging hosts: They'll need to be temporarily switched to DRBD so that we can migrate them away from nodes being reimaged and eventually switched back to "plain".

Note that the main reason we are in plain mode for those nodes is that etcd is very sensitive to the extra latency added by DRBD in I/O. So, there is a possibility we are going to see an alert for high latencies while performing this operation. To avoid that, it's ok to shutdown the instance for the entirety of the process.

After some more digging I also found https://bugs.launchpad.net/ubuntu/+source/socat/+bug/1883957 which describes a similar bug (but the version in Ubuntu focal is 1.7.3.3, while Buster has 1.7.3.2 and http://www.dest-unreach.org/socat/doc/CHANGES claims a 1.7.3.3 specific bug. At this point I didn't poke further and will upload a backport to the ganeti216 component.

Nice catch and fix!

The rest LGTM.

Thanks for doublechecking the steps!

We don't have to wait for the update. We can do sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8 on the current clusters, restart all the VMs (probably in some kernel upgrade window) and stick with that version for a while regardless of the underlying ganeti/kernel/OS version. That should decouple the machine_version migration (which should anyway be done) and allow it to happen later in time (probably again piggybacking on some kernel upgrade window).

Ah sure, I hadn't considered that; that sounds like a good idea to decouple it! I'll create a subtask for this.

A new instance was created on the upgraded node. This unveiled an issue when starting instances; there's a dangling symlink left behind. Once those were removed and recreated to point to ganeti 2.16 the d-i installation worked fine:

Ouch. I did not expect that. Should we report it upstream to get it fixed? Also, are we going to address it via some script or via some docs for our upgrade? Same question goes for /var/log/ganeti/meta daemon.log.

As part of the upgrade I'll simply run these en bloc via Cumin. The symlink setup is transitional and will vanish with the Buster reimages anyway, but the meta-daemon.log permissions will be fixed via Puppet in an ongoing manner.

I'd also like to report these upstream, but first need to make sure this is still an issue with current 3.x releases (Debian and upstream). Once the Ganeti Buster updates are completed in eqiad/codfw I plan to reinstall the test cluster with the Ganeti 3 packages from Bullseye; that should give an idea of whether it still needs fixing.

Since we have a few instances running in plain mode (etcd nodes), we'll need to apply a similar scheme when reimaging hosts: They'll need to be temporarily switched to DRBD so that we can migrate them away from nodes being reimaged and eventually switched back to "plain".

Note that the main reason we are in plain mode for those nodes is that etcd is very sensitive to the extra latency added by DRBD in I/O. So, there is a possibility we are going to see an alert for high latencies while performing this operation. To avoid that, it's ok to shutdown the instance for the entirety of the process.

But with the reimage, just shutting them down means we'd lose the VMs? So I think we can either briefly transition them to DRBD for the instance migration (and silence alerts during that) or alternatively create new etcd nodes and remove the old ones one at a time, but that seems like much more overhead in comparison?

Since we have a few instances running in plain mode (etcd nodes), we'll need to apply a similar scheme when reimaging hosts: They'll need to be temporarily switched to DRBD so that we can migrate them away from nodes being reimaged and eventually switched back to "plain".

Note that the main reason we are in plain mode for those nodes is that etcd is very sensitive to the extra latency added by DRBD in I/O. So, there is a possibility we are going to see an alert for high latencies while performing this operation. To avoid that, it's ok to shutdown the instance for the entirety of the process.

But with the reimage, just shutting them down means we'd lose the VMs? So I think we can either briefly transition them to DRBD for the instance migration (and silence alerts during that) or alternatively create new etcd nodes and remove the old ones one at a time, but that seems like much more overhead in comparison?

Sorry, I should have been clearer. I meant, a) shut them down b) switch the disk template to DRBD c) failover/migrate the VM to the new node d) switch again to plain e) power them up again. That way for the duration of the process they will not be part of the cluster and will not cause latency increases and alerts.

I did consider the "create new ones" solution too, but as you point out it's much more overhead and makes this migration more complicated without any significant gain.

But with the reimage, just shutting them down means we'd lose the VMs? So I think we can either briefly transition them to DRBD for the instance migration (and silence alerts during that) or alternatively create new etcd nodes and remove the old ones one at a time, but that seems like much more overhead in comparison?

Sorry, I should have been clearer. I meant, a) shut them down b) switch the disk template to DRBD c) failover/migrate the VM to the new node d) switch again to plain e) power them up again.

Ah, I get it now. Sounds good, let's do that.
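For reference, the agreed per-instance sequence translates roughly to the following (a sketch with placeholder names; $OTHER_NODE is the node the instance moves to):

sudo gnt-instance stop $INSTANCE
# converting to DRBD needs a secondary node for the new disk copy
sudo gnt-instance modify -t drbd -n $OTHER_NODE $INSTANCE
# move the (stopped) instance off the node that is about to be reimaged
sudo gnt-instance failover $INSTANCE
# back to plain on the new node, then power it up again
sudo gnt-instance modify -t plain $INSTANCE
sudo gnt-instance start $INSTANCE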

Mentioned in SAL (#wikimedia-operations) [2021-11-02T19:50:16Z] <moritzm> imported ganeti 2.16.0-1~bpo9+1+wmf1to component/ganeti216 for stretch-wikimedia (with additional cherrypicked patches for compat with KVM 3.1) T284811

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2002.codfw.wmnet with OS buster completed:

  • ganeti-test2002 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111180953_jmm_734362_ganeti-test2002.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2021-11-18T13:22:23Z] <moritzm> failover ganeti master in test cluster to ganeti-test2002 T284811

Change 740104 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] New cookbook to reboot a VM on the Ganeti level

https://gerrit.wikimedia.org/r/740104

Mentioned in SAL (#wikimedia-operations) [2021-11-19T13:23:05Z] <moritzm> draining instances from ganeti-test2001 for reimage T284811

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2003.codfw.wmnet with OS buster completed:

  • ganeti-test2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111221041_jmm_1439437_ganeti-test2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The following upgrade steps were taken towards 2.16. After going through the upgrade again, it turns out the procedure attempted in the earlier test was needlessly complex: we can instead rely on the co-installability of Ganeti packages and let Ganeti switch over the active version via the upgrade itself. As part of this upgrade test I also validated that setting the KVM machine type to "kvm:machine_version=pc-i440fx-2.8" early on (before the first machine was created) fixes the live migration issues between QEMU 2.8 and 3.1 seen in the earlier test.

As such the much simpler upgrade procedure is:

  • sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8 (before VMs were created)
  • component/ganeti216 was added to all three nodes
  • sudo apt-get install -y ganeti (this installs Ganeti 2.16 in parallel, but doesn't change the currently running version)
  • sudo cumin A:ganeti-test 'chown gnt-metad /var/log/ganeti/meta-daemon.log' (to address the permission issue preventing the start of metad seen before)
  • sudo gnt-cluster verify
  • sudo gnt-cluster renew-crypto --new-cluster-certificate
  • sudo gnt-cluster verify
  • sudo gnt-cluster upgrade --to 2.16
  • sudo gnt-cluster verify

Following that various smoke tests were done with the 2.16/stretch cluster:

  • A new instance testvm2002 was created and worked fine
  • The master was failed over
  • A node was drained and rebooted

Next one node was emptied of primary/secondary instances (a consolidated sketch of these steps follows the list):

  • gnt-node migrate -f ganeti-test2002.codfw.wmnet
  • gnt-node evacuate -s ganeti-test2002.codfw.wmnet
  • remove the node during the reimage: gnt-node remove ganeti-test2002.codfw.wmnet
  • copy /etc/network/interfaces
  • ganeti-test2002 was reimaged and afterwards /etc/network/interfaces was restored (the interface name changed between stretch and buster, though, which needs to be fixed up) and ssh_host_rsa_key/ssh_host_rsa_key.pub synced
  • the ganeti VG was created and the node rebooted.
  • Finally the node was re-added with the sre.ganeti.addnode cookbook
  • Next I ran hbal to rebalance the cluster, which moved one of the instances to the freshly reimaged node. That worked fine; the instance remained accessible and the console was also reachable.
  • Then the master was failed over to the reimaged Buster host
  • A new instance testvm2003 was created and installed which worked fine. It used a Stretch node (ganeti-test2001) for the primary. Next ganeti-test2001 was drained of instances, which also worked fine.
  • ganeti-test2001 was reimaged and readded back to the cluster
  • ganeti-test2003 was drained, reimaged and added back to the cluster
  • A final cluster rebalance worked fine
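Consolidated, the per-node reimage boils down to roughly the following (a sketch; $NODE is a placeholder and the reimage/re-add run via the respective cookbooks):

# drain primary instances and DRBD secondaries, then drop the node from the cluster
sudo gnt-node migrate -f $NODE
sudo gnt-node evacuate -s $NODE
sudo gnt-node remove $NODE
# save /etc/network/interfaces, reimage to buster, restore the file (fixing up the
# renamed interface), sync ssh_host_rsa_key/ssh_host_rsa_key.pub, recreate the
# ganeti VG and reboot, then re-add the node with the sre.ganeti.addnode cookbook
sudo gnt-cluster verify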

This looks promising. The next step is to deploy T294119 and after that the codfw cluster can get migrated (and subsequently eqiad).

Change 740104 merged by Muehlenhoff:

[operations/cookbooks@master] New cookbook to reboot a VM on the Ganeti level

https://gerrit.wikimedia.org/r/740104

Change 742499 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Disable cluster rebalances temporarily

https://gerrit.wikimedia.org/r/742499

Change 742499 merged by Muehlenhoff:

[operations/puppet@production] Disable cluster rebalances temporarily

https://gerrit.wikimedia.org/r/742499

I don't know if this is a result of this ticket or something unrelated, but there is a lot of root@ spam with:

Cluster configuration incomplete: 'Can't read ssconf file /var/lib/ganeti/ssconf_master_node: [Errno 2] No such file or directory: '/var/lib/ganeti/ssconf_master_node''

Can you take a look or should I create a separate ticket?

Change 765202 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Revert "Disable cluster rebalances temporarily"

https://gerrit.wikimedia.org/r/765202

Change 765202 merged by Muehlenhoff:

[operations/puppet@production] Revert "Disable cluster rebalances temporarily"

https://gerrit.wikimedia.org/r/765202

MoritzMuehlenhoff claimed this task.

Both main Ganeti clusters have been upgraded to Buster.