Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8
Closed, Resolved (Public)

Description

In preparation for the Buster update, we need to switch the KVM machine type for the Ganeti cluster to pc-i440fx-2.8 (the hardware provided by qemu 2.8). Otherwise we wouldn't be able to migrate machines between Stretch and Buster nodes (Buster nodes would default to pc-i440fx-3.1). Once the Buster migration is complete, we will switch back to pc-i440fx-3.1.

This requires the following steps:

  1. sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8
  2. Restart all instances of the cluster. A reboot from within the OS isn't sufficient; each instance needs to be rebooted at the Ganeti level so that the KVM process gets restarted (comparable to resetting a computer with the power button). There's a new cookbook for this: sre.ganeti.reboot-vm
  • acmechief1001.eqiad.wmnet
  • acmechief-test1001.eqiad.wmnet
  • an-airflow1001.eqiad.wmnet
  • an-airflow1002.eqiad.wmnet
  • an-airflow1003.eqiad.wmnet
  • an-test-client1001.eqiad.wmnet
  • an-test-druid1001.eqiad.wmnet
  • an-test-presto1001.eqiad.wmnet
  • an-test-ui1001.eqiad.wmnet
  • an-tool1005.eqiad.wmnet
  • an-tool1007.eqiad.wmnet
  • an-tool1008.eqiad.wmnet
  • an-tool1009.eqiad.wmnet
  • aphlict1001.eqiad.wmnet
  • apt1001.wikimedia.org
  • archiva1002.wikimedia.org
  • chartmuseum1001.eqiad.wmnet
  • cloudbackup1001-dev.eqiad.wmnet
  • cloudbackup1002-dev.eqiad.wmnet
  • cuminunpriv1001.eqiad.wmnet
  • d-i-test.eqiad.wmnet
  • dbmonitor1002.wikimedia.org
  • dborch1001.wikimedia.org
  • debmonitor1002.eqiad.wmnet
  • doc1001.eqiad.wmnet
  • doc1002.eqiad.wmnet
  • doh1001.wikimedia.org
  • doh1002.wikimedia.org
  • dragonfly-supernode1001.eqiad.wmnet
  • durum1001.eqiad.wmnet
  • durum1002.eqiad.wmnet
  • etherpad1002.eqiad.wmnet
  • eventlog1003.eqiad.wmnet
  • failoid1002.eqiad.wmnet
  • flowspec1001.eqiad.wmnet
  • gitlab1001.wikimedia.org
  • gitlab-runner1001.eqiad.wmnet
  • grafana1002.eqiad.wmnet
  • idp1001.wikimedia.org
  • idp-test1001.wikimedia.org
  • install1003.wikimedia.org
  • irc1001.wikimedia.org
  • kafka-test1006.eqiad.wmnet
  • kafka-test1007.eqiad.wmnet
  • kafka-test1008.eqiad.wmnet
  • kafka-test1009.eqiad.wmnet
  • kafka-test1010.eqiad.wmnet
  • kafkamon1002.eqiad.wmnet
  • kubemaster1001.eqiad.wmnet
  • kubemaster1002.eqiad.wmnet
  • kubernetes1005.eqiad.wmnet
  • kubernetes1006.eqiad.wmnet
  • kubernetes1015.eqiad.wmnet
  • kubernetes1016.eqiad.wmnet
  • kubestagemaster1001.eqiad.wmnet
  • kubestagetcd1004.eqiad.wmnet
  • kubestagetcd1005.eqiad.wmnet
  • kubestagetcd1006.eqiad.wmnet
  • kubetcd1004.eqiad.wmnet
  • kubetcd1005.eqiad.wmnet
  • kubetcd1006.eqiad.wmnet
  • ldap-corp1001.wikimedia.org
  • ldap-replica1003.wikimedia.org
  • ldap-replica1004.wikimedia.org
  • lists1001.wikimedia.org
  • logstash1007.eqiad.wmnet
  • logstash1008.eqiad.wmnet
  • logstash1009.eqiad.wmnet
  • logstash1023.eqiad.wmnet
  • logstash1024.eqiad.wmnet
  • logstash1025.eqiad.wmnet
  • logstash1030.eqiad.wmnet
  • logstash1031.eqiad.wmnet
  • logstash1032.eqiad.wmnet
  • matomo1002.eqiad.wmnet
  • miscweb1002.eqiad.wmnet
  • ml-etcd1001.eqiad.wmnet
  • ml-etcd1002.eqiad.wmnet
  • ml-etcd1003.eqiad.wmnet
  • ml-serve-ctrl1001.eqiad.wmnet
  • ml-serve-ctrl1002.eqiad.wmnet
  • moscovium.eqiad.wmnet
  • mwdebug1001.eqiad.wmnet
  • mwdebug1002.eqiad.wmnet
  • mx1001.wikimedia.org
  • ncredir1001.eqiad.wmnet
  • ncredir1002.eqiad.wmnet
  • netbox1001.wikimedia.org
  • netboxdb1001.eqiad.wmnet
  • netflow1001.eqiad.wmnet
  • orespoolcounter1003.eqiad.wmnet
  • orespoolcounter1004.eqiad.wmnet
  • otrs1001.eqiad.wmnet
  • people1003.eqiad.wmnet
  • ping1002.eqiad.wmnet
  • planet1002.eqiad.wmnet
  • poolcounter1004.eqiad.wmnet
  • poolcounter1005.eqiad.wmnet
  • puppetboard1001.eqiad.wmnet (decom)
  • puppetboard1002.eqiad.wmnet
  • puppetdb1002.eqiad.wmnet
  • registry1003.eqiad.wmnet
  • registry1004.eqiad.wmnet
  • releases1002.eqiad.wmnet
  • rpki1001.eqiad.wmnet
  • schema1003.eqiad.wmnet
  • schema1004.eqiad.wmnet
  • seaborgium.wikimedia.org
  • search-loader1001.eqiad.wmnet
  • testreduce1001.eqiad.wmnet
  • urldownloader1001.wikimedia.org
  • urldownloader1002.wikimedia.org
  • webperf1001.eqiad.wmnet
  • webperf1002.eqiad.wmnet
  • xhgui1001.eqiad.wmnet
  • zookeeper-test1002.eqiad.wmnet
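
To keep track of which instances on the checklist above still need a Ganeti-level reboot, the bot notices posted to this task ("VM ... rebooted by ... with reason: ...") can be compared against the list. The following is only a minimal sketch under that assumption; the function names `rebooted_host` and `pending_reboots` are hypothetical and not part of any cookbook.

```python
import re
from typing import Iterable, Optional, Set

# Matches the bot notices seen in this task's timeline, e.g.
# "VM otrs1001.eqiad.wmnet rebooted by aokoth@cumin1001 with reason: Ganeti Migration"
REBOOT_NOTICE = re.compile(
    r"^VM (?P<host>\S+) rebooted by (?P<user>[^@\s]+)@(?P<origin>\S+) "
    r"with reason: (?P<reason>.*)$"
)


def rebooted_host(line: str) -> Optional[str]:
    """Return the host name if the line is a reboot notice, else None."""
    match = REBOOT_NOTICE.match(line.strip())
    return match.group("host") if match else None


def pending_reboots(checklist: Iterable[str], timeline: Iterable[str]) -> Set[str]:
    """Return checklist hosts that have no reboot notice in the timeline."""
    done = {host for host in map(rebooted_host, timeline) if host}
    return set(checklist) - done
```

A host that was rebooted without a notice being posted to the task would still show up as pending here, so the result is a reminder list rather than proof that a reboot is missing.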

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-12-01T13:30:25Z] <moritzm> set "sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8" for ganeti eqiad cluster T294120

VM etherpad1002.eqiad.wmnet rebooted by aokoth@cumin1001 with reason: Ganeti Migration

VM otrs1001.eqiad.wmnet rebooted by aokoth@cumin1001 with reason: Ganeti Migration

VM acmechief-test1001.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None

Mentioned in SAL (#wikimedia-operations) [2022-01-11T15:55:52Z] <vgutierrez> disable puppet on acme-chief clients for acmechief1001 reboot - T294120

VM acmechief1001.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None

Mentioned in SAL (#wikimedia-operations) [2022-01-11T15:59:11Z] <vgutierrez> re-enable puppet on acme-chief clients after acmechief1001 reboot - T294120

VM ncredir1001.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None

VM ncredir1002.eqiad.wmnet rebooted by vgutierrez@cumin1001 with reason: None

VM doh1002.wikimedia.org rebooted by sukhe@cumin1001 with reason: rebooting for T294120

VM durum1001.eqiad.wmnet rebooted by sukhe@cumin1001 with reason: rebooting for T294120

VM durum1002.eqiad.wmnet rebooted by sukhe@cumin1001 with reason: rebooting for T294120

ssingh added a subscriber: ssingh.

doh1001 was also restarted, but I forgot to add the -t switch, which is why the ops bot didn't catch it :) Updated the hosts.

VM logstash1024.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1025.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1030.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1031.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1032.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1007.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1008.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM logstash1009.eqiad.wmnet rebooted by cwhite@cumin1001 with reason: None

VM gitlab-runner1001.eqiad.wmnet rebooted by jelto@cumin1001 with reason: Ganeti Migration

VM gitlab1001.wikimedia.org rebooted by jelto@cumin1001 with reason: Ganeti Migration

Change 753506 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/mediawiki-config@master] Depool poolcounter1004

https://gerrit.wikimedia.org/r/753506

Change 753506 merged by jenkins-bot:

[operations/mediawiki-config@master] Depool poolcounter1004

https://gerrit.wikimedia.org/r/753506

Change 753511 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/mediawiki-config@master] Repool poolcounter1004, depool poolcounter1005

https://gerrit.wikimedia.org/r/753511

Change 753511 merged by jenkins-bot:

[operations/mediawiki-config@master] Repool poolcounter1004, depool poolcounter1005

https://gerrit.wikimedia.org/r/753511

Change 753519 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/mediawiki-config@master] Repool poolcounter1005

https://gerrit.wikimedia.org/r/753519

Change 753519 merged by jenkins-bot:

[operations/mediawiki-config@master] Repool poolcounter1005

https://gerrit.wikimedia.org/r/753519
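
The three config changes above follow a rolling pattern: one poolcounter host is depooled at a time, so that its peer keeps serving while it is (presumably) rebooted. A minimal sketch of that ordering, assuming the reboot happens between each depool and repool; `rolling_plan` is a hypothetical helper, and the step strings merely mirror the commit subjects.

```python
from typing import List


def rolling_plan(hosts: List[str]) -> List[str]:
    """Order depool/reboot/repool steps so only one host is out at a time."""
    steps: List[str] = []
    for index, host in enumerate(hosts):
        if index == 0:
            # The first host needs its own depool change.
            steps.append(f"depool {host}")
        steps.append(f"reboot {host}")
        if index + 1 < len(hosts):
            # Repooling one host and depooling the next share a single change.
            steps.append(f"repool {host}, depool {hosts[index + 1]}")
        else:
            steps.append(f"repool {host}")
    return steps


if __name__ == "__main__":
    for step in rolling_plan(["poolcounter1004", "poolcounter1005"]):
        print(step)
```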

If I understand this task correctly, the Ganeti cluster is currently running on stretch nodes. The VMs themselves have no explicit kvm:machine_version set, which on stretch nodes means "pc-i440fx-2.8", but on buster nodes that would default to "pc-i440fx-3.1", and thus they would be incompatible.

Wouldn't setting kvm:machine_version=pc-i440fx-2.8 as a global parameter make pc-i440fx-2.8 the default for buster nodes as well?

Yes, that's exactly what we did :-) But each instance only picks up that config change with the next KVM restart.
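
The reasoning in this exchange can be captured in a small model. This is a simplified sketch, not Ganeti's actual implementation: the per-release defaults are the ones quoted in the discussion, and `RunningVM`, `restart`, and `can_migrate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional

# qemu default machine types per Debian release, as stated in the discussion.
QEMU_DEFAULT = {"stretch": "pc-i440fx-2.8", "buster": "pc-i440fx-3.1"}


@dataclass(frozen=True)
class RunningVM:
    name: str
    node_release: str            # release of the node the VM currently runs on
    started_with: Optional[str]  # explicit machine type at KVM start, None = default


def effective_type(explicit: Optional[str], release: str) -> str:
    """Machine type a KVM process ends up with on a node of this release."""
    return explicit or QEMU_DEFAULT[release]


def restart(vm: RunningVM, cluster_machine_version: Optional[str]) -> RunningVM:
    """A Ganeti-level restart picks up the current cluster-wide parameter."""
    return replace(vm, started_with=cluster_machine_version)


def can_migrate(vm: RunningVM, target_release: str) -> bool:
    """Migration needs the same machine type on the source and target node."""
    source = effective_type(vm.started_with, vm.node_release)
    target = effective_type(vm.started_with, target_release)
    return source == target


if __name__ == "__main__":
    vm = RunningVM("etherpad1002.eqiad.wmnet", "stretch", started_with=None)
    print(can_migrate(vm, "buster"))                            # False
    print(can_migrate(restart(vm, "pc-i440fx-2.8"), "buster"))  # True
```

In this model, changing the cluster parameter alone has no effect on a running VM; only `restart` changes what the instance was started with, which is why every instance on the list had to be rebooted.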

All VMs have been restarted, thanks to everyone who helped with this!