
cloud: libvirt doesn't support live migration when using nested KVM
Closed, ResolvedPublic

Description

We recently enabled nested VMX virtualization in CloudVPS.

Today, while trying to drain a hypervisor, I discovered a few VMs that won't migrate away.

The apparent reason:

Mar 2, 2021 @ 10:28:27.425 nova-compute cloudvirt1023 ERROR nova.virt.libvirt.driver [-] [instance: 05f311e9-1ef4-4acd-b467-adb59f6c2f93] Live Migration failure: internal error: unable to execute QEMU command 'migrate': Nested VMX virtualization does not support live migration yet: libvirt.libvirtError: internal error: unable to execute QEMU command 'migrate': Nested VMX virtualization does not support live migration yet

According to https://forum.proxmox.com/threads/proxmox-6-x-nested-vmx-virtualization-does-not-support-live-migration-yet.58478/ and other sources, live migration of nested-VMX guests is supported with (a rough version check is sketched after the list):

  • Linux kernel >= 5.0
  • QEMU >= 4.1.0
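
A quick way to check whether a hypervisor meets those two requirements. This is only a rough sketch I'm adding for illustration; the version parsing is mine and not part of any existing tooling:

# Rough check of the reported requirements for live-migrating nested-VMX guests
# (Linux >= 5.0, QEMU >= 4.1.0); run on the hypervisor itself.
import platform
import re
import subprocess

def as_tuple(version):
    """'5.10.0-8-amd64' or '4.1.0' -> (5, 10, 0) / (4, 1, 0)."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])

kernel = platform.release()
qemu_text = subprocess.run(["qemu-system-x86_64", "--version"],
                           capture_output=True, text=True).stdout
qemu = re.search(r"version\s+([\d.]+)", qemu_text).group(1)

print(f"kernel {kernel}: {'ok' if as_tuple(kernel) >= (5, 0, 0) else 'too old'}")
print(f"qemu   {qemu}: {'ok' if as_tuple(qemu) >= (4, 1, 0) else 'too old'}")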

Event Timeline

aborrero triaged this task as High priority.
aborrero created this task.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.

VMs with the VMX CPU flag (barring some unreachable VMs):

aborrero@cloud-cumin-01:~$ sudo cumin --force -x '*' 'grep -q vmx /proc/cpuinfo && echo "Contains VMX cpu flag"'
[..]
===== NODE GROUP =====                                                                                                                                                                                             
(27) abogott-puppetclient.testlabs.eqiad1.wikimedia.cloud,alderaan.rcm.eqiad1.wikimedia.cloud,clouddb[1001,1004].clouddb-services.eqiad1.wikimedia.cloud,clouddb-wikireplicas-proxy-[1-2].clouddb-services.eqiad1.wikimedia.cloud,cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud,content-similarity-prototype.wmf-research-tools.eqiad1.wikimedia.cloud,cyberbot-exec-iabot-02.cyberbot.eqiad1.wikimedia.cloud,deployment-puppetdb03.deployment-prep.eqiad1.wikimedia.cloud,doc1002.devtools.eqiad1.wikimedia.cloud,doc.devtools.eqiad1.wikimedia.cloud,dumps-[4-5].dumps.eqiad1.wikimedia.cloud,maps-beta-1.entity-detection.eqiad1.wikimedia.cloud,region-groundtruth-test.wmf-research-tools.eqiad1.wikimedia.cloud,skins.reading-web-staging.eqiad1.wikimedia.cloud,spd-test.recommendation-api.eqiad1.wikimedia.cloud,toolhub-beta01.toolhub.eqiad1.wikimedia.cloud,tools-k8s-etcd-[7-8].tools.eqiad1.wikimedia.cloud,toolsbeta-test-k8s-etcd-[7-8].toolsbeta.eqiad1.wikimedia.cloud,toolserver-proxy-01.tools.eqiad1.wikimedia.cloud,wikidata-list-bulding.wmf-research-tools.eqiad1.wikimedia.cloud,wikipediaWikidata.wmf-research-tools.eqiad1.wikimedia.cloud,wikiwho-ios-experiments.mobile.eqiad1.wikimedia.cloud
----- OUTPUT of 'grep -q vmx /pro...ns VMX cpu flag"' -----                                                                                                                                                        
Contains VMX cpu flag  

A table of hypervisor / VM pairs:

cloudvirt1019 clouddb1001 
cloudvirt1019 clouddb1004 
cloudvirt1023 abogott-puppetclient 
cloudvirt1023 deployment-puppetdb03 
cloudvirt1023 region-groundtruth-test 
cloudvirt1027 cloudinfra-acme-chief-01 
cloudvirt1027 cyberbot-exec-iabot-02 
cloudvirt1027 doc1002 
cloudvirt1027 skins 
cloudvirt1027 tools-k8s-etcd-8 
cloudvirt1027 wikidata-list-bulding 
cloudvirt1028 alderaan 
cloudvirt1028 clouddb-wikireplicas-proxy-1 
cloudvirt1028 doc 
cloudvirt1028 spd-test 
cloudvirt1028 tools-k8s-etcd-7 
cloudvirt1029 clouddb-wikireplicas-proxy-2 
cloudvirt1029 dumps-4 
cloudvirt1029 dumps-5 
cloudvirt1029 toolsbeta-test-k8s-etcd-7 
cloudvirt1029 wikipediaWikidata 
cloudvirt1029 wikiwho-ios-experiments 
cloudvirt1030 content-similarity-prototype 
cloudvirt1030 maps-beta-1 
cloudvirt1030 toolhub-beta01 
cloudvirt1030 toolsbeta-test-k8s-etcd-8 
cloudvirt1030 toolserver-proxy-01

Affected hypervisors (those that we cannot fully drain today using Ceph-enabled live migration):

cloudvirt1019
cloudvirt1023
cloudvirt1027
cloudvirt1028
cloudvirt1029
cloudvirt1030

Mentioned in SAL (#wikimedia-cloud) [2021-03-02T11:59:07Z] <arturo> cloudvirt1023 is affected by T276208 and cannot be rebooted. Put it back into the ceph host aggregate

<dcaro>"Migrating an L1 guest merely configured to support nesting, while not actually running L2 guests, is expected to function normally. Live-migrating an L2 guest from one L1 guest to another is also expected to succeed. "

<arturo> I saw an online comment mentioning that we need linux > 5 and qemu > 4.1 for it to work

So we have two problems: first, how to reboot servers today, and second, what to do going forward.

  • If we remove cpu_model_extra_flags = vmx,pcid from nova.conf and reboot a given VM, does it become migratable, or is that setting sticky across reboots?
  • Can we 'live migrate' one of the affected VMs if it is shut down?
  • How far off are we from running the kernel and qemu versions that would make this feature work properly?

For the long term, I'm thinking we need to either set aside a reserved hypervisor in its own aggregate that supports this feature (and figure out how to expose that as a scheduling option to users), or simply not support this at all.

From my experience with the wmcs-drain-hypervisor.py script today, I think we can improve our workflows a bit:

  • the script fails if it finds a VM in STOP state. From reading the code, I think this is the nova API refusing to run the migrate() routine. Perhaps a simple try/catch would work better, to get at least the most obvious VMs migrated out of the hypervisor.
  • the script fails if it finds a VM already in MIGRATING state. Again, I think this is the nova API refusing to run the routine. Perhaps a simple try/catch would work better, to get at least the most obvious VMs migrated out of the hypervisor (see the sketch after this list).
  • we could introduce some special logic to handle such cases. I'm not sure if nova has some other facilities to handle migration (like, cold migration?). At that point, if standard migrate() doesn't work we probably don't care about stopping the VM and starting it on another hypervisor
  • from the script's POV, a VM that can't be migrated shows up as a VM that gets scheduled on the same hypervisor again (regardless of the hypervisor being in an unschedulable host aggregate). We could also add some logic to detect this case and force-migrate with whatever method we decide on.
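
For illustration, here is a rough sketch of that try/catch idea, assuming python-novaclient server objects; the function name and the print-based logging are placeholders, not the actual wmcs-drain-hypervisor.py code:

# Illustrative sketch only, not the real wmcs-drain-hypervisor.py.
# Assumes `server` is a python-novaclient Server object for a VM on the
# hypervisor being drained.
from novaclient.exceptions import BadRequest, Conflict

def try_to_move(server):
    """Attempt to move one VM off its hypervisor, tolerating non-ACTIVE states."""
    status = getattr(server, "status", "UNKNOWN")
    try:
        if status == "ACTIVE":
            # Normal case: let nova pick the target host.
            server.live_migrate(host=None)
        elif status == "SHUTOFF":
            # The "STOP state" case above: fall back to a cold migration
            # instead of letting the whole drain run abort.
            server.migrate()
        elif status == "MIGRATING":
            # Already on its way; skip it rather than fail.
            return
        else:
            print(f"skipping {server.name}: unexpected state {status}")
    except (Conflict, BadRequest) as exc:
        # nova refused (wrong state, no valid host, ...): log it and keep
        # draining the remaining VMs.
        print(f"could not migrate {server.name}: {exc}")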

If it helps yall make progress, I have no objection to removing the vmx flag change that was added on my behalf not too long ago.

Mentioned in SAL (#wikimedia-cloud) [2021-03-02T17:16:09Z] <andrewbogott> rebooting cloudvirt1039 to see if I can trigger T276208

Good (but silly) news: I seem not to have forwarded https://gerrit.wikimedia.org/r/c/operations/puppet/+/638146/ to the Train nova config. So new VMs made today won't have this issue (nor will they support nested VMs, which is not ideal for @dancy, but that's something we were going to have to figure out anyway).

I made an existing VM migratable like this:

update instance_extra set vcpu_model='{"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["sockets", "cores", "threads"]}, "features": [], "mode": "custom", "model": "Haswell-noTSX-IBRS", "match": "exact"}, "nova_object.changes": ["model", "features", "vendor", "topology", "mode", "match", "arch"]}'  where instance_uuid='9ce41938-f74c-40e4-81ec-5601e4fc4917';

I produced that vcpu_model content by taking the existing content and replacing the 'features' record with [].
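
For reference, roughly the same transformation as a script. This is an assumed workflow I'm sketching (pymysql, a hypothetical DB host, and placeholder credentials); the actual change was done by hand with the UPDATE shown above:

# Illustrative sketch of the in-db hack described above.
# Assumes direct MySQL access to the nova database; host/credentials are placeholders.
import json
import pymysql

uuid = "9ce41938-f74c-40e4-81ec-5601e4fc4917"  # example instance from above

conn = pymysql.connect(host="nova-db.example", user="nova",
                       password="...", database="nova")
with conn.cursor() as cur:
    cur.execute("SELECT vcpu_model FROM instance_extra WHERE instance_uuid=%s",
                (uuid,))
    model = json.loads(cur.fetchone()[0])
    # Blank the per-VM CPU feature list (this is where the vmx flag lives),
    # so the VM falls back to the plain Haswell-noTSX-IBRS model on next start.
    model["nova_object.data"]["features"] = []
    cur.execute("UPDATE instance_extra SET vcpu_model=%s WHERE instance_uuid=%s",
                (json.dumps(model), uuid))
conn.commit()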

Change 667928 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs-drain-hypervisor.py: Don't fail or lock up on VMs that aren't in state ACTIVE

https://gerrit.wikimedia.org/r/667928

@dancy does that mean that you aren't using that feature anymore, or should we figure out a way to still support this use? And, if you're still using it, can you suggest how many/what size VMs you ultimately hope to run this way?

@Andrew The project I was hoping to implement using WMCS and nested VMS didn't work out due to too many levels of VMs. It would still be nice to be able to use qemu on WMCS nodes but for the time being that is no longer a requirement.

Sounds good. We will leave it disabled for now but you should refer us back to https://phabricator.wikimedia.org/T267433 for instructions about how to re-enable if/when you need it :)

Since we're going to stop using this feature for now, I'm going to do that in-db hack to disable this feature for each affected server.

The first checkbox indicates that the db change has been made, the second that the host has been rebooted to accept the change.

  • [x] cloudvirt1019 clouddb1001
  • [x] cloudvirt1019 clouddb1004

(these two don't matter much since they're not on Ceph)

  • [x] cloudvirt1023 abogott-puppetclient
  • [x] cloudvirt1023 deployment-puppetdb03
  • [x] cloudvirt1023 region-groundtruth-test
  • [x] cloudvirt1027 cloudinfra-acme-chief-01
  • [x] cloudvirt1027 cyberbot-exec-iabot-02
  • [x] cloudvirt1027 doc1002
  • [x] cloudvirt1027 skins
  • [x] cloudvirt1027 tools-k8s-etcd-8
  • [x] cloudvirt1027 wikidata-list-bulding
  • [x] cloudvirt1028 alderaan
  • [x] cloudvirt1028 clouddb-wikireplicas-proxy-1
  • [x] cloudvirt1028 doc
  • [x] cloudvirt1028 spd-test
  • [x] cloudvirt1028 tools-k8s-etcd-7
  • [x] cloudvirt1029 clouddb-wikireplicas-proxy-2
  • [x] cloudvirt1029 dumps-4
  • [x] cloudvirt1029 dumps-5
  • [x] cloudvirt1029 toolsbeta-test-k8s-etcd-7
  • [x] cloudvirt1029 wikipediaWikidata
  • [x] cloudvirt1029 wikiwho-ios-experiments
  • [x] cloudvirt1030 content-similarity-prototype
  • [x] cloudvirt1030 maps-beta-1
  • [x] cloudvirt1030 toolhub-beta01
  • [x] cloudvirt1030 toolsbeta-test-k8s-etcd-8
  • [x] cloudvirt1030 toolserver-proxy-01

Mentioned in SAL (#wikimedia-cloud) [2021-03-02T23:04:50Z] <andrewbogott> rebooting cyberbot-exec-iabot-02 for T276208

Change 667928 merged by Andrew Bogott:
[operations/puppet@production] wmcs-drain-hypervisor.py: Better handling of VMS not in state ACTIVE

https://gerrit.wikimedia.org/r/667928

Andrew claimed this task.