
Start/shutdown VMs automatically on hypervisor boot/shutdown
Closed, ResolvedPublic

Description

Hopefully this would allow us to set VMs to autostart (virsh autostart $vm), without causing more trouble when we reboot hypervisors

https://github.com/libvirt/libvirt/blob/master/tools/libvirt-guests.sysconf#L15
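
For reference, a minimal sketch of the relevant libvirt-guests settings (on Debian they live in /etc/default/libvirt-guests; the values below are illustrative, not the ones deployed here):

# /etc/default/libvirt-guests (illustrative sketch)
ON_BOOT=start          # start guests that were running (or marked autostart) when the host boots
START_DELAY=10         # seconds to wait between starting each guest
ON_SHUTDOWN=shutdown   # gracefully shut guests down on host shutdown instead of killing them
SHUTDOWN_TIMEOUT=300   # seconds to wait for each guest to shut down cleanly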

Event Timeline

GTirloni renamed this task from Implement START_DELAY in libvirt to Start/shutdown VMs automatically on hypervisor boot/shutdown.Mar 2 2019, 1:24 PM
GTirloni claimed this task.
GTirloni triaged this task as High priority.

Change 493807 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack: Automatically start/stop VMs on hypervisor boot/shutdown

https://gerrit.wikimedia.org/r/493807

Bstorm added subscribers: aborrero, Bstorm.

Discussed in meeting today and (without @aborrero and @GTirloni) everybody present likes the idea :)

I have one question: is this mechanism smart enough not to start VMs on boot that were already shut off before the reboot?

EDIT: I see Andrew already asked the same question.

Change 493807 merged by GTirloni:
[operations/puppet@production] openstack: Automatically start/stop VMs on hypervisor boot/shutdown

https://gerrit.wikimedia.org/r/493807

Change 496778 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack: Only perform VM startup/shutdown on Stretch

https://gerrit.wikimedia.org/r/496778

Change 496778 merged by GTirloni:
[operations/puppet@production] openstack: Only perform VM startup/shutdown on Stretch

https://gerrit.wikimedia.org/r/496778

The change is applied and libvirt-guests tries to do its job as expected. However, it fails because the bridge interface does not come up quickly enough:

Mar 15 14:39:38 cloudvirt1015 systemd[1]: Starting Suspend/Resume Running libvirt Guests...
Mar 15 14:39:39 cloudvirt1015 libvirt-guests.sh[1907]: Resuming guests on default URI...
Mar 15 14:39:39 cloudvirt1015 libvirt-guests.sh[1907]: Resuming guest i-00005835: error: Failed to start domain i-00005835
Mar 15 14:39:39 cloudvirt1015 libvirt-guests.sh[1907]: error: Cannot get interface MTU on 'brq7425e328-56': No such device
Mar 15 14:39:39 cloudvirt1015 systemd[1]: libvirt-guests.service: Main process exited, code=exited, status=1/FAILURE
Mar 15 14:39:39 cloudvirt1015 systemd[1]: Failed to start Suspend/Resume Running libvirt Guests.
Mar 15 14:39:39 cloudvirt1015 systemd[1]: libvirt-guests.service: Unit entered failed state.
Mar 15 14:39:39 cloudvirt1015 systemd[1]: libvirt-guests.service: Failed with result 'exit-code'.

After a few seconds, brq7425e328-56 becomes available and the VM can be started.

This issue has been reported by many people since 2009, but there is still no fix in place. See https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/495394

So the current situation is:

  • On shutdown, the VMs are gracefully shut down instead of being abruptly killed. This is an improvement.
  • On boot, the VMs do not start because the network bridge interface is not yet available (race condition).

There should be a way for systemd to express this dependency, so that libvirt-guests waits until all Neutron bridges are set up. It might be as simple as adding a dependency on neutron-linuxbridge-agent, I don't know (see the sketch below).
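
A minimal sketch of what such a dependency could look like, as a systemd drop-in for libvirt-guests.service. This was not deployed, and the unit name neutron-linuxbridge-agent.service is an assumption based on the usual package naming:

# /etc/systemd/system/libvirt-guests.service.d/neutron.conf (hypothetical drop-in)
[Unit]
# Wait for the Neutron Linux bridge agent so the brqXXXX bridges exist
# before libvirt-guests tries to start the guests.
Wants=neutron-linuxbridge-agent.service
After=neutron-linuxbridge-agent.service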

GTirloni removed a subscriber: GTirloni.

Instead of using virsh autostart, could we let Nova resume the state of VMs after a hypervisor reboot?

This nova config option provides that functionality:

resume_guests_state_on_host_boot = False
    (Boolean) Whether to start guests that were running before the host rebooted

https://docs.openstack.org/mitaka/config-reference/compute/config-options.html
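
In nova.conf this would look roughly like the following (a sketch; the option lives in the [DEFAULT] section):

# /etc/nova/nova.conf (sketch)
[DEFAULT]
# Restart guests that were running before the hypervisor rebooted
resume_guests_state_on_host_boot = True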

Mentioned in SAL (#wikimedia-operations) [2019-07-12T15:25:05Z] <jeh> rebooting cloudvirt1018.eqiad.wmnet T216040

Mentioned in SAL (#wikimedia-operations) [2019-07-12T19:08:28Z] <jeh> rebooting cloudvirt1018.eqiad.wmnet T216040

Ran some tests with resume_guests_state_on_host_boot enabled and libvirt-guests configured not to start VMs.

After making the changes, we can see that the VMs are cleanly shut down:

Jul 12 19:28:14 cloudvirt1018 systemd[1]: Stopping Suspend/Resume Running libvirt Guests...
Jul 12 19:28:15 cloudvirt1018 libvirt-guests.sh[4926]: Running guests on default URI: i-000067da, i-00009cdb
Jul 12 19:28:15 cloudvirt1018 libvirt-guests.sh[4926]: Shutting down guests on default URI...
Jul 12 19:28:15 cloudvirt1018 libvirt-guests.sh[4926]: Starting shutdown on guest: i-000067da
Jul 12 19:28:15 cloudvirt1018 libvirt-guests.sh[4926]: Starting shutdown on guest: i-00009cdb
Jul 12 19:28:16 cloudvirt1018 libvirt-guests.sh[4926]: Waiting for 2 guests to shut down, 300 seconds left
Jul 12 19:28:19 cloudvirt1018 libvirt-guests.sh[4926]: Shutdown of guest i-000067da complete.
Jul 12 19:28:19 cloudvirt1018 libvirt-guests.sh[4926]: Shutdown of guest i-00009cdb complete.
Jul 12 19:28:19 cloudvirt1018 systemd[1]: Stopped Suspend/Resume Running libvirt Guests.

Even though the VMs are shut down and the hypervisor is offline, they still show the desired ACTIVE state in nova:

+--------------------------------------+---------------+--------+----------------------------------------+
| ID                                   | Name          | Status | Networks                               |
+--------------------------------------+---------------+--------+----------------------------------------+
| 3e2b47dd-0df2-45bf-9f48-84ca6dc4eaf6 | jeh-cv1018-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.100 |
| 587fe161-c82e-4d4f-89b4-de5492ec4296 | canary1018-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.3.114 |
+--------------------------------------+---------------+--------+----------------------------------------+

When the host comes back online, nova restarts the VMs that were previously active:

2019-07-12 19:37:31.465 2004 INFO nova.compute.manager [req-dc605927-b36d-420b-a3bd-9fbc47e843bd - - - - -] [instance: 587fe161-c82e-4d4f-89b4-de5492ec4296] Rebooting instance after nova-compute restart.
2019-07-12 19:37:31.500 2004 INFO nova.virt.libvirt.driver [-] [instance: 587fe161-c82e-4d4f-89b4-de5492ec4296] Instance destroyed successfully.
2019-07-12 19:37:32.317 2004 INFO nova.compute.manager [-] [instance: 587fe161-c82e-4d4f-89b4-de5492ec4296] VM Resumed (Lifecycle Event)
2019-07-12 19:37:32.326 2004 INFO nova.virt.libvirt.driver [-] [instance: 587fe161-c82e-4d4f-89b4-de5492ec4296] Instance rebooted successfully.
2019-07-12 19:37:32.337 2004 INFO nova.compute.manager [req-dc605927-b36d-420b-a3bd-9fbc47e843bd - - - - -] [instance: 3e2b47dd-0df2-45bf-9f48-84ca6dc4eaf6] Rebooting instance after nova-compute restart.
2019-07-12 19:37:32.370 2004 INFO nova.virt.libvirt.driver [-] [instance: 3e2b47dd-0df2-45bf-9f48-84ca6dc4eaf6] Instance destroyed successfully.
2019-07-12 19:37:32.452 2004 INFO nova.compute.manager [req-ce6d37ed-47c8-410b-b1f8-98aa3dcd2563 - - - - -] [instance: 587fe161-c82e-4d4f-89b4-de5492ec4296] VM Started (Lifecycle Event)
2019-07-12 19:37:34.547 2004 INFO nova.compute.manager [req-ce6d37ed-47c8-410b-b1f8-98aa3dcd2563 - - - - -] [instance: 3e2b47dd-0df2-45bf-9f48-84ca6dc4eaf6] VM Resumed (Lifecycle Event)
2019-07-12 19:37:34.557 2004 INFO nova.virt.libvirt.driver [-] [instance: 3e2b47dd-0df2-45bf-9f48-84ca6dc4eaf6] Instance rebooted successfully.
2019-07-12 19:37:34.689 2004 INFO nova.compute.manager [req-ce6d37ed-47c8-410b-b1f8-98aa3dcd2563 - - - - -] [instance: 3e2b47dd-0df2-45bf-9f48-84ca6dc4eaf6] VM Started (Lifecycle Event)

(Note that it's using the same process you'd see with openstack server start and openstack server stop. Destroy sounds scary, but it's actually just a shutdown event.)


libvirt-guests should only be used to cleanly shut down VMs that are managed by OpenStack.
Using nova-compute to decide which VMs should be running after a reboot ensures that the network dependencies are in place, and keeps the hypervisor in sync with the desired state in OpenStack.
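
Roughly, the combined configuration looks like this (a sketch of the intent, not the exact deployed values):

# /etc/default/libvirt-guests (sketch): only handle clean shutdown
ON_BOOT=ignore         # do not start VMs on boot; nova-compute decides what runs
ON_SHUTDOWN=shutdown   # gracefully shut down running guests on host shutdown

# /etc/nova/nova.conf (sketch): let nova restore the previous state
[DEFAULT]
resume_guests_state_on_host_boot = True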

Change 522548 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: resume VM state on host reboot

https://gerrit.wikimedia.org/r/522548

Change 522548 merged by Andrew Bogott:
[operations/puppet@production] openstack: resume VM state on host reboot

https://gerrit.wikimedia.org/r/522548

Changes pushed and verified.