Page MenuHomePhabricator

Rebooting more than 2 instances via Horizon sends one instance (at least?) into an error state
Closed, InvalidPublic

Description

Since there were many grid nodes to reboot today, I tried doing it in batches from Horizon. I found that every, single time, the last host in the list ends up in an error state that requires a hard reboot via CLI to recover. This seems like a bug. A 500 error was generated on interaction with the API at one point. It may not be "our bug" in the end, but this is to track things nonetheless.

Event Timeline

One example was rebooting: tools-sgeexec-0911, tools-sgeexec-0912 and tools-sgeexec-0913 at the same time. tools-sgeexec-0913 went into an error state with a 500 from the API. After hard reboot, it was recovered fully.

Note: these are repooled on the grid now, just FYI.

Mentioned in SAL (#wikimedia-cloud) [2019-02-25T21:41:47Z] <andrewbogott> depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066

I was unable to reproduce this with test VMs. Trying again with the aforementioned exec nodes.

Worked fine when I just tried it :-/ Let me see if I can figure out an exact replica of what happened.

Mentioned in SAL (#wikimedia-cloud) [2019-02-25T23:20:20Z] <bstorm_> Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066

I just tried rebooting 5 at once from horizon the same way, and none failed. How maddening. Worse than that, the gridengine-exec service required no manual intervention to come up. LDAP was available right away, unlike in T217062.

So I found the errors. It's only on the hypervisor's nova logs:

2019-02-25 16:51:27.137 238816 ERROR nova.compute.manager [req-706ead47-e8c3-4071-8436-59de5cc3ec77 bstorm tools - - -] [instance: 5c7992fe-9570-4e68-9e61-23e0c7a203e8] Cannot reboot instance: internal error: process exited while connecting to monitor: base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/nova/instances/5c7992fe-9570-4e68-9e61-23e0c7a203e8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=45,id=hostnet0,vhost=on,vhostfd=57 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:0f:da:b6,bus=pci.0,addr=0x3 -add-fd set=2,fd=59 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -chardev pty,id=charchannel0 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5902,addr=0.0.0.0,disable-ticketing,
2019-02-25 16:51:27.144 238816 INFO nova.virt.libvirt.driver [-] [instance: 20f6dfb3-1e11-4e7b-ab5d-29b85804acfa] Instance rebooted successfully.
2019-02-25 16:51:27.307 238816 INFO nova.compute.manager [req-2cf2e366-cb1e-43b4-ab95-1e6c5dc90822 - - - - -] [instance: 20f6dfb3-1e11-4e7b-ab5d-29b85804acfa] VM Started (Lifecycle Event)
2019-02-25 16:51:27.616 238816 INFO nova.compute.manager [req-706ead47-e8c3-4071-8436-59de5cc3ec77 bstorm tools - - -] [instance: 5c7992fe-9570-4e68-9e61-23e0c7a203e8] Successfully reverted task state from reboot_started on failure for instance.
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server [req-706ead47-e8c3-4071-8436-59de5cc3ec77 bstorm tools - - -] Exception during message handling
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 110, in wrapped
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     payload)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 89, in wrapped
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 359, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     LOG.warning(msg, e, instance=instance)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 328, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 409, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 387, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 375, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3093, in reboot_instance
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self._set_instance_obj_error_state(context, instance)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3074, in reboot_instance
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     bad_volumes_callback=bad_volumes_callback)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2415, in reboot
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     block_device_info)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2512, in _hard_reboot
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     vifs_already_plugged=True)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4924, in _create_domain_and_network
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     xml, pause=pause, power_on=power_on)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4855, in _create_domain
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     guest.launch(pause=pause)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 142, in launch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self._encoded_xml, errors='ignore')
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 137, in launch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return self._domain.createWithFlags(flags)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     result = proxy_call(self._autowrap, f, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     rv = execute(f, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(c, e, tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     rv = meth(*args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1065, in createWithFlags
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server libvirtError: internal error: process exited while connecting to monitor: base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/nova/instances/5c7992fe-9570-4e68-9e61-23e0c7a203e8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=45,id=hostnet0,vhost=on,vhostfd=57 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:0f:da:b6,bus=pci.0,addr=0x3 -add-fd set=2,fd=59 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -chardev pty,id=charchannel0 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5902,addr=0.0.0.0,disable-ticketing,
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server

So that's what it looks like on the server. I got back a 500 with some little bits at the start. The instance ended up in the error state.

That looks to me like something crashed (libvirt, or kvm) or was killed. My only guess is that it's possible for a host to go OOM and kill off VMs if too many things are spinning up at once. I don't think that any of our hosts are that heavily loaded now, but that's the only theory I've got.

closing until/unless we see this again