Description
Since there were many grid nodes to reboot today, I tried doing it in batches from Horizon. I found that every single time, the last host in the list ends up in an error state that requires a hard reboot via the CLI to recover. This seems like a bug. At one point a 500 error came back from the API. It may not be "our bug" in the end, but this task is to track things nonetheless.
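The recovery can be driven through the API as well as the CLI. A minimal sketch with openstacksdk, assuming a clouds.yaml entry named "wmcs" (a placeholder, as is the instance name):

```python
import openstack

# Placeholder cloud entry; needs admin rights on the affected project.
conn = openstack.connect(cloud="wmcs")

# Find the instance that was left in ERROR and hard-reboot it, which is
# the same thing the manual CLI recovery does.
server = conn.compute.find_server("tools-sgeexec-0913")
conn.compute.reboot_server(server, reboot_type="HARD")
```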
Related Objects
- Mentioned Here
- T217062: gridengine-exec should start after nslcd
Event Timeline
One example was rebooting tools-sgeexec-0911, tools-sgeexec-0912, and tools-sgeexec-0913 at the same time. tools-sgeexec-0913 went into an error state with a 500 from the API. After a hard reboot, it recovered fully.
Note: these are repooled on the grid now, just FYI.
Mentioned in SAL (#wikimedia-cloud) [2019-02-25T21:41:47Z] <andrewbogott> depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066
I was unable to reproduce this with test VMs. Trying again with the aforementioned exec nodes.
Worked fine when I just tried it :-/ Let me see if I can reconstruct exactly what happened the first time.
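To make these attempts repeatable, here is a rough sketch (openstacksdk again; the cloud entry and node names are placeholders) of the same kind of back-to-back soft reboot that a Horizon batch action issues, polling until something lands in ERROR or everything comes back:

```python
import time

import openstack

conn = openstack.connect(cloud="wmcs")  # placeholder cloud entry
names = ["tools-sgeexec-0911", "tools-sgeexec-0912", "tools-sgeexec-0913"]
servers = [conn.compute.find_server(name) for name in names]

# Kick off the soft reboots back to back, as a Horizon batch action does.
for server in servers:
    conn.compute.reboot_server(server, reboot_type="SOFT")

# Give the reboots a moment to start, then poll until everything settles.
time.sleep(15)
while True:
    states = {s.name: conn.compute.get_server(s.id).status for s in servers}
    print(states)
    if "ERROR" in states.values():
        print("reproduced: an instance went into ERROR")
        break
    if all(state == "ACTIVE" for state in states.values()):
        print("no failure this run")
        break
    time.sleep(10)
```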
Mentioned in SAL (#wikimedia-cloud) [2019-02-25T23:20:20Z] <bstorm_> Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066
I just tried rebooting 5 at once from horizon the same way, and none failed. How maddening. Worse than that, the gridengine-exec service required no manual intervention to come up. LDAP was available right away, unlike in T217062.
So I found the errors. They only appear in the nova logs on the hypervisor:
2019-02-25 16:51:27.137 238816 ERROR nova.compute.manager [req-706ead47-e8c3-4071-8436-59de5cc3ec77 bstorm tools - - -] [instance: 5c7992fe-9570-4e68-9e61-23e0c7a203e8] Cannot reboot instance: internal error: process exited while connecting to monitor: base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/nova/instances/5c7992fe-9570-4e68-9e61-23e0c7a203e8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=45,id=hostnet0,vhost=on,vhostfd=57 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:0f:da:b6,bus=pci.0,addr=0x3 -add-fd set=2,fd=59 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -chardev pty,id=charchannel0 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5902,addr=0.0.0.0,disable-ticketing,
2019-02-25 16:51:27.144 238816 INFO nova.virt.libvirt.driver [-] [instance: 20f6dfb3-1e11-4e7b-ab5d-29b85804acfa] Instance rebooted successfully.
2019-02-25 16:51:27.307 238816 INFO nova.compute.manager [req-2cf2e366-cb1e-43b4-ab95-1e6c5dc90822 - - - - -] [instance: 20f6dfb3-1e11-4e7b-ab5d-29b85804acfa] VM Started (Lifecycle Event)
2019-02-25 16:51:27.616 238816 INFO nova.compute.manager [req-706ead47-e8c3-4071-8436-59de5cc3ec77 bstorm tools - - -] [instance: 5c7992fe-9570-4e68-9e61-23e0c7a203e8] Successfully reverted task state from reboot_started on failure for instance.
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server [req-706ead47-e8c3-4071-8436-59de5cc3ec77 bstorm tools - - -] Exception during message handling
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 110, in wrapped
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     payload)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 89, in wrapped
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 359, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     LOG.warning(msg, e, instance=instance)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 328, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 409, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 387, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     kwargs['instance'], e, sys.exc_info())
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 375, in decorated_function
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3093, in reboot_instance
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self._set_instance_obj_error_state(context, instance)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 3074, in reboot_instance
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     bad_volumes_callback=bad_volumes_callback)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2415, in reboot
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     block_device_info)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2512, in _hard_reboot
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     vifs_already_plugged=True)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4924, in _create_domain_and_network
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     xml, pause=pause, power_on=power_on)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4855, in _create_domain
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     guest.launch(pause=pause)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 142, in launch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self._encoded_xml, errors='ignore')
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     self.force_reraise()
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 137, in launch
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     return self._domain.createWithFlags(flags)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     result = proxy_call(self._autowrap, f, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     rv = execute(f, *args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     six.reraise(c, e, tb)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     rv = meth(*args, **kwargs)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1065, in createWithFlags
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server libvirtError: internal error: process exited while connecting to monitor: base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/lib/nova/instances/5c7992fe-9570-4e68-9e61-23e0c7a203e8/disk,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=45,id=hostnet0,vhost=on,vhostfd=57 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:0f:da:b6,bus=pci.0,addr=0x3 -add-fd set=2,fd=59 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -chardev pty,id=charchannel0 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5902,addr=0.0.0.0,disable-ticketing,
2019-02-25 16:51:27.627 238816 ERROR oslo_messaging.rpc.server
So that's what it looks like on the server side. The client got back a 500 with a few fragments of that error at the start, and the instance ended up in the ERROR state.
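Anything left in that state can be spotted from the API by filtering on status. A small sketch, with the usual placeholder cloud name (the fault field should carry the same libvirt message when the API exposes it):

```python
import openstack

conn = openstack.connect(cloud="wmcs")  # placeholder cloud entry

# List only instances currently in ERROR and print what nova recorded.
for server in conn.compute.servers(status="ERROR"):
    print(server.name, server.status)
    if server.fault:
        print("  fault:", server.fault.get("message"))
```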
That looks to me like something crashed (libvirt or kvm) or was killed. My only guess is that a host could go OOM and kill off VMs if too many things are spinning up at once. I don't think any of our hosts are that heavily loaded right now, but that's the only theory I've got.
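One way to check that theory after the fact is to look for oom-killer activity in the hypervisor's kernel log around the reboot window. A sketch; the log path is an assumption (journalctl -k would do just as well):

```python
import re

# Patterns the kernel logs when the oom-killer fires or kills a process.
oom = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

# Assumed location of the kernel log on these hosts.
with open("/var/log/kern.log", errors="replace") as kern_log:
    for line in kern_log:
        if oom.search(line):
            print(line.rstrip())
```

If nothing turns up there for the failure window, the OOM theory is out and the crash is more likely in libvirt/qemu itself.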