
cloudvirt ceph nodes can't launch new VMs
Closed, Resolved · Public

Description

All five of the cloudvirt ceph nodes are currently failing to launch VMs. This seems to have regressed during the Rocky upgrade.

fault                               | {'code': 500, 'created': '2020-05-14T16:17:18Z', 'message': 'internal error: qemu unexpectedly closed the monitor: Failed to open module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so: undefined symbol: qobject_input_visitor_new_keyval\n2020-05-14T16:17:16.851857Z qemu-system-x86_64: -drive file=rbd:compute/ba4ba70d-6', 'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2132, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3080, in spawn\n    destroy_disks_on_failure=True)\n  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5646, in _create_domain_and_network\n    destroy_disks_on_failure)\n  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise\n    raise value\n  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5615, in _create_domain_and_network\n    post_xml_callback=post_xml_callback)\n  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5550, in _create_domain\n    guest.launch(pause=pause)\n  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 144, in launch\n    self._encoded_xml, errors=\'ignore\')\n  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise\n    raise value\n  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 139, in launch\n    return self._domain.createWithFlags(flags)\n  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 186, in doit\n    result = proxy_call(self._autowrap, f, *args, **kwargs)\n  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 144, in proxy_call\n    rv = execute(f, *args, **kwargs)\n  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 125, in execute\n    six.reraise(c, e, tb)\n  File "/usr/lib/python3/dist-packages/eventlet/support/six.py", line 625, in reraise\n    raise value\n  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 83, in tworker\n    rv = meth(*args, **kwargs)\n  File "/usr/lib/python3/dist-packages/libvirt.py", line 1090, in createWithFlags\n    if ret == -1: raise libvirtError (\'virDomainCreateWithFlags() failed\', dom=self)\nlibvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: Failed to open module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so: undefined symbol: qobject_input_visitor_new_keyval\n2020-05-14T16:17:16.851857Z qemu-system-x86_64: -drive file=rbd:compute/ba4ba70d-6773-49e8-bf26-e3b6f8a9fb9d_disk:id=eqiad1-compute:auth_supported=cephx\\;none:mon_host=208.80.154.148\\:6789\\;208.80.154.149\\:6789\\;208.80.154.150\\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,throttling.bps-total=250000000,throttling.iops-total=250: Unknown protocol \'rbd\'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File 
"/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1940, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2229, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\nnova.exception.RescheduledException: Build of instance ba4ba70d-6773-49e8-bf26-e3b6f8a9fb9d was re-scheduled: internal error: qemu unexpectedly closed the monitor: Failed to open module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so: undefined symbol: qobject_input_visitor_new_keyval\n2020-05-14T16:17:16.851857Z qemu-system-x86_64: -drive file=rbd:compute/ba4ba70d-6773-49e8-bf26-e3b6f8a9fb9d_disk:id=eqiad1-compute:auth_supported=cephx\\;none:mon_host=208.80.154.148\\:6789\\;208.80.154.149\\:6789\\;208.80.154.150\\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,throttling.bps-total=250000000,throttling.iops-total=250: Unknown protocol \'rbd\'\n'}

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-05-14T21:12:52Z] <andrewbogott> rebuilding cloudvirt1003-wdqs as part of T252831

I tried Google-driven development and came up with this: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg02117.html

The qemu-system-x86 and qemu-block-extra package versions don't match on cloudvirt1006:

[bstorm@cloudvirt1006]:~ $ dpkg -l qemu-system-x86
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                    Version                  Architecture             Description
+++-=======================================-========================-========================-===================================================================================
ii  qemu-system-x86                         1:2.8+dfsg-6+deb9u9      amd64                    QEMU full system emulation binaries (x86)
[bstorm@cloudvirt1006]:~ $ dpkg -l qemu-block-extra
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                    Version                  Architecture             Description
+++-=======================================-========================-========================-===================================================================================
ii  qemu-block-extra:amd64                  1:2.12+dfsg-3+b1~bpo9+1  amd64                    extra block backend modules for qemu-system and qemu-utils
[bstorm@cloudvirt1006]:~ $

We do indeed have a version mismatch between these two packages here.
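
For spot-checking the rest of the fleet, a scriptable version of the same check (a sketch; dpkg-query ships with dpkg, so it should work anywhere dpkg -l does):

[bstorm@cloudvirt1006]:~ $ dpkg-query -W -f '${Package} ${Version}\n' qemu-system-x86 qemu-block-extra
qemu-system-x86 1:2.8+dfsg-6+deb9u9
qemu-block-extra 1:2.12+dfsg-3+b1~bpo9+1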

They do match on cloudvirt1004 (because this has already been noticed):

[bstorm@cloudvirt1004]:~ $ dpkg -l qemu-system-x86
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                    Version                  Architecture             Description
+++-=======================================-========================-========================-===================================================================================
ii  qemu-system-x86                         1:2.12+dfsg-3+b1~bpo9+1  amd64                    QEMU full system emulation binaries (x86)
[bstorm@cloudvirt1004]:~ $ dpkg -l qemu-block-extra
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                    Version                  Architecture             Description
+++-=======================================-========================-========================-===================================================================================
ii  qemu-block-extra:amd64                  1:2.12+dfsg-3+b1~bpo9+1  amd64                    extra block backend modules for qemu-system and qemu-utils
[bstorm@cloudvirt1004]:~ $

Apparently, the reporters in that thread also needed to fiddle with qemu.conf to change the user/group. Ours is all defaults (root:root), which is what people reported changing theirs to anyway (basically just uncommenting the lines).
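
For reference, the change amounts to uncommenting these two lines in /etc/libvirt/qemu.conf (the values shown are the Debian defaults, so this only makes the existing behavior explicit) and then restarting libvirtd:

# /etc/libvirt/qemu.conf -- the stock file ships these commented out
user = "root"
group = "root"

root@cloudvirt1006:~# systemctl restart libvirtd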

They do not match on cloudvirt-wdqs1003 either:

[bstorm@cloudvirt-wdqs1003]:~ $ dpkg -l qemu-system-x86
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                    Version                  Architecture             Description
+++-=======================================-========================-========================-===================================================================================
ii  qemu-system-x86                         1:2.8+dfsg-6+deb9u9      amd64                    QEMU full system emulation binaries (x86)
[bstorm@cloudvirt-wdqs1003]:~ $ dpkg -l qemu-block-extra
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                    Version                  Architecture             Description
+++-=======================================-========================-========================-===================================================================================
ii  qemu-block-extra:amd64                  1:2.12+dfsg-3+b1~bpo9+1  amd64                    extra block backend modules for qemu-system and qemu-utils

@JHedden tells me that when we matched up the versions, a new error popped up:

cloudvirt1004 libvirtd[7740]: 2020-05-14 17:12:39.612+0000: 7742: error : qemuBuildNicDevStr:3509 : unsupported configuration: setting MTU is not supported with this QEMU binary

Full error after upgrading the qemu packages so that the versions match:

cloudvirt1004:~$ dpkg -l | grep qemu
ii  ipxe-qemu                            1.0.0+git-20161027.b991c67-1                      all          PXE boot firmware - ROM images for qemu
ii  qemu-block-extra:amd64               1:2.12+dfsg-3+b1~bpo9+1                           amd64        extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm                             1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU Full virtualization on x86 hardware
ii  qemu-slof                            20161019+dfsg-1                                   all          Slimline Open Firmware -- QEMU PowerPC version
ii  qemu-system                          1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries
ii  qemu-system-arm                      1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (arm)
ii  qemu-system-common                   1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (common files)
ii  qemu-system-data                     1:2.12+dfsg-3+b1~bpo9+1                           all          QEMU full system emulation (data files)
ii  qemu-system-mips                     1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (mips)
ii  qemu-system-misc                     1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (miscellaneous)
ii  qemu-system-ppc                      1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc                    1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (sparc)
ii  qemu-system-x86                      1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (x86)
ii  qemu-utils                           1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU utilities
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [req-27c682bb-eda6-41a6-972e-766d35d73b54 novaadmin testlabs - default default] [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23] Instance failed to spawn: libvirt.libvirtError: unsupported configuration: setting MTU is not supported with this QEMU binary
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23] Traceback (most recent call last):
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2368, in _build_resources
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     yield resources
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2132, in _build_and_run_instance
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     block_device_info=block_device_info)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3080, in spawn
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     destroy_disks_on_failure=True)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5646, in _create_domain_and_network
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     destroy_disks_on_failure)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     self.force_reraise()
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     six.reraise(self.type_, self.value, self.tb)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     raise value
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5615, in _create_domain_and_network
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     post_xml_callback=post_xml_callback)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5550, in _create_domain
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     guest.launch(pause=pause)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 144, in launch
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     self._encoded_xml, errors='ignore')
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     self.force_reraise()
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     six.reraise(self.type_, self.value, self.tb)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     raise value
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 139, in launch
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     return self._domain.createWithFlags(flags)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 186, in doit
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     result = proxy_call(self._autowrap, f, *args, **kwargs)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     rv = execute(f, *args, **kwargs)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 125, in execute
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     six.reraise(c, e, tb)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/eventlet/support/six.py", line 625, in reraise
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     raise value
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 83, in tworker
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     rv = meth(*args, **kwargs)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]   File "/usr/lib/python3/dist-packages/libvirt.py", line 1090, in createWithFlags
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2020-05-14 17:10:07.168 1365 ERROR nova.compute.manager [instance: 9ca027cd-5917-4974-ab82-4fe7620abd23] libvirt.libvirtError: unsupported configuration: setting MTU is not supported with this QEMU binary

Here is an example creation command:

root@cloudcontrol1003:~# source ~/novaenv.sh
root@cloudcontrol1003:~# OS_PROJECT_ID=testlabs openstack server create --flavor m1.small-ceph --image debian-10.0-buster --availability-zone host:cloudvirt1006 --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 cephtest-cloudvirt1006-2

It's quite possible that the MTU error represents forward progress. Possible, but uncertain.
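
One way to probe whether the installed binary itself advertises MTU support, independent of what libvirt thinks (a diagnostic sketch; host_mtu is the virtio-net device property that the MTU feature rides on, added upstream around QEMU 2.9 as far as I can tell):

cloudvirt1004:~$ qemu-system-x86_64 -device virtio-net-pci,help 2>&1 | grep host_mtu

If that prints a host_mtu property, the 1:2.12 binary supports the feature and the complaint is coming from somewhere else.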

I tried the whole "setting the user manually" approach in /etc/libvirt/qemu.conf by uncommenting the lines about user and group, and then restarted libvirtd.

Attempting to build one on cloudvirt1004 (where I'm messing with this).

I downgraded some things on cloudvirt1004 and now I'm able to launch VMs.

ii  qemu                                 1:2.8+dfsg-6+deb9u8                               amd64        fast processor emulator
ii  qemu-block-extra:amd64               1:2.8+dfsg-6+deb9u8                               amd64        extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm                             1:2.8+dfsg-6+deb9u8                               amd64        QEMU Full virtualization on x86 hardware
ii  qemu-system-x86                      1:2.8+dfsg-6+deb9u8                               amd64        QEMU full system emulation binaries (x86)

No idea if that's a correct solution or not.

Mentioned in SAL (#wikimedia-cloud) [2020-05-14T22:15:10Z] <bstorm_> changing /etc/libvirt/qemu.conf and restarting libvirtd on cloudvirt1006 T252831

Mentioned in SAL (#wikimedia-cloud) [2020-05-14T22:21:04Z] <bstorm_> upgrading qemu-system-x86 on cloudvirt1006 to backports version T252831

Next steps are:

  • Get all qemu packages to the same 1:2.8 version, make sure things still work
  • Figure out if OpenStack upstream thinks that 1:2.8 is an OK match with Rocky
  • Figure out how to enforce all this via puppet (a pinning sketch follows this list)
  • Maybe file a packaging bug if it turns out that dpkg dependencies are responsible for this issue
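
On the enforcement point, a low-tech stopgap until puppet handles it could be an apt pin (a sketch only; the file name is made up, and puppet would presumably manage the file):

# /etc/apt/preferences.d/qemu-pin (hypothetical file)
# Hold the qemu stack at the stretch 1:2.8 builds so a routine
# apt upgrade can't drag the backports 1:2.12 packages back in.
Package: qemu*
Pin: version 1:2.8+dfsg-6+deb9u*
Pin-Priority: 1001

A priority above 1000 also allows apt to downgrade already-installed packages to the pinned version.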

Just a note that changing /etc/libvirt/qemu.conf to be explicit about user/group made no difference on the later version of the package, and neither did ldconfig. I cannot get the newer packages from backports to do anything but complain about MTU capabilities, which is a compile-time option AFAICT.

Mentioned in SAL (#wikimedia-cloud) [2020-05-14T23:28:53Z] <bstorm_> downtimed cloudvirt1004/6 and cloudvirt-wdqs1003 until tomorrow around this time T252831

The fix is not to /run/ qemu version 1:2.8, but rather to have ever installed it. This, for example, fixed cloudvirt-wdqs1001:

root@cloudvirt-wdqs1001:~# apt-get install qemu-kvm=1:2.8+dfsg-6+deb9u9 qemu=1:2.8+dfsg-6+deb9u9 qemu-system-x86=1:2.8+dfsg-6+deb9u9 qemu-block-extra=1:2.8+dfsg-6+deb9u9
root@cloudvirt-wdqs1001:~# apt-get install qemu-kvm qemu qemu-system-x86 qemu-block-extra
root@cloudvirt-wdqs1001:~# dpkg --purge qemu

That puts the output of dpkg --list back exactly where it was before, yet VMs error out before the dance and work afterwards.

I thought the install/uninstall dance was doing something with kernel modules but diffing lsmod outputs didn't get me anywhere.
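
One theory worth checking (speculation on my part, not something established above): libvirt caches the capabilities it probes from each qemu binary and decides whether that cache is stale based on the binary's timestamps, so the install/uninstall dance may simply be forcing a fresh probe. The cache is inspectable (path per libvirt 3.x on stretch, if I have it right):

root@cloudvirt-wdqs1001:~# ls /var/cache/libvirt/qemu/capabilities/
root@cloudvirt-wdqs1001:~# grep -l host_mtu /var/cache/libvirt/qemu/capabilities/*.xml
root@cloudvirt-wdqs1001:~# rm /var/cache/libvirt/qemu/capabilities/*.xml && systemctl restart libvirtd

If a stale cache is the culprit, clearing it and restarting libvirtd should have the same effect as the package dance.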

This may represent something useful:

[bstorm@cloudvirt-wdqs1003]:~ $ dpkg -l | grep qemu-
ii  qemu-block-extra:amd64               1:2.12+dfsg-3+b1~bpo9+1                           amd64        extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm                             1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU Full virtualization on x86 hardware
ii  qemu-slof                            20161019+dfsg-1                                   all          Slimline Open Firmware -- QEMU PowerPC version
ii  qemu-system                          1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries
ii  qemu-system-arm                      1:2.8+dfsg-6+deb9u9                               amd64        QEMU full system emulation binaries (arm)
ii  qemu-system-common                   1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (common files)
ii  qemu-system-mips                     1:2.8+dfsg-6+deb9u9                               amd64        QEMU full system emulation binaries (mips)
ii  qemu-system-misc                     1:2.8+dfsg-6+deb9u9                               amd64        QEMU full system emulation binaries (miscellaneous)
ii  qemu-system-ppc                      1:2.8+dfsg-6+deb9u9                               amd64        QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc                    1:2.8+dfsg-6+deb9u9                               amd64        QEMU full system emulation binaries (sparc)
ii  qemu-system-x86                      1:2.8+dfsg-6+deb9u9                               amd64        QEMU full system emulation binaries (x86)
ii  qemu-utils                           1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU utilities
[bstorm@cloudvirt-wdqs1001]:~ $ dpkg -l | grep qemu
ii  ipxe-qemu                            1.0.0+git-20161027.b991c67-1                      all          PXE boot firmware - ROM images for qemu
ii  qemu-block-extra:amd64               1:2.12+dfsg-3+b1~bpo9+1                           amd64        extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm                             1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU Full virtualization on x86 hardware
ii  qemu-slof                            20161019+dfsg-1                                   all          Slimline Open Firmware -- QEMU PowerPC version
ii  qemu-system                          1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries
ii  qemu-system-arm                      1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (arm)
ii  qemu-system-common                   1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (common files)
ii  qemu-system-data                     1:2.12+dfsg-3+b1~bpo9+1                           all          QEMU full system emulation (data files)
ii  qemu-system-mips                     1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (mips)
ii  qemu-system-misc                     1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (miscellaneous)
ii  qemu-system-ppc                      1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (ppc)
ii  qemu-system-sparc                    1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (sparc)
ii  qemu-system-x86                      1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU full system emulation binaries (x86)
ii  qemu-user                            1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU user mode emulation binaries
ii  qemu-utils                           1:2.12+dfsg-3+b1~bpo9+1                           amd64        QEMU utilities

Upgrading the whole lot of those on cloudvirt-wdqs1003:
sudo apt-get install qemu-kvm qemu-system qemu-block-extra qemu-system-arm qemu-system-common qemu-system-mips qemu-system-misc qemu-system-ppc qemu-system-sparc qemu-system-x86

This just produced our familiar MTU error.

Just noticed that one thing installing qemu does is pull in qemu-user, which was left behind on cloudvirt-wdqs1001.

Installing qemu-user didn't help; checking a few more things.

OK, got cloudvirt-wdqs1003 working. I downgraded only qemu-kvm and qemu-block-extra before it started working, so one of those holds the key.

Below is the lsmod diff between not-working and working:
(left is working, right is not working)

2,20c2
< ebt_arp                16384  1
< ebt_among              16384  1
< ip6table_raw           16384  1
< nf_conntrack_ipv6      20480  7
< nf_defrag_ipv6         16384  1 nf_conntrack_ipv6
< xt_CT                  16384  6
< xt_mac                 16384  1
< xt_tcpudp              16384  17
< nf_conntrack_ipv4      16384  7
< nf_defrag_ipv4         16384  1 nf_conntrack_ipv4
< xt_comment             16384  42
< xt_physdev             16384  14
< xt_set                 16384  1
< xt_conntrack           16384  8
< nf_conntrack          114688  4 nf_conntrack_ipv6,nf_conntrack_ipv4,xt_CT,xt_conntrack
< ip_set_hash_net        32768  1
< ip_set                 45056  2 xt_set,ip_set_hash_net
< nfnetlink              16384  1 ip_set
< vhost_net              20480  1
---
> vhost_net              20480  0
24c6
< tun                    28672  3 vhost_net
---
> tun                    28672  1 vhost_net
26,27c8,9
< ebtable_nat            16384  1
< iptable_raw            16384  1
---
> ebtable_nat            16384  0
> iptable_raw            16384  0
30,32c12,14
< ip6table_filter        16384  1
< ip6_tables             28672  9 ip6table_filter,ip6table_raw
< iptable_filter         16384  1
---
> ip6table_filter        16384  0
> ip6_tables             28672  1 ip6table_filter
> iptable_filter         16384  0
47c29
< kvm_intel             200704  3
---
> kvm_intel             200704  0
53c35
< irqbypass              16384  4 kvm
---
> irqbypass              16384  1 kvm
89c71
< x_tables               36864  16 xt_comment,ebt_among,ip_tables,ebtables,iptable_filter,xt_set,xt_mac,xt_tcpudp,iptable_raw,ebt_arp,ip6table_filter,xt_CT,ip6table_raw,xt_physdev,xt_conntrack,ip6_tables
---
> x_tables               36864  6 ip_tables,ebtables,iptable_filter,iptable_raw,ip6table_filter,ip6_tables

There was no lsmod diff until I upgraded again and deployed a VM, so I'm not sure it means anything with respect to the package installs.

One concern: if the downgrade-then-upgrade dance is what fixes things, will a host at the current backports version survive a reboot?
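
A straightforward way to answer that would be to reboot one of the fixed hosts and re-run the earlier creation command against it, e.g. (cephtest-reboot-check is just a placeholder instance name):

root@cloudcontrol1003:~# source ~/novaenv.sh
root@cloudcontrol1003:~# OS_PROJECT_ID=testlabs openstack server create --flavor m1.small-ceph --image debian-10.0-buster --availability-zone host:cloudvirt-wdqs1003 --nic net-id=7425e328-560c-4f00-8e99-706f3fb90bb4 cephtest-reboot-check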

Mentioned in SAL (#wikimedia-cloud) [2020-05-15T18:44:20Z] <bstorm_> rebooting cloudvirt-wdqs1003 T252831

Fixed cloudvirt-wdqs1002 with:

root@cloudvirt-wdqs1002:~# apt-get install qemu-kvm=1:2.8+dfsg-6+deb9u9 qemu-block-extra=1:2.8+dfsg-6+deb9u9
root@cloudvirt-wdqs1002:~# apt-get install qemu-block-extra qemu-kvm qemu-system qemu-system-arm qemu-system-common qemu-system-mips qemu-system-misc qemu-system-ppc qemu-system-sparc qemu-system-x86

Andrew changed the task status from Open to Stalled. May 19 2020, 4:04 PM

This is resolved for existing nodes. I'm keeping this open as a reference, though, because we'll need to do the same song-and-dance for any future nodes that are moved to ceph.

Andrew triaged this task as Medium priority. May 19 2020, 4:04 PM
Andrew moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
Andrew claimed this task.

Everything that was going to move is moved.