
Cloudvirt instances failing to start: Image has no associated data
Closed, Resolved, Public

Description

Most of the canary instances that failed did so with the following error:

dcaro@cloudcontrol1003:~$ for vm in $(sudo wmcs-openstack server list --project cloudvirt-canary --name 'canary' -f json  | jq '.[] | select(.Status|test("ERROR")) | .ID' --raw-output); do sudo wmcs-openstack server show "$vm" -f json | jq '{name,fault}'; done
...
{
  "name": "canary1022-01",
  "fault": {
    "code": 500,
    "created": "2021-02-22T11:13:24Z",
    "message": "Build of instance 4747a794-19fe-4536-ad81-3f6fee8a504e aborted: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: Image has no associated data",
    "details": "Traceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4025, in clone_fallback_to_fetch\n    backend.clone(context, disk_images['image_id'])\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 980, in clone\n    reason=reason)\nnova.exception.ImageUnacceptable: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: No image locations are accessible\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2437, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 3539, in spawn\n    block_device_info=block_device_info)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 3925, in _create_image\n    injection_info, fallback_from_host)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4033, in _create_and_inject_local_root\n    instance, size, fallback_from_host)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 9367, in _try_fetch_image_cache\n    trusted_certs=instance.trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 275, in cache\n    *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 940, in create_image\n    prepare_template(target=base, *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py\", line 328, in inner\n    return f(*args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 271, in fetch_func_sync\n    fetch_func(target=target, *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4027, in clone_fallback_to_fetch\n    libvirt_utils.fetch_image(*args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/utils.py\", line 420, in fetch_image\n    images.fetch_to_raw(context, image_id, target, trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/images.py\", line 153, in fetch_to_raw\n    fetch(context, image_href, path_tmp, trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/images.py\", line 144, in fetch\n    trusted_certs=trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/image/api.py\", line 199, in download\n    trusted_certs=trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/image/glance.py\", line 346, in download\n    reason='Image has no associated data')\nnova.exception.ImageUnacceptable: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: Image has no associated data\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2161, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2525, in _build_and_run_instance\n    reason=e.format_message())\nnova.exception.BuildAbortException: Build of instance 4747a794-19fe-4536-ad81-3f6fee8a504e aborted: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: Image has no associated data\n"
  }
}
...

Event Timeline

dcaro triaged this task as High priority. Feb 22 2021, 11:48 AM
dcaro created this task.

It seems that they lost ceph authentication:

root@cloudvirt-wdqs1003:~# rbd ls -l eqiad1-glance-images | wc
2021-02-22 11:55:21.456 7f82f114d0c0 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-02-22 11:55:21.456 7f82f114d0c0 -1 AuthRegistry(0x55f9851aa2a8) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
2021-02-22 11:55:21.456 7f82f114d0c0 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-02-22 11:55:21.456 7f82f114d0c0 -1 AuthRegistry(0x7ffd845470b8) no keyring found at /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,, disabling cephx
rbd: couldn't connect to the cluster!
rbd: listing images failed: (2) No such file or directory
      0       0       0
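
One way to narrow this down (a minimal sketch, assuming the hypervisor is supposed to talk to ceph with a dedicated client rather than client.admin; the client name below is a placeholder for whatever rbd_user in the nova config points at):

> ls -l /etc/ceph/*.keyring                              # which keyrings are actually on disk
> grep -r rbd_user /etc/nova/                            # which ceph client nova is configured to use
> rbd --id <rbd_user> ls -l eqiad1-glance-images | wc    # retry the listing with that client instead of admin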

Yep, the ones that are working have no problem accessing:

dcaro@cloudvirt1025:~$ sudo rbd ls -l eqiad1-glance-images | wc
151     608   15553

Looking into puppet...
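
If that keyring is meant to be distributed by puppet (an assumption on my side), a manual agent run on the affected cloudvirt should either bring it back or show that the resource is missing from the catalog:

> sudo puppet agent --test    # re-apply the catalog and watch for /etc/ceph/* resources in the output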

Hmm... it seems that the image has no data when downloaded from any of the nodes:

root@cloudvirt1035:~# glance image-download 6b67c8a1-6356-464d-a885-0576d7263e51 --file test.ovf
Image 6b67c8a1-6356-464d-a885-0576d7263e51 has no data.
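
A cross-check from the glance side (a sketch, run from a cloudcontrol host with the same wmcs-openstack wrapper as above; status, size and checksum are standard image fields):

> sudo wmcs-openstack image show 6b67c8a1-6356-464d-a885-0576d7263e51 -f json | jq '{status,size,checksum}'

If those look sane but the download still returns no data, glance has most likely lost track of the image's backing data rather than the metadata record itself.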

The image is in ceph, and it seems to have data on it:

dcaro@cloudcontrol1003:~$ sudo rbd ls -l eqiad1-glance-images | grep 6b67c8a1-6356-464d-a885-0576d7263e51
dcaro@cloudcontrol1003:~$ sudo rbd export --pool=eqiad1-glance-images 6b67c8a1-6356-464d-a885-0576d7263e51 6b67c8a1-6356-464d-a885-0576d7263e51.buster.img
Exporting image: 47% complete...

Yep, it definitely has some data in it:

dcaro@cloudcontrol1003:~$ ls -la 6b67c8a1-6356-464d-a885-0576d7263e51.buster.img
-rw-r--r-- 1 root root 20401094656 Feb 22 12:24 6b67c8a1-6356-464d-a885-0576d7263e51.buster.img
dcaro@cloudcontrol1003:~$ du -hs 6b67c8a1-6356-464d-a885-0576d7263e51.buster.img
3.0G    6b67c8a1-6356-464d-a885-0576d7263e51.buster.img
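
The gap between the ~20 GB size reported by ls and the 3.0G reported by du just means the exported file is sparse, which is expected for a raw image. A quick sanity check on the export (assuming qemu-img is installed on the host):

> qemu-img info 6b67c8a1-6356-464d-a885-0576d7263e51.buster.img    # should report "file format: raw" and a plausible virtual size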

Other images seem to download correctly, and they also look OK on the ceph side, so there must be something specific about this one...

root@cloudvirt1025:~# glance image-download ad7bee1a-a890-4fed-b851-6c02138683b0 --file test.iso
^C... terminating glance client

root@cloudvirt1025:~# ls -la test.iso
-rw-r--r-- 1 root root 2471428096 Feb 22 14:24 test.iso

Might just re-upload it and see if that works.

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T15:02:18Z] <dcaro> Re-uploaded the debian buster 10.0 image from rbd to glance, that worked, re-spawning all the broken instances (T275378)

This was actually the cause of all the other stuck VMs too. As a summary, the solution was:

  • Re-upload the image from rbd to glance again (from cloudcontrol1003.wikimedia.org):
> sudo rbd export --pool=eqiad1-glance-images 6b67c8a1-6356-464d-a885-0576d7263e51 debian-buster-amd64-$(date "+%Y%m%d").raw
> openstack image create --file  debian-buster-amd64-20210222.raw  --disk-format "raw" --container-format "ovf" --public "debian-10.0-buster"
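
A quick check afterwards that glance is happy with the new image (hedged; if the name is ambiguous because the old image is still around, use the new image's UUID instead):

> openstack image show "debian-10.0-buster" -f value -c status -c size    # expect status=active and a non-zero size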

Renamed the old image as deprecated following https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_images

Note that nova image-meta does not exist anymore, but the old image did not have the default or show properties set, so I did not have to delete them.
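
For reference, the rename itself can be done with openstack image set; the new name below is only illustrative, the actual convention is on the wikitech page above:

> openstack image set --name "debian-10.0-buster-deprecated-2021-02-22" 6b67c8a1-6356-464d-a885-0576d7263e51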

Then deleted and re-created all the canaries (giving some time in between for all the rabbitmq messages to get consumed), and the VMs started correctly this time.
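
A minimal sketch of that step per canary (not the exact commands used; the flavor and network values are placeholders for whatever the cloudvirt-canary project actually uses, and pinning the VM to its matching cloudvirt is omitted here):

> sudo wmcs-openstack server delete canary1022-01
> # give rabbitmq some time to consume the delete messages, then re-create
> sudo wmcs-openstack server create --flavor <canary-flavor> --image "debian-10.0-buster" --network <canary-network> canary1022-01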

Created a task to investigate the glance corruption: T275407