
Cloudvirt instances failing to start
Closed, Resolved · Public

Description

After refreshing all the cloudcanaries, some of them (9 so far) failed to start. This is the parent task to triage
and fix all of them.

Current list of failing VMs:

dcaro@cloudcontrol1003:~$ sudo wmcs-openstack server list --project cloudvirt-canary --name 'canary' -f json  | jq '.[] | select(.Status|test("ERROR")) | .Name'
"canary-wdqs1003-01"
"canary-wdqs1002-01"
"canary-wdqs1001-01"
"canary1039-01"
"canary1037-01"
"canary1034-01"
"canary1022-01"
"canary1020-01"
"canary1019-01"

Event Timeline

dcaro triaged this task as High priority. Feb 22 2021, 11:34 AM
dcaro created this task.

There are two groups of errors:

"fault": {
  "code": 500,
  "created": "2021-02-22T11:12:47Z",
  "message": "[Errno 13] Permission denied: '/var/lib/nova/instances/locks/nova-612031ed4f080b36c1841be7b0a227577fcf7935'",
  "details": "Traceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2437, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 3539, in spawn\n    block_device_info=block_device_info)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 3925, in _create_image\n    injection_info, fallback_from_host)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4033, in _create_and_inject_local_root\n    instance, size, fallback_from_host)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 9367, in _try_fetch_image_cache\n    trusted_certs=instance.trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 275, in cache\n    *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 637, in create_image\n    prepare_template(target=base, *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py\", line 321, in inner\n    fair=fair):\n  File \"/usr/lib/python3.7/contextlib.py\", line 112, in __enter__\n    return next(self.gen)\n  File \"/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py\", line 269, in lock\n    ext_lock.acquire(delay=delay)\n  File \"/usr/lib/python3/dist-packages/fasteners/process_lock.py\", line 151, in acquire\n    self._do_open()\n  File \"/usr/lib/python3/dist-packages/fasteners/process_lock.py\", line 123, in _do_open\n    self.lockfile = open(self.path, 'a')\nPermissionError: [Errno 13] Permission denied: '/var/lib/nova/instances/locks/nova-612031ed4f080b36c1841be7b0a227577fcf7935'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2161, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2537, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\nnova.exception.RescheduledException: Build of instance 6db4b8e6-e750-4140-86b5-c2d329b13499 was re-scheduled: [Errno 13] Permission denied: '/var/lib/nova/instances/locks/nova-612031ed4f080b36c1841be7b0a227577fcf7935'\n"
}
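An Errno 13 on a file under /var/lib/nova/instances/locks usually means the lock file or the locks directory ended up owned by the wrong user (e.g. root instead of nova, left over from something run as root). A minimal check-and-fix sketch on the affected hypervisor (hypothetical commands, not necessarily what was run here):

# Check ownership of the oslo.concurrency lock files
ls -l /var/lib/nova/instances/locks/
# If any are root-owned, hand them back to the nova service user
# (nova:nova is the usual owner on Debian packaging)
sudo chown -R nova:nova /var/lib/nova/instances/locks/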

Affecting two hosts, and:

{
  "name": "canary1022-01",
  "fault": {
    "code": 500,
    "created": "2021-02-22T11:13:24Z",
    "message": "Build of instance 4747a794-19fe-4536-ad81-3f6fee8a504e aborted: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: Image has no associated data",
    "details": "Traceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4025, in clone_fallback_to_fetch\n    backend.clone(context, disk_images['image_id'])\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 980, in clone\n    reason=reason)\nnova.exception.ImageUnacceptable: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: No image locations are accessible\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2437, in _build_and_run_instance\n    block_device_info=block_device_info)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 3539, in spawn\n    block_device_info=block_device_info)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 3925, in _create_image\n    injection_info, fallback_from_host)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4033, in _create_and_inject_local_root\n    instance, size, fallback_from_host)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 9367, in _try_fetch_image_cache\n    trusted_certs=instance.trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 275, in cache\n    *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 940, in create_image\n    prepare_template(target=base, *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py\", line 328, in inner\n    return f(*args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/imagebackend.py\", line 271, in fetch_func_sync\n    fetch_func(target=target, *args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py\", line 4027, in clone_fallback_to_fetch\n    libvirt_utils.fetch_image(*args, **kwargs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/libvirt/utils.py\", line 420, in fetch_image\n    images.fetch_to_raw(context, image_id, target, trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/images.py\", line 153, in fetch_to_raw\n    fetch(context, image_href, path_tmp, trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/virt/images.py\", line 144, in fetch\n    trusted_certs=trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/image/api.py\", line 199, in download\n    trusted_certs=trusted_certs)\n  File \"/usr/lib/python3/dist-packages/nova/image/glance.py\", line 346, in download\n    reason='Image has no associated data')\nnova.exception.ImageUnacceptable: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: Image has no associated data\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2161, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File \"/usr/lib/python3/dist-packages/nova/compute/manager.py\", line 2525, in _build_and_run_instance\n    reason=e.format_message())\nnova.exception.BuildAbortException: Build of instance 4747a794-19fe-4536-ad81-3f6fee8a504e aborted: Image 6b67c8a1-6356-464d-a885-0576d7263e51 is unacceptable: Image has no associated data\n"
  }
}

Affecting the rest. I'll start with the latter.
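"Image has no associated data" means glance still has the image record but no bytes behind it. A quick way to confirm from a cloudcontrol host (plain openstack CLI; the jq filter is just for readability):

dcaro@cloudcontrol1003:~$ sudo wmcs-openstack image show 6b67c8a1-6356-464d-a885-0576d7263e51 -f json | jq '{name, status, size}'

If size comes back null or 0, the image data needs re-uploading (or the canaries pointed at a healthy image).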

Bstorm added a subscriber: Bstorm.

There are a couple of things still to clean up in this task. The CLI still shows failed VMs under cloudvirt1019 and 1020 (which should be easy to tidy up), and a stranger one for canary1028-01. That one shows an error in Horizon of

Message
[Errno 28] No space left on device

I'll poke that a little.

So other people don't have to check the logs, here's a long paste:

Feb 22, 2021 @ 16:47:10.595	nova-compute	cloudvirt1028	ERROR	5c715492-7147-4681-aab4-dd971db9d38a	nova.compute.manager	[instance: de419714-6d1f-4811-ae3b-0849aa600271] Failed to build and run instance: OSError: [Errno 28] No space left on device
...
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4027, in clone_fallback_to_fetch
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     libvirt_utils.fetch_image(*args, **kwargs)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/utils.py", line 420, in fetch_image
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     images.fetch_to_raw(context, image_id, target, trusted_certs)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/virt/images.py", line 153, in fetch_to_raw
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     fetch(context, image_href, path_tmp, trusted_certs)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/virt/images.py", line 144, in fetch
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     trusted_certs=trusted_certs)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/image/api.py", line 199, in download
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     trusted_certs=trusted_certs)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/image/glance.py", line 391, in download
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     {'path': dst_path, 'exception': ex})
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     self.force_reraise()
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     six.reraise(self.type_, self.value, self.tb)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     raise value
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]   File "/usr/lib/python3/dist-packages/nova/image/glance.py", line 378, in download
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271]     data.write(chunk)
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271] OSError: [Errno 28] No space left on device
2021-02-22 16:47:09.454 8949 ERROR nova.compute.manager [instance: de419714-6d1f-4811-ae3b-0849aa600271] 
	Feb 22, 2021 @ 16:47:08.605	nova-compute	cloudvirt1028	ERROR	5c715492-7147-4681-aab4-dd971db9d38a	nova.image.glance	Error writing to /var/lib/nova/instances/_base/7fdbfe934046e014ac6c5722f7d8bc1d12f2e43c.part: [Errno 28] No space left on device: OSError: [Errno 28] No space left on device

So the problem here is that it clones the image from glance to the main OS disk, it seems (/var/lib/nova/instances). That path is a bit full now:

[bstorm@cloudvirt1028]:~ $ df -h
Filesystem             Size  Used Avail Use% Mounted on
udev                   252G     0  252G   0% /dev
tmpfs                   51G  4.1G   47G   8% /run
/dev/sda1               84G   61G   19G  77% /
tmpfs                  252G     0  252G   0% /dev/shm
tmpfs                  5.0M     0  5.0M   0% /run/lock
tmpfs                  252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data  4.9T  5.2G  4.9T   1% /srv
tmpfs                   51G     0   51G   0% /run/user/18713
[bstorm@cloudvirt1028]:lib $ sudo du -sh * | sort -h
4.0K	ceph
4.0K	dhcp
4.0K	ebtables
...
58G	nova

It can't clone the full image to that path anymore to create another VM.
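For context, nova puts its image cache under the instances path, which by default sits on the root filesystem rather than the big /srv volume. A quick check on the hypervisor (standard nova option names; the defaults noted in the comments are from upstream docs):

# state_path defaults to /var/lib/nova, instances_path to $state_path/instances,
# which is why the _base cache lands on / instead of /dev/mapper/tank-data
sudo grep -E '^(state_path|instances_path)' /etc/nova/nova.conf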

That's one of our backup systems, so I'm imagining /srv is taken up with backups. If we have to keep lots of image clones on these things, our whole "thinvirt" idea might be a bit incorrect.

Yes, that's right. I confirmed on other cloudvirts that we have a roughly 20 GB image on the disk. I'll spawn a ticket for this.
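Besides relocating instances_path, nova's periodic image cache manager can prune unused _base images. A sketch of the relevant knobs (option names as in nova releases of that era; verify them against the deployed version before relying on them):

# Assumed nova.conf options controlling _base cache pruning:
#   image_cache_manager_interval = 2400          # how often the task runs (seconds)
#   remove_unused_base_images = True             # prune cached images with no users
#   remove_unused_original_minimum_age_seconds = 86400
sudo grep -E 'image_cache|remove_unused' /etc/nova/nova.conf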

Removed the canaries on cloudvirt1019-1020.

That cleans this up, now that the image issue is dealt with as well.