
OpenStack silently fails to resize an Ephemeral volume
Closed, Declined · Public

Description

The WMCS integration project holds instances running Jenkins agents. We need more disk space, and via T340070 I requested a larger flavor raising the ephemeral disk space from 60 GB to 90 GB (from g3.cores8.ram24.disk20.ephemeral60.4xiops to g3.cores8.ram24.disk20.ephemeral90.4xiops).

I have changed the flavor for integration-agent-docker-1039 via Horizon. It instantly reboots the instance, and once rebooted the Linux kernel still sees sdb as a 60 GB disk:

[Thu Jun 29 17:27:03 2023] sd 2:0:0:1: [sdb] 125829120 512-byte logical blocks: (64.4 GB/60.0 GiB)

I tried a hard reboot and that does not change anything: the ephemeral disk is not resized.


Various links:

https://access.redhat.com/solutions/3081971#fn:1 is titled Instance cannot resize ephemeral disk in Red Hat OpenStack and is marked as "Solution in progress". It is behind a paywall, though, so there are not many details. I am guessing OpenStack is unable to resize ephemeral disk space.

I also found an old Nova bug which mentions issues resizing rbd disks, SRU: nova resize doesn't resize(extend) rbd disk files when using rbd disk backend, but that was marked fixed a while ago.


Maybe there are some traces on our backend which can help diagnose the root cause?

Is that doable using openstack resize, or maybe cinder <volume ID> <new GB size>? If we could manually resize the ephemeral disks, that would save us from having to rebuild and reprovision the whole fleet of instances, which is rather costly ;]
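
For reference, a minimal sketch of what the same operation would look like against the Nova API with openstacksdk instead of Horizon (illustrative only, not something we ran; it assumes a clouds.yaml entry named wmcs and reuses the flavor and instance names from this task):

# Illustrative sketch only: resize a server to a new flavor via openstacksdk.
# Assumes a clouds.yaml entry named "wmcs" with credentials for the project.
import openstack

conn = openstack.connect(cloud="wmcs")

server = conn.compute.find_server("integration-agent-docker-1039")
flavor = conn.compute.find_flavor("g3.cores8.ram24.disk20.ephemeral90.4xiops")

# Same resize operation Horizon triggers under the hood.
conn.compute.resize_server(server, flavor.id)

# Once the server reaches VERIFY_RESIZE, confirm (or revert) the resize.
conn.compute.wait_for_server(server, status="VERIFY_RESIZE")
conn.compute.confirm_server_resize(server)

Of course, if Nova itself cannot grow the ephemeral disk, this path would hit the same limitation as Horizon does.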

Errors found by @dcaro:

On the cloudvirt that runs the VM:

Jun 29 17:28:30 cloudvirt1047 nova-compute[1277452]: 2023-06-29 17:28:30.252 1277452 WARNING nova.virt.libvirt.imagecache [None req-512c7717-f075-46da-802e-4405adfbccac - - - - - -] ephemeral_60_40d1d2c ephemeral image was used by instance but no back files existing!

And the full stacktrace from Logstash:

[None req-4fbac892-c082-478c-a411-2195daf6b45c hashar integration - - default default]
Exception during message handling: nova.exception.ResizeError: Resize error: Unable to resize disk down.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 10596, in _error_out_instance_on_exception
    yield
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 5874, in _resize_instance
    disk_info = self.driver.migrate_disk_and_power_off(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 11377, in migrate_disk_and_power_off
    raise exception.InstanceFaultRollback(
nova.exception.InstanceFaultRollback: Instance rollback performed due to: Resize error: Unable to resize disk down.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 71, in wrapped
    _emit_versioned_exception_notification(
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 184, in decorated_function
    LOG.warning("Failed to revert task state for instance. "
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 155, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/utils.py", line 1439, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 212, in decorated_function
    compute_utils.add_instance_fault_from_exc(context,
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 201, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 5841, in resize_instance
    self._revert_allocation(context, instance, migration)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 5837, in resize_instance
    self._resize_instance(context, instance, image, migration,
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 5903, in _resize_instance
    self.compute_rpcapi.finish_resize(context, instance,
  File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 10609, in _error_out_instance_on_exception
    raise error.inner_exception
nova.exception.ResizeError: Resize error: Unable to resize disk down.

Event Timeline

Found this on logstash:

[None req-4fbac892-c082-478c-a411-2195daf6b45c hashar integration - - default default] Exception during message handling: nova.exception.ResizeError: Resize error: Unable to resize disk down.

(Same traceback as the one reproduced in the task description above.)

Found this on the cloudvirt that runs the VM:

Jun 29 17:28:30 cloudvirt1047 nova-compute[1277452]: 2023-06-29 17:28:30.252 1277452 WARNING nova.virt.libvirt.imagecache [None req-512c7717-f075-46da-802e-4405adfbccac - - - - - -] ephemeral_60_40d1d2c ephemeral image was used by instance but no back files existing!

That sounds more like the issue.

Log message is from https://github.com/openstack/nova/blob/master/nova/virt/libvirt/imagecache.py#L295-L296

I could not find any further logs from that action.

The code found by Bryan comes from https://github.com/openstack/nova/commit/f44700935ff4cab7a36cc06356aefcf4c9b48880 which has the commit message:

Include removal of ephemeral backing files in the image cache manager

If CONF.image_cache.remove_unused_base_images is True, the base and swap files are removed during the image cache manager's periodic task while the ephemeral backing files are never deleted.
This is a long standing bug and this patch proposes to remove the ephemeral backing files in the same way as for the swap files.

Maybe it creates the new ephemeral image, then the cache manager attempts to collect the old one, but since it is still in use that issues the warning.

The issue comes from the libvirt driver's migrate_disk_and_power_off:

  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 10596, in _error_out_instance_on_exception
    yield
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 5874, in _resize_instance
    disk_info = self.driver.migrate_disk_and_power_off(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 11377, in migrate_disk_and_power_off
    raise exception.InstanceFaultRollback(
nova.exception.InstanceFaultRollback: Instance rollback performed due to: Resize error: Unable to resize disk down.

And I don't know why it states Unable to resize disk down when the ephemeral should be bumped from 60G to 90G. Maybe that is because I tried to move back from 90G to 60G. I guess I should conduct a new test :)

I have created a T340825-flavor-migration instance with g3.cores8.ram24.disk20.ephemeral60.4xiops at 14:27 UTC. It is running on cloudvirt1028. The instance information shows the ephemeral disk is at 60 GB. Via Horizon I went to resize it to g3.cores8.ram24.disk20.ephemeral90.4xiops (and the modal window tells me the flavor has an ephemeral disk of 90 GB).

I got a popup stating "Success: Request for resizing of instance T340825-flavor-migration has been submitted." The instance is in a Resizing or Migrating state. The request id is req-0b8daf98-d76c-4135-92f9-c8a9c8c0a565. It is then in a Confirm or Revert Resize/Migrate state.

The instance has been restarted:

reboot   system boot  6.1.0-0.deb11.7- Mon Jul  3 14:32   still running
root     ttyS0                         Mon Jul  3 14:31 - down   (00:00)

But lsblk /dev/sdb still shows 60GB:

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb    8:16   0  60G  0 disk
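
As a cross-check of the kernel's view, independent of lsblk, the device size can be read straight from sysfs (a trivial sketch; sdb being the ephemeral disk on these instances):

# /sys/block/<dev>/size is the device size in 512-byte sectors,
# regardless of the logical block size.
from pathlib import Path

device = "sdb"  # the ephemeral disk on these instances
sectors = int(Path(f"/sys/block/{device}/size").read_text())
print(f"{device}: {sectors * 512 / 1024**3:.1f} GiB")  # still ~60.0 GiB here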

The 4 log entries for cloudvirt1028 in chronological order:

Jul 3, 2023 @ 14:31:58.786 nova-compute cloudvirt1028 WARNING eb7a2b14-bcf5-446d-81ac-ab949c72f9f4
nova.compute.manager
[instance: bf359b44-4c3a-488a-8d88-5696802400a8] Received unexpected event network-vif-plugged-6724172f-0375-4ca7-8701-edcbe2f9d531 for instance with vm_state resized and task_state None.

Jul 3, 2023 @ 14:31:53.907 nova-compute cloudvirt1028 WARNING 68ccf364-d3f1-4433-9cff-aa796acd4b0e
nova.compute.manager
[instance: bf359b44-4c3a-488a-8d88-5696802400a8] Received unexpected event network-vif-unplugged-6724172f-0375-4ca7-8701-edcbe2f9d531 for instance with vm_state active and task_state resize_migrated.

Jul 3, 2023 @ 14:34:44.263 nova-compute cloudvirt1028 WARNING -
nova.virt.libvirt.imagecache
[None req-11a52dbc-538c-4e11-9123-faf7bc33a31a - - - - - -] ephemeral_20_40d1d2c ephemeral image was used by instance but no back files existing!

Jul 3, 2023 @ 14:34:44.263 nova-compute cloudvirt1028 WARNING -
nova.virt.libvirt.imagecache
[None req-11a52dbc-538c-4e11-9123-faf7bc33a31a - - - - - -] swap_8192 swap image was used by instance but no back files existing!

The last two messages might be for a different instance, with the libvirt image cache analyzing all instances around. The ephemeral_20_40d1d2c comes from libvirt:

nova/virt/libvirt/driver.py:            fname = "ephemeral_%s_%s" % (ephemeral_gb, file_extension)
nova/virt/libvirt/driver.py:            fname = "ephemeral_%s_%s" % (eph['size'], file_extension)

The 20 would be a 20 GB ephemeral disk, but I created the instance with a 60 GB ephemeral flavor and then asked to resize it to 90 GB.
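
To make the naming concrete, a tiny worked example of that format string with the sizes seen in this task (treating 40d1d2c as an opaque suffix computed by nova):

# The cache key is just "ephemeral_<size in GB>_<suffix>"; 40d1d2c is whatever
# file_extension nova computed for our ephemeral filesystem.
file_extension = "40d1d2c"

for ephemeral_gb in (20, 60, 90):
    print("ephemeral_%s_%s" % (ephemeral_gb, file_extension))
# ephemeral_20_40d1d2c  <- the image cloudvirt1028 warned about
# ephemeral_60_40d1d2c  <- the image our instance actually uses
# ephemeral_90_40d1d2c  <- what a successful resize would presumably use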


From reading the code, the method essentially does a "move and resize", and the instance has indeed been moved to a new host: cloudvirt1033.

Jul 3, 2023 @ 14:31:53.671 nova-compute cloudvirt1033 WARNING 68ccf364-d3f1-4433-9cff-aa796acd4b0e
nova.compute.manager
[instance: bf359b44-4c3a-488a-8d88-5696802400a8] Received unexpected event network-vif-unplugged-6724172f-0375-4ca7-8701-edcbe2f9d531 for instance with vm_state active and task_state resize_migrated.

Jul 3, 2023 @ 14:31:58.774 nova-compute cloudvirt1033 WARNING eb7a2b14-bcf5-446d-81ac-ab949c72f9f4
nova.compute.manager
[instance: bf359b44-4c3a-488a-8d88-5696802400a8] Received unexpected event network-vif-plugged-6724172f-0375-4ca7-8701-edcbe2f9d531 for instance with vm_state resized and task_state None.

That has the task_state resize_migrated, and the instance bf359b44-4c3a-488a-8d88-5696802400a8 is the one that got migrated. It goes from task_state resize_migrated to None.

The libvirt cache manager indicates a 60G ephemeral image this time:

Jul 3, 2023 @ 14:38:22.827 nova-compute cloudvirt1033 WARNING -
nova.virt.libvirt.imagecache
[None req-a9a5bd78-edfa-4c30-b239-607167af5acc - - - - - -] ephemeral_60_40d1d2c ephemeral image was used by instance but no back files existing!

From reading the code the ephemeral_60_40d1d2c comes from:

error_images = self.used_ephemeral_images - self.back_ephemeral_images       
for error_image in error_images:                                             
    LOG.warning('%s ephemeral image was used by instance'                    
                ' but no back files existing!', error_image)

Where:

used_ephemeral_images: the list of images used by running instances
back_ephemeral_images: images found on disk
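
In other words, with the values observed on cloudvirt1033 the check boils down to (simplified to plain sets of file names):

# Simplified reproduction of the warning quoted above: the running instance
# references an ephemeral image for which no backing file exists on disk.
used_ephemeral_images = {"ephemeral_60_40d1d2c"}  # referenced by the running instance
back_ephemeral_images = set()                     # nothing matching found on the hypervisor

for error_image in used_ephemeral_images - back_ephemeral_images:
    print("%s ephemeral image was used by instance"
          " but no back files existing!" % error_image)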

It flags an error because the running instance has ephemeral_60_40d1d2c but it is not found on disk. So the resize is bugged, I guess? The instance moved to the new host still refers to the old disk from the old host.

| Host | Metadata | Disk file |
| --- | --- | --- |
| cloudvirt1028 | ephemeral_60_40d1d2c | ephemeral_60_40d1d2c |
| cloudvirt1033 | ephemeral_60_40d1d2c (wrong) | ????? |

And eventually I have found Launchpad bug 1558880, instance can not resize ephemeral from mitaka to stein and master, which links to the change that introduced a note in the Nova API doc, rendered at https://docs.openstack.org/api-ref/compute/?expanded=resize-server-resize-action-detail#resize-server-resize-action:

NOTE: There is a known limitation that ephemeral disks are not resized

An explanation from the bug:

The resize is done by the _create_image call.

For reference, the bug happened way before this in ComputeManager._finish_migration where we do:

       if old_instance_type_id != new_instance_type_id:
           instance_type = instance.get_flavor('new')
           self._set_instance_info(instance, instance_type)
           for key in ('root_gb', 'swap', 'ephemeral_gb'):
               if old_instance_type[key] != instance_type[key]:
                   resize_instance = True
                   break

The problem is that ephemeral disks are defined by BDMs, and not by `instance.ephemeral_gb`.
The above code updates `ephemeral_gb`, but not the BDM.
The libvirt driver is only looking in the BDM, so it doesn't see the resize.

BDM stands for Block Device Mapping. Thus I guess the mapping still has ephemeral_60_40d1d2c from the source host, and libvirt or nova does not bother recreating it (I guess because ephemeral_gb is already set, so it doesn't bother?).
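
A simplified sketch of that mismatch (illustrative data structures, not nova's real objects): after the resize the flavor advertises 90 GB of ephemeral space, but the BDM the libvirt driver actually consults still carries the old 60 GB entry, so nothing gets resized:

# Illustrative only: the flavor is updated by _finish_migration, but the
# block device mapping describing the ephemeral disk is not.
new_flavor = {"root_gb": 20, "swap": 8192, "ephemeral_gb": 90}

block_device_mapping = [
    {"source_type": "blank", "destination_type": "local",
     "device_name": "/dev/sdb", "volume_size": 60},  # still the old size
]

# The driver sizes the ephemeral disk from the BDM, not from ephemeral_gb,
# so the flavor bump is invisible to it.
for bdm in block_device_mapping:
    if bdm["source_type"] == "blank" and bdm["destination_type"] == "local":
        print("driver will keep a %d GB ephemeral disk" % bdm["volume_size"])  # 60, not 90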

Further down in the bug, a fix got merged which is in the version of OpenStack we are running (Zed), but it doesn't address our use case.

Anyway, that is an upstream bug. We can probably decline this task :\

I am pretty sure that could potentially be fixed up manually by deleting/resizing files / updating metadata somehow, but let's not bother investigating further.

The short story is that OpenStack is unable to resize an instance's ephemeral disk, due to https://bugs.launchpad.net/nova/+bug/1558880