
Figure out how to delete Glance images
Open, Medium, Public

Description

Just now I'm trying to delete an unused glance image and it seems to have snaps mounted all over the place.

Perhaps openstack/kvm isn't unmounting the base image when we delete a VM?

root@cloudcontrol1003:~# openstack image delete f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37
Failed to delete image with name or ID 'f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37': ConflictException: 409: Client Error for url: http://openstack.eqiad1.wikimediacloud.org:9292/v2/images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37, 409 Conflict: Image f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37 could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance.
Failed to delete 1 of 1 images.
root@cloudcontrol1003:~# rbd snap purge eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37
rbd: error removing snapshot(s) 'snap', which is protected - these must be unprotected with `rbd snap unprotect`.
Removing all snapshots: 0% complete...failed.
root@cloudcontrol1003:~# rbd snap unprotect eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap
2021-08-23T15:10:09.424+0000 7fb92b7fe700 -1 librbd::SnapshotUnprotectRequest: cannot unprotect: at least 10 child(ren) [09e13ba547f8e2,7c2c3446b30bca,8e8601b54d66b,8e862f78fe4308,9f8795bf4d84b1,bc5730a76bbb12,d8810b20de2f9c,e928e8afd3ffbb,f877e1df7019a6,fcb6328409e09b] in pool 'eqiad1-compute'
2021-08-23T15:10:09.424+0000 7fb92b7fe700 -1 librbd::SnapshotUnprotectRequest: encountered error: (16) Device or resource busy
2021-08-23T15:10:09.424+0000 7fb92b7fe700 -1 librbd::SnapshotUnprotectRequest: 0x562291277d10 should_complete_error: ret_val=-16
rbd: unprotecting snap failed: (16) Device or resource busy
2021-08-23T15:10:09.428+0000 7fb92b7fe700 -1 librbd::SnapshotUnprotectRequest: 0x562291277d10 should_complete_error: ret_val=-16
root@cloudcephmon1001:~# rbd --pool eqiad1-compute ls -l | grep f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37
0141d00a-027a-4733-ba3b-8530b434e8be_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
0141d00a-027a-4733-ba3b-8530b434e8be_disk@2021-08-16T15:12:35_cloudvirt1023   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
1afcd63e-5077-4858-a8b3-c0d566de0a8a_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
1afcd63e-5077-4858-a8b3-c0d566de0a8a_disk@2021-08-23T04:52:59_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
4fb61804-2818-4ca9-8568-b6976dca9d12_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
4fb61804-2818-4ca9-8568-b6976dca9d12_disk@2021-08-15T20:39:20_cloudvirt1021   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
65abf300-c1c8-408e-8738-083ceb1aef15_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
65abf300-c1c8-408e-8738-083ceb1aef15_disk@2021-08-16T05:36:51_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
6eb825b5-1f85-495a-a6ac-a9d018eabb0b_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
6eb825b5-1f85-495a-a6ac-a9d018eabb0b_disk@2021-06-17T05:23:00_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
7a60e48c-51d3-4df4-91eb-db026101eb98_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
7a60e48c-51d3-4df4-91eb-db026101eb98_disk@2021-08-16T05:40:06_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
863f578d-0e9e-4611-97b1-d38e05f8f3a2_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
863f578d-0e9e-4611-97b1-d38e05f8f3a2_disk@2021-08-23T04:52:08_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
9583389f-d204-4db4-afd8-d3268cdcc733_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
9583389f-d204-4db4-afd8-d3268cdcc733_disk@2021-08-16T05:39:38_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
abc825ce-ddf6-4303-a86a-d91e4c4aa193_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
abc825ce-ddf6-4303-a86a-d91e4c4aa193_disk@2021-08-23T04:52:34_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
fb348019-3d47-4d3a-bf22-1131ed29709d_disk                                     20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
fb348019-3d47-4d3a-bf22-1131ed29709d_disk@2021-08-16T05:37:09_cloudvirt1024   20 GiB  eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap    2
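
For reference, the clone children that block the unprotect can also be listed directly; something like the following should enumerate the ten children reported in the error above:

rbd children eqiad1-glance-images/f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37@snap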

Event Timeline

I have a theory that this all relates to backups. Our backup process persists a snapshot of a given VM's Ceph image -- I bet that when Nova wants to delete the image for a VM, it fails because there are pending snapshots of the image. After that, Nova no longer knows about the VM, so it never makes further attempts to clean things up, and we're leaking images all over the place.

I'll do a few tests to see if I can confirm this.
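As a rough sketch of that kind of test (the commands are standard openstack/rbd CLI, but the comparison logic here is an assumption about what a leak check would look like, not an existing tool): list the instance UUIDs Nova still knows about and compare them against the _disk images in the compute pool; anything left over is a leak candidate.

# All instance UUIDs Nova still knows about
openstack server list --all-projects -f value -c ID | sort > /tmp/nova-uuids
# All RBD disk images in the compute pool, with the _disk suffix stripped
rbd --pool eqiad1-compute ls | sed -n 's/_disk$//p' | sort > /tmp/rbd-uuids
# UUIDs present in Ceph but unknown to Nova
comm -13 /tmp/nova-uuids /tmp/rbd-uuids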

Change 714427 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Added wmcs-novastats-cephleaks script

https://gerrit.wikimedia.org/r/714427

The attached script suggests that once a VM has been deleted, we lose the ability to free up its space. I'm hoping @dcaro can confirm my suspicion, at which point I'll adjust that script to actually delete leaked files (and then we'll figure out how to stop leaking them in the first place).

Change 714427 merged by Andrew Bogott:

[operations/puppet@production] Added wmcs-novastats-cephleaks script

https://gerrit.wikimedia.org/r/714427

There are two tasks related to this:
T273723 [backups] Periodically cleanup non-handled backups
T273720 [ceph][rbd] Periodically cleanup dangling snapshots

There are already a couple of commands that implement some cleanup, though they were never set up as a cron job:

root@cloudvirt1025:~# wmcs-backup instances --help
...
    remove-dangling-snapshots
                        Get and remove the rbd snapshots that don't have a
                        backup, for each vm backed up in this host. Note that
                        if there's a snapshot for a vm that is not backed up
                        in this host it will not be checked.
...
    remove-unhandled-backups
                        Remove any backups that don't match any handled VM.
                        This might happen if a machine is now backed up in
                        another host or was deleted.

Especially the latter would help clean up any leftover backups. Note that it will only handle the snapshots that have the host you run it on as a suffix; for example, eqiad1-compute/863f578d-0e9e-4611-97b1-d38e05f8f3a2_disk@2021-08-23T04:52:08_cloudvirt1024 would only be cleaned up from cloudvirt1024 (see a noop run here: https://paste.toolforge.org/view/7d095f81). As it is right now, though, it will only remove the snapshots, not the images; see the sketch below.
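Since each host only handles its own snapshots, the cleanup would have to be run on every backup host; a rough sketch (the host list and the exact invocation are assumptions based on the help text above):

for host in cloudvirt1023 cloudvirt1024 cloudvirt1025; do
    # Each host only touches snapshots/backups suffixed with its own hostname
    ssh "$host" wmcs-backup instances remove-dangling-snapshots
    ssh "$host" wmcs-backup instances remove-unhandled-backups
done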

Anyhow, trying to reproduce this, I can get a similar behavior if I manually create a VM and then create a snapshot of its rbd volume: deleting the VM "succeeds" but deletes neither the rbd image nor the snapshot. It should be easy to extend this, though.
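A minimal reproduction along those lines looks something like this ($IMAGE, $FLAVOR and $INSTANCE_UUID are placeholders for the image, flavor and the new instance's UUID):

# Create a throwaway VM and snapshot its backing RBD image, the way the backups do
openstack server create --image "$IMAGE" --flavor "$FLAVOR" leak-test
rbd snap create eqiad1-compute/"${INSTANCE_UUID}"_disk@manual-test

# Deleting the VM reports success...
openstack server delete leak-test

# ...but the RBD image and its snapshot are still there
rbd --pool eqiad1-compute ls -l | grep "$INSTANCE_UUID"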

And you can't remove an image until the snapshots are purged:

root@cloudcephmon1001:~# rbd --pool eqiad1-compute rm  81010fc1-5ff2-458d-813f-29f8dedd3f5a_disk
Removing image: 0% complete...failed.
rbd: image has snapshots - these must be deleted with 'rbd snap purge' before the image can be removed.

root@cloudcephmon1001:~# rbd --pool eqiad1-compute snap purge 81010fc1-5ff2-458d-813f-29f8dedd3f5a_disk
Removing all snapshots: 100% complete...done.

root@cloudcephmon1001:~# rbd --pool eqiad1-compute rm  81010fc1-5ff2-458d-813f-29f8dedd3f5a_disk
Removing image: 100% complete...done.

The suspicious logs I've found so far (Kibana + journal on the cloudvirt) are:

Aug 24 09:17:39 cloudvirt1038 nova-compute[40719]: 2021-08-24T09:17:39.839+0000 7f69f1fec740 -1 asok(0x3834d70) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/guests/ceph-client.eqiad1-compute.40719.58950184.asok': (13) Permission denied

Those files are owned by the "libvirt-qemu" user and group, but I suspect the Python process was Nova, running as the nova user, which is part of the "libvirt" group but not "libvirt-qemu".
I don't know if that's related, but it might be something to look at.
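A quick way to sanity-check that theory on the cloudvirt (paths taken from the log line above; just a suggestion, not something already verified):

# Which groups is the nova user actually in?
id nova
# Who owns the admin-socket directory that the bind fails on?
ls -ld /var/run/ceph/guests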

I'm having trouble finding any more logs about it; I'll update if I find anything more.

That particular image is now gone, but only because I deleted all the related snapshots by hand. We need some kind of system for this (or we need to abandon backups).
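For the record, the by-hand cleanup boils down to something like the following sketch (only appropriate here because the child VMs are already gone from Nova, so their disks are pure leaks):

IMAGE=f99ec8f1-f7e7-4a8d-95c8-45f49a5fea37

# For every clone in eqiad1-compute, drop its snapshots and then the clone itself
for child in $(rbd children eqiad1-glance-images/${IMAGE}@snap); do
    rbd snap purge "$child"
    rbd rm "$child"
done

# With no children left, the glance snapshot can be unprotected and purged,
# and the image deleted from Glance
rbd snap unprotect eqiad1-glance-images/${IMAGE}@snap
rbd snap purge eqiad1-glance-images/${IMAGE}
openstack image delete ${IMAGE}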

nskaggs triaged this task as Medium priority. Aug 27 2021, 2:35 PM