
We almost never actually free up space from deleted VMs
Open, High, Public

Description

It seems that once a backup has been created of a VM, Ceph refuses to delete the image when the VM is deleted because of its associated snapshots.

Some solutions:

  • Add a cleanup step to our daily backup jobs (probably easiest)
  • Stop having backups (T289282) (even easier but not necessarily the wisest)
  • Stop persisting snapshots in between backup jobs (will probably increase backup storage considerably)
  • Hook nova or something so that we synchronously delete backups and snaps as part of VM deletion (hardest and most 'correct')
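The first option amounts to diffing the rbd pool against nova's instance list. A minimal sketch of that detection step, assuming nova's usual `<instance-uuid>_disk` image naming (the function name and inputs are hypothetical, not what any existing script does):

```python
def find_leaked_images(rbd_image_names, live_instance_uuids):
    """Return rbd image names whose instance UUID no longer exists in nova.

    rbd_image_names: names from the VM disk pool (e.g. 'rbd ls eqiad1-compute')
    live_instance_uuids: UUIDs of instances nova still knows about
    """
    live = set(live_instance_uuids)
    leaked = []
    for name in rbd_image_names:
        if not name.endswith('_disk'):
            # skip anything that isn't an ephemeral VM disk image
            continue
        uuid = name[: -len('_disk')]
        if uuid not in live:
            leaked.append(name)
    return leaked
```

The actual deletion would then purge each leaked image's snapshots before removing it, since a plain remove is exactly what Ceph is refusing here.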

Event Timeline

I was having trouble reproducing this, and then I remembered that VMs in 'testlabs' aren't backed up.

So... for experimentation purposes I've created some test VMs in the 'tools' project: andrew-delete-test-[1-5].tools.eqiad1.wikimedia.cloud

Once they've been backed up we can try deleting them and see what the logs look like.

Here's a blog post that seems to match what we're seeing:

https://heiterbiswolkig.blogs.nde.ag/2019/03/07/orphaned-instances-part-2/

Does it suggest a solution? Not really!

I just now deleted andrew-delete-test-1.tools.eqiad1.wikimedia.cloud and watched the nova-compute and libvirt logs... there's no complaining there at all even though the image was leaked.

I'm starting to think that we may have to write a periodic cleanup job for this.

That does not seem horrible to me; it also serves as a "backup" if there is a fat-finger delete.
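One cheap way for a periodic job to hedge against fat-finger deletes is to only purge an image after it has shown up as leaked on two consecutive runs, so a freshly-deleted-by-mistake VM gets one full cycle of grace. A sketch of that policy (this two-pass rule is my assumption, not necessarily what any deployed script does):

```python
def safe_to_delete(previously_leaked, currently_leaked):
    """Return only the images flagged as leaked on both the previous
    and the current run; anything seen for the first time waits
    one more cycle before it becomes eligible for deletion."""
    return sorted(set(previously_leaked) & set(currently_leaked))
```

The previous run's list would be persisted to disk between invocations; images that reappear in nova in the meantime simply drop out of the intersection.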

nova/storage/rbd_utils.py has clear error handling for this in a couple of places, for example:

except rbd.ImageHasSnapshots:
    LOG.error('image %(volume)s in pool %(pool)s has '
              'snapshots, failed to remove',
              {'volume': name, 'pool': self.pool})

And yet I can't find that message appearing even once in our logs. So maybe I'm looking at the wrong code.

Change 887789 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-novastats-cephleaks.py: add 'delete' functionality

https://gerrit.wikimedia.org/r/887789

Change 887789 merged by Andrew Bogott:

[operations/puppet@production] wmcs-novastats-cephleaks.py: add 'delete' functionality

https://gerrit.wikimedia.org/r/887789

Change 888087 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check.

https://gerrit.wikimedia.org/r/888087

Change 888087 merged by Andrew Bogott:

[operations/puppet@production] wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check.

https://gerrit.wikimedia.org/r/888087

I ran a limited version of the attached script, deleting 110 out of 1100 stray images. It freed up around 200GB of space on the ceph cluster.

I want to wait a while and make REALLY sure I didn't delete any real VMs by accident, and then I'll run the script on everything else.

Change 891560 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps: purge leaked VM images, daily

https://gerrit.wikimedia.org/r/891560

Change 891560 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps: purge leaked VM images, daily

https://gerrit.wikimedia.org/r/891560