
We almost never actually free up space from deleted VMs
Open, High, Public

Description

It seems that once a backup of a VM has been created, Ceph declines to delete the image when the VM is deleted, because of the associated snapshots.

Some solutions:

  • Add a cleanup step to our daily backup jobs (probably easiest; a rough sketch follows this list)
  • Stop having backups (T289282) (even easier but not necessarily the wisest)
  • Stop persisting snapshots in between backup jobs (will probably increase backup storage considerably)
  • Hook nova or something so that we synchronously delete backups and snaps as part of VM deletion (hardest and most 'correct')
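For illustration: whichever option we pick, the actual deletion has to purge an image's snapshots before Ceph will let the image go, since removing an image that still has snapshots raises rbd.ImageHasSnapshots. Here is a minimal sketch of that step using the python-rbd bindings; the pool and image names are placeholders, not our real config:

import rados
import rbd

def purge_and_remove(ioctx, image_name):
    """Remove all snapshots of an RBD image, then the image itself."""
    with rbd.Image(ioctx, image_name) as img:
        for snap in list(img.list_snaps()):
            # Snapshots that back clones are protected and must be
            # unprotected before they can be removed.
            if img.is_protected_snap(snap['name']):
                img.unprotect_snap(snap['name'])
            img.remove_snap(snap['name'])
    rbd.RBD().remove(ioctx, image_name)

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('compute')             # placeholder pool name
    purge_and_remove(ioctx, '<instance-uuid>_disk')   # placeholder image name
finally:
    cluster.shutdown()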

Event Timeline

nskaggs triaged this task as High priority. Aug 27 2021, 2:36 PM

I was having trouble reproducing this, and then I remembered that VMs in 'testlabs' aren't backed up.

So... for experimentation purposes I've created some test VMs in the 'tools' project: andrew-delete-test-[1-5].tools.eqiad1.wikimedia.cloud

Once they've been backed up we can try deleting them and see what the logs look like.

Here's a blog post that seems to match what we're seeing:

https://heiterbiswolkig.blogs.nde.ag/2019/03/07/orphaned-instances-part-2/

Does it suggest a solution? Not really!

I just now deleted andrew-delete-test-1.tools.eqiad1.wikimedia.cloud and watched the nova-compute and libvirt logs... there's no complaining there at all even though the image was leaked.
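For anyone following along, checking whether a deleted instance's image is still hanging around could look something like the sketch below; the pool name is a placeholder, and it assumes nova's usual '<instance uuid>_disk' naming for RBD root disks:

import rados
import rbd

POOL = 'compute'                 # placeholder pool name
IMAGE = '<instance-uuid>_disk'   # nova's usual naming for an RBD root disk

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        with rbd.Image(ioctx, IMAGE, read_only=True) as img:
            snaps = [s['name'] for s in img.list_snaps()]
            print('image still present; snapshots: %s' % snaps)
    except rbd.ImageNotFound:
        print('image is gone, nothing leaked')
finally:
    cluster.shutdown()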

I'm starting to think that we may have to write a periodic cleanup job for this.

That does not seem horrible to me; it would also serve as a "backup" in case of a fat-finger delete.
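As a sketch of what such a job might do, here's a dry-run that lists images in the pool that no longer correspond to a live instance. It assumes the '<uuid>_disk' naming convention and uses openstacksdk to ask nova which instances still exist; the cloud and pool names are placeholders. Actual removal would then be the purge-snapshots-then-remove step sketched in the description.

import openstack   # openstacksdk, one way to ask nova which instances still exist
import rados
import rbd

POOL = 'compute'   # placeholder pool name
CLOUD = 'eqiad1'   # placeholder clouds.yaml entry

def live_instance_uuids():
    conn = openstack.connect(cloud=CLOUD)
    return {server.id for server in conn.compute.servers(all_projects=True)}

def find_leaked_images():
    live = live_instance_uuids()
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        for name in rbd.RBD().list(ioctx):
            # Only consider images that follow nova's "<uuid>_disk" naming.
            if not name.endswith('_disk'):
                continue
            if name[:-len('_disk')] not in live:
                print('leaked image: %s/%s' % (POOL, name))
    finally:
        cluster.shutdown()

if __name__ == '__main__':
    find_leaked_images()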

nova/storage/rbd_utils.py has clear error handling for this in a couple of places, for example:

except rbd.ImageHasSnapshots:
    LOG.error('image %(volume)s in pool %(pool)s has '
              'snapshots, failed to remove',
              {'volume': name, 'pool': self.pool})

And yet I can't find that message appearing even once in our logs. So maybe I'm looking at the wrong code.