
We almost never actually free up space from deleted VMs
Open, High, Public

Description

It seems that once a backup has been created of a VM, Ceph refuses to delete the image when the VM is deleted because of its associated snapshots.

Some solutions:

  • Add a cleanup step to our daily backup jobs (probably easiest)
  • Stop having backups (T289282) (even easier but not necessarily the wisest)
  • Stop persisting snapshots in between backup jobs (will probably increase backup storage considerably)
  • Hook nova or something so that we synchronously delete backups and snaps as part of VM deletion (hardest and most 'correct')
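The first option amounts to diffing the rbd pool against nova's instance list. A minimal sketch of that detection step, assuming nova's usual `<instance-uuid>_disk` image naming (the function name and inputs are hypothetical, not what any existing script does):

```python
def find_leaked_images(rbd_image_names, live_instance_uuids):
    """Return rbd image names whose instance UUID no longer exists in nova.

    rbd_image_names: names from the VM disk pool (e.g. 'rbd ls eqiad1-compute')
    live_instance_uuids: UUIDs of instances nova still knows about
    """
    live = set(live_instance_uuids)
    leaked = []
    for name in rbd_image_names:
        if not name.endswith('_disk'):
            # skip anything that isn't an ephemeral VM disk image
            continue
        uuid = name[: -len('_disk')]
        if uuid not in live:
            leaked.append(name)
    return leaked
```

The actual deletion would then purge each leaked image's snapshots before removing it, since a plain remove is exactly what Ceph is refusing here.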

Event Timeline

I was having trouble reproducing this, and then I remembered that VMs in 'testlabs' aren't backed up.

So... for experimentation purposes I've created some test VMs in the 'tools' project: andrew-delete-test-[1-5].tools.eqiad1.wikimedia.cloud

Once they've been backed up we can try deleting them and see what the logs look like.

Here's a blog post that seems to match what we're seeing:

https://heiterbiswolkig.blogs.nde.ag/2019/03/07/orphaned-instances-part-2/

Does it suggest a solution? Not really!

I just now deleted andrew-delete-test-1.tools.eqiad1.wikimedia.cloud and watched the nova-compute and libvirt logs... there's no complaining there at all even though the image was leaked.

I'm starting to think that we may have to write a periodic cleanup job for this.

That does not seem horrible to me; it also serves as a "backup" if there is a fat-finger delete.
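One cheap way for a periodic job to hedge against fat-finger deletes is to only purge an image after it has shown up as leaked on two consecutive runs, so a freshly-deleted-by-mistake VM gets one full cycle of grace. A sketch of that policy (this two-pass rule is my assumption, not necessarily what any deployed script does):

```python
def safe_to_delete(previously_leaked, currently_leaked):
    """Return only the images flagged as leaked on both the previous
    and the current run; anything seen for the first time waits
    one more cycle before it becomes eligible for deletion."""
    return sorted(set(previously_leaked) & set(currently_leaked))
```

The previous run's list would be persisted to disk between invocations; images that reappear in nova in the meantime simply drop out of the intersection.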

nova/storage/rbd_utils.py has clear error handling for this in a couple of places, for example:

except rbd.ImageHasSnapshots:
    LOG.error('image %(volume)s in pool %(pool)s has '
              'snapshots, failed to remove',
              {'volume': name, 'pool': self.pool})

And yet I can't find that message appearing even once in our logs. So maybe I'm looking at the wrong code.

Change 887789 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-novastats-cephleaks.py: add 'delete' functionality

https://gerrit.wikimedia.org/r/887789

Change 887789 merged by Andrew Bogott:

[operations/puppet@production] wmcs-novastats-cephleaks.py: add 'delete' functionality

https://gerrit.wikimedia.org/r/887789

Change 888087 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check.

https://gerrit.wikimedia.org/r/888087

Change 888087 merged by Andrew Bogott:

[operations/puppet@production] wmcs-novastats-cephleaks.py: remove a broken (and unneeded) output check.

https://gerrit.wikimedia.org/r/888087

I ran a limited version of the attached script, deleting 110 out of 1100 stray images. It freed up around 200GB of space on the ceph cluster.

I want to wait a while and make REALLY sure I didn't delete any real VMs by accident, and then I'll run the script on everything else.

Change 891560 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps: purge leaked VM images, daily

https://gerrit.wikimedia.org/r/891560

Change 891560 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps: purge leaked VM images, daily

https://gerrit.wikimedia.org/r/891560