
[ceph] Make sure rbd snapshots are being cleaned up
Closed, Resolved · Public

Description

We are quickly exhausting the space available on the ceph cluster:

[Screenshot: 20201218_11h29m27s_grim.png (498×914 px, 57 KB) — graph of Ceph cluster space usage]

grafana link

And it seems that we are not removing the snapshots taken for backups (or something similar).

Investigate and sort out if needed.
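The cleanup done in the timeline below boils down to listing each image's snapshots and purging the ones the backup process left behind. A minimal sketch of that logic, assuming a hypothetical `backup_` snapshot-name prefix and illustrative pool/image names (the actual naming scheme used by the backup scripts is not shown in this task):

```python
import json

# Sample data in the shape of `rbd snap ls <pool>/<image> --format json`
# (snapshot names and timestamps are made up for illustration).
SNAP_LS_JSON = """
[
  {"id": 10, "name": "backup_2020-12-01T03:00:00", "timestamp": "2020-12-01 03:00:00"},
  {"id": 11, "name": "backup_2020-12-15T03:00:00", "timestamp": "2020-12-15 03:00:00"},
  {"id": 12, "name": "backup_2020-12-18T03:00:00", "timestamp": "2020-12-18 03:00:00"}
]
"""

def dangling_snapshots(snap_ls_json: str, keep: int = 1, prefix: str = "backup_"):
    """Return the backup snapshots that can be removed, keeping only the
    newest `keep` of them: a "dangling" snapshot here is any backup
    snapshot older than the ones the backup chain still needs."""
    snaps = [s for s in json.loads(snap_ls_json) if s["name"].startswith(prefix)]
    snaps.sort(key=lambda s: s["timestamp"])
    return snaps[:-keep] if keep else snaps

for snap in dangling_snapshots(SNAP_LS_JSON):
    # In a real run this would execute: rbd snap rm <pool>/<image>@<name>
    print(f"rbd snap rm eqiad-glance/fc6fb78b@{snap['name']}")
```

Removing a snapshot frees the space its copy-on-write extents pin, which is why the purge below recovered ~12% of the cluster capacity on the first host alone.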

Event Timeline

dcaro triaged this task as High priority.Dec 18 2020, 10:31 AM

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T10:33:02Z] <dcaro> purging rbd snapshots for image fc6fb78b-4515-4dcc-8254-591b9fe01762 (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:21:04Z] <dcaro> removing dangling rbd snapshots (for backups on cloudvirt1024) (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:47:52Z] <dcaro> finished cleaning up the dangling snapshots from cloudvirt1024, freed ~12% of the capacity (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:51:34Z] <dcaro> removing dangling rbd snapshots (for backups on cloudvirt1023) (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:54:19Z] <dcaro> finished cleaning up the dangling snapshots from cloudvirt1023 (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:55:36Z] <dcaro> removing dangling rbd snapshots (for backups on cloudvirt1022) (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:56:57Z] <dcaro> finished cleaning up the dangling snapshots from cloudvirt1022 (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T16:58:44Z] <dcaro> removing dangling rbd snapshots (for backups on cloudvirt1021) (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T17:00:40Z] <dcaro> finished cleaning up the dangling snapshots from cloudvirt1021 (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T17:05:18Z] <dcaro> removing dangling rbd snapshots (for backups on cloudvirt1025) (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T17:06:19Z] <dcaro> finished cleaning up the dangling snapshots from cloudvirt1025 (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T17:08:57Z] <dcaro> removing dangling rbd snapshots (for backups on cloudvirt1026) (T270478)

Mentioned in SAL (#wikimedia-cloud) [2020-12-18T17:09:29Z] <dcaro> finished cleaning up the dangling snapshots from cloudvirt1026 (T270478)

Change 650535 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] [wmcs][backup] Add command to remove/print dangling snapshots

https://gerrit.wikimedia.org/r/650535

Change 650542 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] [wmcs][backup] Remove all temp files after usage

https://gerrit.wikimedia.org/r/650542

Mentioned in SAL (#wikimedia-cloud) [2020-12-22T15:30:05Z] <dcaro> cleaning up 6778 dangling snapshots for glance images in eqiad (T270478)

For some reason, cloudcontrol1003 is several orders of magnitude slower than cloudcontrol1005 at doing image backups... will check.

Change 654266 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] wmcs.backup: Add backup_image command

https://gerrit.wikimedia.org/r/654266

Change 650535 merged by David Caro:
[operations/puppet@production] wmcs.backup: Add command to remove/print dangling snapshots

https://gerrit.wikimedia.org/r/650535

Change 650542 merged by David Caro:
[operations/puppet@production] wmcs.backup: Remove all temp files after usage

https://gerrit.wikimedia.org/r/650542

Change 654266 merged by David Caro:
[operations/puppet@production] wmcs.backup: Add backup_image command

https://gerrit.wikimedia.org/r/654266

Change 654898 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] wmcs.backups: replaces the image script with new one

https://gerrit.wikimedia.org/r/654898

After checking backy2: even when the rbd diff is empty, it still writes the metadata for every block on the next backup, and that is what is slowing the backups down considerably. We are using a sqlite backend (local to the host), so those metadata writes hit the disk directly. I experimented with skipping the metadata updates, but it is not as simple as just deactivating them (though with them off a backup takes about a second). So I will probably instead skip the backup entirely when the diff is empty...
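The workaround described above (skip the backy2 run entirely when nothing changed) can be sketched as follows. The helper names are hypothetical; the real backy2 invocation and rbd plumbing are elided:

```python
import json

def diff_is_empty(rbd_diff_json: str) -> bool:
    """`rbd diff <image> --from-snap <snap> --format json` returns a list of
    changed extents; an empty list means no data changed since the last
    snapshot, so a new backup would only rewrite per-block metadata."""
    return len(json.loads(rbd_diff_json)) == 0

def should_backup(rbd_diff_json: str) -> bool:
    # Only pay the metadata-heavy backy2 cost when there is real data to back up.
    return not diff_is_empty(rbd_diff_json)

# An empty diff vs. one changed 4 MiB extent:
print(should_backup("[]"))                                                  # False
print(should_backup('[{"offset": 0, "length": 4194304, "exists": "true"}]'))  # True
```

This trades one cheap `rbd diff` call per image for avoiding a full metadata rewrite on images that have not changed, which is the common case for glance images.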

Hmmm, cloudcontrol1005 takes 30s to do a backup while cloudcontrol1003 takes 10min... checking the setup, both have RAID10 over 5 partitions, but cloudcontrol1003 seems to have spinning disks; that might be why the metadata updates are so slow.

Will make backups only on cloudcontrol1005, taking the pressure off the other, non-SSD hosts.

Change 654898 merged by David Caro:
[operations/puppet@production] wmcs.backups: replaces the image script with new one

https://gerrit.wikimedia.org/r/654898

Change 655095 had a related patch set uploaded (by David Caro; owner: David Caro):
[operations/puppet@production] wmcs.backup_glance_images: disable the backups on 1003 and 1004

https://gerrit.wikimedia.org/r/655095

Manually triggered the systemd timer on cloudcontrol1005; everything is OK. I will leave this task open to check next week whether there are any leaked snapshots.

Mentioned in SAL (#wikimedia-cloud) [2021-01-11T08:39:48Z] <dcaro> cleaning up dangling snapshots now that we have the new suffixed ones (T270478)

Mentioned in SAL (#wikimedia-cloud) [2021-01-11T09:19:00Z] <dcaro> cleaned up ~1800 snapshots, 109 remaining only, one for each host x image combination (plus some ephemeral ones while doing backups), closing the task (T270478)
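The final sanity check above (exactly one snapshot per host × image combination) can be sketched like this, assuming a hypothetical naming scheme where the "suffixed" snapshot name ends with the host that took the backup:

```python
from collections import Counter

def leaked_snapshots(snapshots):
    """snapshots: list of (image, snap_name) pairs, where the snapshot name
    is assumed (hypothetically) to end with the backing-up host. Returns the
    (image, host) pairs that appear more than once, i.e. leaked snapshots."""
    counts = Counter((image, name.rsplit("_", 1)[-1]) for image, name in snapshots)
    return {pair: n for pair, n in counts.items() if n > 1}

# Illustrative data: img-b has a duplicate snapshot from cloudvirt1021.
snaps = [
    ("img-a", "backup_cloudvirt1021"),
    ("img-a", "backup_cloudvirt1022"),
    ("img-b", "backup_cloudvirt1021"),
    ("img-b", "backup_cloudvirt1021"),
]
print(leaked_snapshots(snaps))  # {('img-b', 'cloudvirt1021'): 2}
```

With 109 snapshots for the full set of host × image combinations, this check returning an empty dict is what justified closing the task.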

Change 655095 merged by David Caro:
[operations/puppet@production] wmcs.backup_glance_images: disable the backups on 1003 and 1004

https://gerrit.wikimedia.org/r/655095