
[cinder] [toolsdb] Deleting snapshot does not work
Closed, Resolved · Public · BUG REPORT

Description

While testing the procedure to create a new ToolsDB replica I encountered a problem where a Cinder snapshot cannot be deleted.

I created tools-db-2-snap, created a Cinder volume from it, then deleted that volume, and tried to delete the snapshot.

I tried both from Horizon and from the CLI:

sudo wmcs-openstack volume snapshot delete f46d30ca-e655-4892-91da-473ccb60bfd4

In both cases, nothing happens and the snapshot remains in status "available", according to both the OpenStack CLI and the rbd CLI:

root@cloudcontrol1005:~# wmcs-openstack volume snapshot show f46d30ca-e655-4892-91da-473ccb60bfd4
+--------------------------------------------+--------------------------------------+
| Field                                      | Value                                |
+--------------------------------------------+--------------------------------------+
| created_at                                 | 2024-02-06T13:17:12.000000           |
| description                                | T344717                              |
| group_snapshot_id                          | None                                 |
| id                                         | f46d30ca-e655-4892-91da-473ccb60bfd4 |
| name                                       | tools-db-2-snap                      |
| os-extended-snapshot-attributes:progress   | 100%                                 |
| os-extended-snapshot-attributes:project_id | tools                                |
| properties                                 |                                      |
| size                                       | 4000                                 |
| status                                     | available                            |
| updated_at                                 | 2024-02-07T17:08:28.000000           |
| user_id                                    | fnegri                               |
| volume_id                                  | e25dae8a-803a-4b62-aa0c-bdf6ff481869 |
+--------------------------------------------+--------------------------------------+
root@cloudcontrol1005:~# rbd snap ls eqiad1-cinder/volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869
SNAPID  NAME                                           SIZE     PROTECTED  TIMESTAMP
 77954  2024-02-05T09:36:47_cloudbackup2002            3.9 TiB             Mon Feb  5 09:36:54 2024
 78436  snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4  3.9 TiB  yes        Tue Feb  6 13:17:13 2024

Note that RBD indicates the snapshot is protected because it was used to create a child volume, even though that child volume is now gone.

root@cloudcontrol1005:~# rbd children eqiad1-cinder/volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869@snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4
# empty output
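
For context: at the RBD layer, a protected snapshot can only be removed after it is unprotected, and unprotecting fails while any clone still references the snapshot, including clones that only live in the RBD trash. A minimal sketch of the teardown sequence that has to succeed (standard rbd commands, using the IDs from above):

# unprotect fails while any clone of the snapshot still exists,
# including clones that were only moved to the RBD trash
rbd snap unprotect eqiad1-cinder/volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869@snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4
rbd snap rm eqiad1-cinder/volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869@snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4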

Event Timeline

Around the time I first tried to delete this snapshot (today around 14:45 UTC), we started having issues on the Ceph cluster: T334240: [cloudceph] Slow operations - tracking task

This might or might not be related. One hypothesis is that deleting the snapshot triggers snaptrim operations in the cluster, which is a known possible cause of cluster slowness.
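
A quick way to check whether a snaptrim actually started (standard Ceph commands, run from an admin node; a sketch, not part of the original investigation):

# PGs currently trimming deleted snapshots, and those queued for trimming
ceph pg ls snaptrim
ceph pg ls snaptrim_wait
# overall cluster state
ceph -s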

fnegri changed the task status from Open to In Progress. Feb 7 2024, 6:42 PM
fnegri claimed this task.
fnegri triaged this task as High priority.
fnegri updated the task description.
fnegri moved this task from Backlog to ToolsDB on the Data-Services board.
fnegri added a project: Cloud-VPS.
fnegri renamed this task from [toolsdb] Deleting snapshot does not work to [toolsdb] [cinder] [ceph] Deleting snapshot does not work. Feb 7 2024, 6:46 PM

I think it's likely related, though: it's quite possible that the snaptrim forces Ceph to increase the IO on sectors of the problematic disks, triggering the slow ops and cascading into the NFS VM disks hanging.

24 hours later, the snapshot has still not been deleted. So either the snaptrim is taking a really long time, or it never even started. I'm struggling to find any log confirming whether the snapshot deletion ever started.

I tried reproducing this on a small empty volume, and I could create and delete a snapshot without issues. So I think the issue only happens when snapshotting a big volume, or more likely a big volume with a high rate of changes: in that case, deleting the snapshot triggers a much larger snaptrim operation, because the volume contents have changed significantly since the snapshot was taken.
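
For reference, the small-volume test was roughly this sequence (a sketch with made-up names and a 1 GiB size; the exact commands may have differed):

openstack volume create --size 1 snap-test
openstack volume snapshot create --volume snap-test snap-test-snap
openstack volume create --snapshot snap-test-snap snap-test-child
openstack volume delete snap-test-child
# on the small, idle volume this completed without issues
openstack volume snapshot delete snap-test-snap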

From Logstash:

Delete snapshot failed, due to snapshot busy.: cinder.exception.SnapshotIsBusy: deleting snapshot snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4 that has dependent volumes

This is exactly the behavior described in this ancient bug: https://bugzilla.redhat.com/show_bug.cgi?id=1795959

I suspect that the current issue is that the volume that was created and deleted is still in the trash. I cannot empty the trash, though:

root@cloudcontrol1007:~# rbd trash purge --pool eqiad1-cinder
Removing images: 0% complete...failed.
rbd: some expired images could not be removed
Ensure that they are closed/unmapped, do not have snapshots (including trashed snapshots with linked clones), are not in a group and were moved to the trash successfully.
# rbd children -a --pool eqiad1-cinder volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869@snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4
eqiad1-cinder/volume-80faf64d-c623-4965-af7c-9ea96103f39f (trash 8626bb8cb96871)

# rbd trash restore --pool eqiad1-cinder 8626bb8cb96871

# rbd snap ls --pool eqiad1-cinder volume-80faf64d-c623-4965-af7c-9ea96103f39f
SNAPID  NAME                                 SIZE     PROTECTED  TIMESTAMP               
 78911  2024-02-07T12:34:54_cloudbackup2002  3.9 TiB             Wed Feb  7 12:34:59 2024

# rbd snap purge --pool eqiad1-cinder volume-80faf64d-c623-4965-af7c-9ea96103f39f
Removing all snapshots: 100% complete...done.

# # finally it deleted something! We found the end of the chain

# rbd rm --pool eqiad1-cinder volume-80faf64d-c623-4965-af7c-9ea96103f39f
Removing image: 100% complete...done.

# openstack volume snapshot delete f46d30ca-e655-4892-91da-473ccb60bfd4

# openstack volume snapshot show f46d30ca-e655-4892-91da-473ccb60bfd4
No snapshot with a name or ID of 'f46d30ca-e655-4892-91da-473ccb60bfd4' exists.

# # \o/

This is pretty clearly a bug with cinder or rbd, either with how things get moved to the trash, or with how the trash itself is managed.

I still can't actually empty the trash, so there are more cases like this. One quick fix for us might be to just disable the trash feature in rbd.

Change 999218 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add wmcs-empty-rbd-trash script

https://gerrit.wikimedia.org/r/999218
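
A minimal sketch of the kind of loop such a script might implement (not the actual patch; error handling omitted): restore each trashed image, purge its leftover snapshots, then delete it for real.

pool=eqiad1-cinder
# `rbd trash ls` prints "<image-id> <image-name>", one image per line
rbd trash ls --pool "$pool" | while read -r id name; do
  rbd trash restore --pool "$pool" "$id" || continue
  # snap purge fails if a snapshot is still protected (i.e. still has clones)
  rbd snap purge --pool "$pool" "$name"
  rbd rm --pool "$pool" "$name"
done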

I reproduced the problem again, and I now understand better what's happening:

  • I created a snapshot from the volume tools-db-1: snapshot-75312b57-c818-4164-a91a-6538d7a3e17d
  • I created a new volume tools-db-1-data-temp based on that snapshot: volume-32eb4f7c-f250-46c4-bcdc-cb7e5d2d12a2
  • I extracted the data from that volume following my procedure for a new toolsdb replica, which takes a few hours
  • backy2 found my temp volume and created a snapshot of it: 2024-02-14T07:00:13_cloudbackup2002
  • Once I finished extracting the data from it, I deleted my temp volume from Horizon
  • Cinder moved the volume to the trash (possibly as a fallback because it was unable to delete it?), but its backy2-created snapshot still exists
  • My initial snapshot snapshot-75312b57-c818-4164-a91a-6538d7a3e17d cannot be deleted because its child (volume-32eb4f7c-f250-46c4-bcdc-cb7e5d2d12a2) is still in the trash
  • volume-32eb4f7c-f250-46c4-bcdc-cb7e5d2d12a2 cannot be deleted because it has a snapshot (2024-02-14T07:00:13_cloudbackup2002)
  • I restored volume-32eb4f7c-f250-46c4-bcdc-cb7e5d2d12a2 from the trash, deleted the backy2-created snapshot, deleted the volume, and finally I could delete my initial snapshot

One way to unblock T344717 would be to disable backups for volumes with a name ending in "temp", e.g. by adding the regex "temp$" in modules/profile/templates/wmcs/backy2/wmcs_backup_volumes.yaml.erb.
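
A quick sanity check that the proposed pattern matches the temp volume names (plain bash, independent of wmcs-backup's own matching logic):

[[ "tools-db-1-data-temp" =~ temp$ ]] && echo "would be excluded"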

More generally, I wonder if all the volumes that are now in the rbd trash (rbd trash ls --all --long shows 123 expired volumes) are stuck because of a similar scenario, where backy2 created snapshots that were never removed. This problem seems to have started around August 2023; the oldest image in the trash is:

c696567d361282  volume-262bdfea-3875-48e3-a7de-a8912606caaf  USER    Tue Aug 29 14:34:13 2023  expired at Tue Aug 29 14:34:13 2023

Backy2 always persists one snapshot for each volume in order to do incremental backups. So as far as I can tell we will have this issue (or a similar one) with every volume that is ever backed up.
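
For context, this is the underlying pattern backy2 depends on: keep the snapshot from the previous run, diff the new snapshot against it, then rotate. A sketch in plain rbd terms ("volume-X" and the snapshot names are placeholders; backy2 adds its own bookkeeping on top):

rbd snap create eqiad1-cinder/volume-X@today
# export only the blocks that changed since the previous run's snapshot
rbd diff --from-snap yesterday eqiad1-cinder/volume-X@today --format json
# rotate: "today" must now persist until the next run
rbd snap rm eqiad1-cinder/volume-X@yesterday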

For VMs, I proposed this patch (https://review.opendev.org/c/openstack/nova/+/843228) but rather than hack our nova install locally we run a daily cron to clean things up. Fixes like that (either a cron, or a patch to cinder) could also work for the equivalent cinder issue. I'm at a loss for a 'correct' solution to this though, apart from giving up on incremental backups entirely.

Simplified scenario:

  • create a new volume A
  • wait for the wmcs-backup script to create a snapshot and a backup
  • delete volume A: Cinder cannot delete it because it has a snapshot, and moves it to the RBD trash
  • wmcs-backup should clean up the snapshot but it doesn't

This scenario is also causing an uncaught exception in wmcs-backup.

I think this happens because backup_cinder_volumes.timer and remove_dangling_cinder_snapshots.timer are both started at the same time (07:00).

This results in backup_cinder_volumes crashing and not creating backups for that day.
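
The overlapping schedule can be confirmed with standard systemd commands (a sketch; unit names as on this host):

systemctl list-timers backup_cinder_volumes.timer remove_dangling_cinder_snapshots.timer
systemctl cat remove_dangling_cinder_snapshots.timer | grep -i OnCalendar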

fnegri@cloudbackup2002:~$ journalctl --since "2024-02-22 07:00:00" -u remove_dangling_cinder_snapshots.service
[...]
Feb 22 07:00:00 cloudbackup2002 systemd[1]: Starting backup cinder volumes...
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,018] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,019] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,019] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,019] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,019] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,019] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,019] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,020] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: WARNING:[2024-02-22 07:00:04,020] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:04 cloudbackup2002 wmcs-backup[1817835]: INFO:[2024-02-22 07:00:04,021] Searching for unhandled backups...
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: INFO:[2024-02-22 07:00:05,202] Removing unhandled image backups:
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: Image: no_name(id:ab37537e-8b8d-4c2c-bfea-990cdf604e31)
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: Total Size: 23040MB
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: Backups:
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-13 12:35:58 expires:2024-02-13 12:35:58 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-14 08:14:35 expires:2024-02-14 08:14:35 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-15 07:00:21 expires:2024-02-15 07:00:21 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-16 20:03:41 expires:2024-02-16 20:03:41 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-17 07:45:18 expires:2024-02-17 07:45:18 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-18 07:19:30 expires:2024-02-18 07:19:30 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-19 07:24:19 expires:2024-02-19 07:24:19 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-20 07:00:20 expires:2024-02-20 07:00:20 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:05 cloudbackup2002 wmcs-backup[1817835]: FULL created:2024-02-21 07:07:39 expires:2024-02-21 07:07:39 VALID PROTECTED size:2560MB name:volume-a>
Feb 22 07:00:06 cloudbackup2002 wmcs-backup[1817971]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:08 cloudbackup2002 wmcs-backup[1818126]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:10 cloudbackup2002 wmcs-backup[1818176]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:12 cloudbackup2002 wmcs-backup[1818210]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:14 cloudbackup2002 wmcs-backup[1818280]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:16 cloudbackup2002 wmcs-backup[1819112]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:18 cloudbackup2002 wmcs-backup[1819372]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:20 cloudbackup2002 wmcs-backup[1819460]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:22 cloudbackup2002 wmcs-backup[1819508]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 07:00:23 cloudbackup2002 wmcs-backup[1817835]: INFO:[2024-02-22 07:00:23,594] Removed 9 backups.
Feb 22 07:00:23 cloudbackup2002 wmcs-backup[1817835]: INFO:[2024-02-22 07:00:23,595] Cleaning up leftover backy blocks (this frees the space)...
Feb 22 10:09:44 cloudbackup2002 wmcs-backup[1817835]: Unhandled image ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,586] Removing snapshot eqiad1-cinder/volume-5510b9a4-0498-4bcb-b044-0b0d80e99ada@sna>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,588] Snapshot eqiad1-cinder/volume-5510b9a4-0498-4bcb-b044-0b0d80e99ada@snapshot-c10>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Removing snapshot eqiad1-cinder/volume-77f6c073-f48d-4651-824e-f6a7e06bef9c@sna>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Snapshot eqiad1-cinder/volume-77f6c073-f48d-4651-824e-f6a7e06bef9c@snapshot-366>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Removing snapshot eqiad1-cinder/volume-77f6c073-f48d-4651-824e-f6a7e06bef9c@sna>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Snapshot eqiad1-cinder/volume-77f6c073-f48d-4651-824e-f6a7e06bef9c@snapshot-5d0>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Removing snapshot eqiad1-cinder/volume-9583ba88-b836-4233-8e51-6213d403f9c7@sna>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Snapshot eqiad1-cinder/volume-9583ba88-b836-4233-8e51-6213d403f9c7@snapshot-9d5>
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: INFO:[2024-02-22 10:11:13,589] Removed 0 snapshots, 0 failed.
Feb 22 10:11:13 cloudbackup2002 wmcs-backup[1843503]: None
Feb 22 10:11:13 cloudbackup2002 systemd[1]: remove_dangling_cinder_snapshots.service: Succeeded.
Feb 22 10:11:13 cloudbackup2002 systemd[1]: Finished backup cinder volumes.
Feb 22 10:11:13 cloudbackup2002 systemd[1]: remove_dangling_cinder_snapshots.service: Consumed 5min 25.925s CPU time.

fnegri@cloudbackup2002:~$ journalctl --since "2024-02-22 07:00:00" -u backup_cinder_volumes.service
[...]
Feb 22 07:00:00 cloudbackup2002 systemd[1]: Starting backup cinder volumes...
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,720] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,721] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,721] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,721] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,722] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,722] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,722] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,722] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 07:00:03 cloudbackup2002 wmcs-backup[1817832]: WARNING:[2024-02-22 07:00:03,722] Unable to find image with id ab37537e-8b8d-4c2c-bfea-990cdf604e31
Feb 22 08:10:08 cloudbackup2002 wmcs-backup[1828313]: rbd: error opening image volume-ab37537e-8b8d-4c2c-bfea-990cdf604e31: (2) No such file or directory
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1828333]: ERROR: [backy2.logging] 'Version 70389388-ca6c-11ee-9579-c3d729b71e83 not found.'
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1828333]: ERROR: [backy2.logging] Backy failed.
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: Traceback (most recent call last):
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/local/sbin/wmcs-backup", line 2116, in <module>
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: args.func()
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/local/sbin/wmcs-backup", line 1986, in <lambda>
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: get_current_volumes_state(from_cache=args.from_cache).delete_expired(noop=args.noop)
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/local/sbin/wmcs-backup", line 1326, in delete_expired
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: image_backup.delete_expired(noop=noop)
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/local/sbin/wmcs-backup", line 306, in delete_expired
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: self.remove_backup(backup=backup, noop=noop)
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/local/sbin/wmcs-backup", line 460, in remove_backup
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: backup.remove(pool=self.config.ceph_pool, noop=noop)
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/local/sbin/wmcs-backup", line 159, in remove
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: self.backup_entry.remove(noop=noop)
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 366, in remove
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: run_command(
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/lib/python3/dist-packages/rbd2backy2.py", line 27, in run_command
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: output = subprocess.check_output(args).decode("utf8")
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: File "/usr/lib/python3.9/subprocess.py", line 528, in run
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: raise CalledProcessError(retcode, process.args,
Feb 22 08:10:09 cloudbackup2002 wmcs-backup[1817832]: subprocess.CalledProcessError: Command '['/usr/bin/backy2', 'rm', '--force', '70389388-ca6c-11ee-9579-c3d729b7>
Feb 22 08:10:09 cloudbackup2002 systemd[1]: backup_cinder_volumes.service: Control process exited, code=exited, status=1/FAILURE
Feb 22 08:10:09 cloudbackup2002 systemd[1]: backup_cinder_volumes.service: Failed with result 'exit-code'.
Feb 22 08:10:09 cloudbackup2002 systemd[1]: Failed to start backup cinder volumes.
Feb 22 08:10:09 cloudbackup2002 systemd[1]: backup_cinder_volumes.service: Consumed 9min 57.533s CPU time.

Change 1006066 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] P:wmcs::backup_cinder_volumes: avoid race condition

https://gerrit.wikimedia.org/r/1006066

The patch above should fix the race condition, but it still won't clean up the snapshots.

To properly delete the dangling snapshots, we need to modify the logic in the ImageBackup.remove method, which currently tries to delete the RBD snapshot but ignores the RBD exception raised when the volume is in the trash.
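
A hedged sketch of how the cleanup could handle a trashed parent instead of swallowing the exception: restore the image, drop the backup snapshot, then hand the image back to the trash (all standard rbd subcommands; the variable names are mine):

rbd trash restore --pool eqiad1-cinder "$image_id"
rbd snap rm "eqiad1-cinder/${image_name}@${snap_name}"
# put the image back so deferred deletion can eventually finish
rbd trash mv --pool eqiad1-cinder "$image_name"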

Change 1006066 merged by FNegri:

[operations/puppet@production] P:wmcs::backup_cinder_volumes: avoid race condition

https://gerrit.wikimedia.org/r/1006066

There are potentially other edge cases or race conditions. I created a new test volume volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014 on 2024-02-23. Looking at the logs, a first full backup was taken on 2024-02-24, and differential backups (based on snapshots) were taken on 2024-02-25 and 2024-02-26. For some reason the snapshots were then cleaned up too soon, and full backups were taken again on 2024-02-27 and 2024-02-28.

fnegri@cloudbackup2002:~$ sudo journalctl -g "8f14e78f-8c95-4bf4-b84f-f702d78ca014"
-- Journal begins at Tue 2024-01-23 13:20:51 UTC, ends at Wed 2024-02-28 17:11:25 UTC. --
Feb 24 07:02:14 cloudbackup2002 wmcs-backup[2199364]: INFO:[2024-02-24 07:02:14,634] Backing up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014
Feb 24 07:02:14 cloudbackup2002 wmcs-backup[2199364]: INFO:[2024-02-24 07:02:14,636] Creating full backup of pool:eqiad1-cinder, image_name:volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014, snapshot_name:2024-02-24T07:02:14_cloudbackup2002
Feb 24 07:02:25 cloudbackup2002 wmcs-backup[2199364]: INFO:[2024-02-24 07:02:25,356] Cleaning up expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 24 07:02:25 cloudbackup2002 wmcs-backup[2199364]: INFO:[2024-02-24 07:02:25,356] Cleaned up 0 expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 24 07:02:25 cloudbackup2002 wmcs-backup[2199364]: INFO:[2024-02-24 07:02:25,356] Backed up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014(test-volume)
Feb 25 07:21:00 cloudbackup2002 wmcs-backup[2390943]: INFO:[2024-02-25 07:21:00,137] Backing up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014
Feb 25 07:21:03 cloudbackup2002 wmcs-backup[2390943]: INFO:[2024-02-25 07:21:03,312] Creating differential backup of pool:eqiad1-cinder image_name:volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014 from rbd snapshot eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-24T07:02:1>
Feb 25 07:21:16 cloudbackup2002 wmcs-backup[2390943]: INFO:[2024-02-25 07:21:16,481] Cleaning up expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 25 07:21:16 cloudbackup2002 wmcs-backup[2390943]: INFO:[2024-02-25 07:21:16,481] Cleaned up 0 expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 25 07:21:16 cloudbackup2002 wmcs-backup[2390943]: INFO:[2024-02-25 07:21:16,481] Backed up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014(test-volume)
Feb 26 07:00:42 cloudbackup2002 wmcs-backup[2580528]: INFO:[2024-02-26 07:00:42,126] Backing up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014
Feb 26 07:00:45 cloudbackup2002 wmcs-backup[2580528]: INFO:[2024-02-26 07:00:45,298] Creating differential backup of pool:eqiad1-cinder image_name:volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014 from rbd snapshot eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-25T07:21:0>
Feb 26 07:00:57 cloudbackup2002 wmcs-backup[2580528]: INFO:[2024-02-26 07:00:57,303] Cleaning up expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 26 07:00:57 cloudbackup2002 wmcs-backup[2580528]: INFO:[2024-02-26 07:00:57,303] Cleaned up 0 expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 26 07:00:57 cloudbackup2002 wmcs-backup[2580528]: INFO:[2024-02-26 07:00:57,303] Backed up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014(test-volume)
Feb 26 07:01:37 cloudbackup2002 wmcs-backup[2578584]: INFO:[2024-02-26 07:01:37,051] Removing snapshot eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-26T07:00:43_cloudbackup2002...
Feb 26 07:01:39 cloudbackup2002 wmcs-backup[2578584]: INFO:[2024-02-26 07:01:39,646]        Done, eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-26T07:00:43_cloudbackup2002 removed
Feb 27 07:00:38 cloudbackup2002 wmcs-backup[2763972]: INFO:[2024-02-27 07:00:38,495] Backing up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014
Feb 27 07:00:43 cloudbackup2002 wmcs-backup[2763972]: INFO:[2024-02-27 07:00:43,150] Creating full backup of pool:eqiad1-cinder, image_name:volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014, snapshot_name:2024-02-27T07:00:43_cloudbackup2002
Feb 27 07:00:53 cloudbackup2002 wmcs-backup[2763972]: INFO:[2024-02-27 07:00:53,085] Cleaning up expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 27 07:00:53 cloudbackup2002 wmcs-backup[2763972]: INFO:[2024-02-27 07:00:53,086] Cleaned up 0 expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 27 07:00:53 cloudbackup2002 wmcs-backup[2763972]: INFO:[2024-02-27 07:00:53,086] Backed up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014(test-volume)
Feb 27 07:01:33 cloudbackup2002 wmcs-backup[2762943]: INFO:[2024-02-27 07:01:33,667] Removing snapshot eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-27T07:00:43_cloudbackup2002...
Feb 27 07:01:37 cloudbackup2002 wmcs-backup[2762943]: INFO:[2024-02-27 07:01:37,069]        Done, eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-27T07:00:43_cloudbackup2002 removed
Feb 28 07:00:08 cloudbackup2002 wmcs-backup[2942241]: INFO:[2024-02-28 07:00:08,416] Backing up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014
Feb 28 07:00:14 cloudbackup2002 wmcs-backup[2942241]: INFO:[2024-02-28 07:00:14,643] Creating full backup of pool:eqiad1-cinder, image_name:volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014, snapshot_name:2024-02-28T07:00:14_cloudbackup2002
Feb 28 07:00:24 cloudbackup2002 wmcs-backup[2942241]: INFO:[2024-02-28 07:00:24,469] Cleaning up expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 28 07:00:24 cloudbackup2002 wmcs-backup[2942241]: INFO:[2024-02-28 07:00:24,470] Cleaned up 0 expired backups for image test-volume(8f14e78f-8c95-4bf4-b84f-f702d78ca014)
Feb 28 07:00:24 cloudbackup2002 wmcs-backup[2942241]: INFO:[2024-02-28 07:00:24,470] Backed up image 8f14e78f-8c95-4bf4-b84f-f702d78ca014(test-volume)
Feb 28 07:01:33 cloudbackup2002 wmcs-backup[2942454]: INFO:[2024-02-28 07:01:33,949] Removing snapshot eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-28T07:00:14_cloudbackup2002...
Feb 28 07:01:37 cloudbackup2002 wmcs-backup[2942454]: INFO:[2024-02-28 07:01:37,372]        Done, eqiad1-cinder/volume-8f14e78f-8c95-4bf4-b84f-f702d78ca014@2024-02-28T07:00:14_cloudbackup2002 removed

After looking more carefully at the logs above, what happened is that remove_dangling_cinder_snapshots.service removed a snapshot that backup_cinder_volumes.service had created only a minute earlier. This is again a race condition caused by both timers starting at the same time, and it should now be fixed by https://gerrit.wikimedia.org/r/1006066. Still, as @Andrew noted in Gerrit, we should probably combine the two scripts into one to avoid any risk of remove_dangling_cinder_snapshots starting while backup_cinder_volumes from the previous day is still running (it takes several hours to complete).
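
Until the two scripts are combined, one simple way to guarantee they never overlap would be a shared lock around both entry points, e.g. with flock(1) (a sketch; the lock path and the exact wmcs-backup invocations are hypothetical):

# wrap each systemd service's ExecStart command in the same lock
flock /var/lock/wmcs-backup-cinder.lock /usr/local/sbin/wmcs-backup <backup-subcommand>
flock /var/lock/wmcs-backup-cinder.lock /usr/local/sbin/wmcs-backup <cleanup-subcommand>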

Change 1007636 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [wmcs-backup] Don't backup temp toolsdb volumes

https://gerrit.wikimedia.org/r/1007636

I split the cleanup issue into a separate task: T358774: [wmcs-backup] Backup snapshots of deleted volumes are never cleaned up

The issue described in this task should be fixed by the simpler workaround in https://gerrit.wikimedia.org/r/1007636.

> we should probably combine the two scripts into one to avoid any risk of remove_dangling_cinder_snapshots starting while backup_cinder_volumes from the previous day is still running (it takes several hours to complete).

And I created another subtask for this: T358780: [wmcs-backup] Race condition between backup and cleanup timers

Change 1007636 merged by FNegri:

[operations/puppet@production] [wmcs-backup] Don't backup temp toolsdb volumes

https://gerrit.wikimedia.org/r/1007636

The patch above to exclude temp volumes from backup did not work, because of a bug in wmcs-backup: T359192: [wmcs-backup] exclude_volumes is matching on IDs instead of names

fnegri changed the task status from In Progress to Stalled. Mar 5 2024, 4:25 PM
fnegri renamed this task from [toolsdb] [cinder] [ceph] Deleting snapshot does not work to [cinder] Deleting snapshot does not work. Mar 26 2024, 2:00 PM
fnegri changed the task status from Stalled to In Progress.
fnegri lowered the priority of this task from High to Medium. Mar 26 2024, 2:04 PM

This is no longer a blocker for T344717, because the patch https://gerrit.wikimedia.org/r/c/1007636 is now excluding temp volumes from being backed up.

fnegri renamed this task from [cinder] Deleting snapshot does not work to [cinder] [toolsdb] Deleting snapshot does not work. Mar 26 2024, 2:10 PM

Reading the description again, this bug is in effect completely resolved: the snapshot created during the ToolsDB replica procedure can now be deleted.

The remaining work related to this issue is already tracked in the follow-up tasks T358774 and T358780.

Change #999218 merged by Andrew Bogott:

[operations/puppet@production] Add wmcs-empty-rbd-trash script

https://gerrit.wikimedia.org/r/999218