Page MenuHomePhabricator

icinga alert: Check for snapshots leaked by cinder backup agent
Closed, ResolvedPublic

Description

Some errors in this service triggered an alert today:

Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]: wmcs-cinder-volume-backup: 2022-02-23 10:34:41,720: INFO: Cleaning up snapshot d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]: Traceback (most recent call last):
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 271, in <module>
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     backup.backup_volume()
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 191, in backup_volume
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     self.wait_for_resource_status(
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 70, in wait_for_resource_status
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     resource = refreshfunction(resource.id)
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/lib/python3/dist-packages/cinderclient/v2/volume_backups.py", line 78, in get
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     return self._get("/backups/%s" % backup_id, "backup")
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/lib/python3/dist-packages/cinderclient/base.py", line 293, in _get
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     resp, body = self.api.client.get(url)
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 215, in get
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     return self._cs_request(url, 'GET', **kwargs)
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 206, in _cs_request
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     return self.request(url, method, **kwargs)
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 192, in request
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]:     raise exceptions.from_response(resp, body)
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3624673]: cinderclient.exceptions.ClientException: Internal Server Error (HTTP 500)
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:41,926: WARNING: Failed to backup volume f8a19a5f-a872-494e-ba22-839192c2bffd
Feb 23 10:34:41 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:41,926: INFO: Purging old backups of f8a19a5f-a872-494e-ba22-839192c2bffd
Feb 23 10:34:42 cloudcontrol1005 wmcs-cinder-backup-manager[3624979]: wmcs-cinder-volume-backup: 2022-02-23 10:34:42,899: INFO: Purged 0 backups
Feb 23 10:34:43 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:43,279: INFO: Backing up ['8d687b46-03b8-4308-9b71-13704a664290'] in project maps
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: wmcs-cinder-volume-backup: 2022-02-23 10:34:44,182: INFO: Backup up volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: wmcs-cinder-volume-backup: 2022-02-23 10:34:44,382: INFO: Backup up from snapshot
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: Traceback (most recent call last):
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 271, in <module>
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     backup.backup_volume()
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 159, in backup_volume
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     snapjob_rec = self.cinderclient.volume_snapshots.create(
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/v3/volume_snapshots.py", line 107, in create
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self._create('/snapshots', body, 'snapshot')
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/base.py", line 302, in _create
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     resp, body = self.api.client.post(url, body=body)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 218, in post
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self._cs_request(url, 'POST', **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 206, in _cs_request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self.request(url, method, **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 192, in request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     raise exceptions.from_response(resp, body)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-e077f045-2c08-4e80-ad08-377d3037cc82)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,630: WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290

Event Timeline

Looking into the cinder-volume service logs on cloudcontrol1005, there's the error:

root@cloudcontrol1005:~# journalctl -u cinder-volume.service -f
...
Feb 23 10:32:29 cloudcontrol1005 cinder-volume[8578]: 2022-02-23 10:32:29.689 8578 ERROR oslo_messaging.rpc.server [req-3870c2c8-b4d0-4379-ac59-2105d87c27f0 novaadmin admin - - -] Exception during message handling: cinder.exception.SnapshotNotFound: Snapshot dc25b718-5900-4f0b-9197-819ce28db3f0 could not be found.
...

There's some turmoil of DB errors and python libs missing too, but those are from some time before.
Restarted the service, but not sure it helped.

Dangling volume snapshots in the admin project for cinder backups:

aborrero@cloudcontrol1005:~ 5s $ sudo wmcs-openstack volume snapshot list
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| ID                                   | Name                                                | Description | Status    | Size |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+
| d4aad7fb-97ed-4fa5-a06b-ae7f4b76feab | wmde-templates-alpha-nfs-2022-02-23T10:34:32.423757 | None        | available |   10 |
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598             | None        | available |   10 |
| e5c9d3ef-3d8a-40f5-90f0-900f1e87297a | wikidumpparse-nfs-2022-02-23T10:32:36.696177        | None        | available |  260 |
| 9d9aba32-9795-4d60-9d00-1005f5a19483 | proxy-03-backup-2022-02-23T10:32:08.152936          | None        | available |   10 |
| a4acc0c9-2a56-4bb4-bace-644a838a4922 | proxy-04-backup-2022-02-23T10:32:02.187232          | None        | available |   10 |
| 26ce6bea-6174-4960-9951-3ac8786cef96 | dumps-nfs-2022-02-23T10:31:14.228836                | None        | available |   80 |
| b33fde43-703d-4fea-a27b-90a77b6fc049 | twl-nfs-2022-02-23T09:30:51.449991                  | None        | available |  100 |
| 77e4b1dd-7115-44d9-8dc5-d10999fb1003 | testlabs-nfs-2022-02-23T09:30:42.998448             | None        | available |   40 |
| 0b02c50c-53f2-478e-8e2f-dc110b9972fb | quarry-nfs-2022-02-23T09:28:07.622987               | None        | available |  400 |
| 4716e085-6ebd-4da9-974d-0b891fab6d92 | proxy-04-backup-2022-02-23T09:27:52.369365          | None        | available |   10 |
| 2b347ed5-0dca-4495-8be7-8cd24efdea59 | huggle-nfs-2022-02-23T09:27:33.000022               | None        | available |   40 |
| 405b056c-530f-479c-9e2c-630248ae5c20 | dumps-nfs-2022-02-23T09:27:23.461385                | None        | available |   80 |
| 7f7676a4-c7b0-4dc2-8146-d76764afd6a8 | cvn-nfs-2022-02-23T09:27:14.921842                  | None        | available |    8 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037                  | None        | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328                 | None        | deleting  |   10 |
| 3d215281-4e22-40ce-852b-9555b7727f35 | quarry-nfs-2022-02-21T16:35:24.291820               | None        | available |  400 |
+--------------------------------------+-----------------------------------------------------+-------------+-----------+------+

Will try to manually clean them up.

Mentioned in SAL (#wikimedia-cloud) [2022-02-23T12:13:48Z] <arturo> cleaning up cinder volume snapshots, aborrero@cloudcontrol1005:~$ for i in $(sudo wmcs-openstack volume snapshot list -f value -c ID) ; do sudo wmcs-openstack volume snapshot delete $i ; done (T302382)

Some snapshots wont go away:

aborrero@cloudcontrol1005:~$ sudo wmcs-openstack volume snapshot list
+--------------------------------------+-----------------------------------------+-------------+-----------+------+
| ID                                   | Name                                    | Description | Status    | Size |
+--------------------------------------+-----------------------------------------+-------------+-----------+------+
| 4406f4ce-ca22-4f57-a8e5-8dff8cf32270 | wikilink-nfs-2022-02-23T10:34:01.855598 | None        | available |   10 |
| f4d18036-f2f9-4c3b-8dd8-39cff9081925 | scratch-2022-02-23T09:25:37.183037      | None        | available | 3072 |
| e6bf9c4c-a262-40e3-8beb-9c19545924e9 | utrs-nfs-2022-02-21T17:28:35.599328     | None        | deleting  |   10 |
+--------------------------------------+-----------------------------------------+-------------+-----------+------+

For at least one we have information as to why the snapshot is there:

Delete snapshot failed, due to snapshot busy.: cinder.exception.SnapshotIsBusy: deleting snapshot snapshot-f4d18036-f2f9-4c3b-8dd8-39cff9081925 that has dependent volumes

Actually this happened with a lot of the snapshots:

image.png (764×1 px, 209 KB)

So perhaps there is some kind of race condition somewhere between:

  • creating the snapshot
  • running the backup off it
  • deleting the snapshot

There were several database issues just before the snapshots started leaking

image.png (425×1 px, 248 KB)

So this is likely another symptom of T302146: Galera on cloudcontrol1004 going out of sync

dcaro changed the task status from Open to In Progress.Feb 23 2022, 2:16 PM
dcaro claimed this task.
dcaro moved this task from Today to Doing on the User-dcaro board.

The service that triggers the backups is silently failing (right now, it was alerting yesterday, see T302299):

root@cloudcontrol1005:~# systemctl status backup_cinder_volumes.service
● backup_cinder_volumes.service - Backup select cinder volumes using wmcs-cinder-backup-manager.py
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.service; static)
     Active: inactive (dead) since Wed 2022-02-23 10:34:45 UTC; 4h 3min ago
TriggeredBy: ● backup_cinder_volumes.timer
   Main PID: 3615612 (code=exited, status=0/SUCCESS)

Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self._cs_request(url, 'POST', **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 206, in _cs_request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self.request(url, method, **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 192, in request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     raise exceptions.from_response(resp, body)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-e077f045-2c08-4e80-ad08-377d3037cc82)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,630: WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,631: INFO: Purging old backups of 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:45 cloudcontrol1005 wmcs-cinder-backup-manager[3625062]: wmcs-cinder-volume-backup: 2022-02-23 10:34:45,558: INFO: Purged 0 backups
Feb 23 10:34:45 cloudcontrol1005 systemd[1]: backup_cinder_volumes.service: Succeeded.

Triggered the backups again, to see if with the current "stability" they would work, but keep getting the HTTP 500 error:

Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]: wmcs-cinder-volume-backup: 2022-02-23 14:47:24,834: INFO: Cleaning up snapshot 8312e871-1b7b-4762-9f4f-ac2ab07a0340
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]: Traceback (most recent call last):
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 271, in <module>
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     backup.backup_volume()
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 191, in backup_volume
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     self.wait_for_resource_status(
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/local/sbin/wmcs-cinder-volume-backup", line 70, in wait_for_resource_status
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     resource = refreshfunction(resource.id)
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/lib/python3/dist-packages/cinderclient/v2/volume_backups.py", line 78, in get
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     return self._get("/backups/%s" % backup_id, "backup")
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/lib/python3/dist-packages/cinderclient/base.py", line 293, in _get
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     resp, body = self.api.client.get(url)
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 215, in get
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     return self._cs_request(url, 'GET', **kwargs)
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 206, in _cs_request
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     return self.request(url, method, **kwargs)
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 192, in request
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]:     raise exceptions.from_response(resp, body)
Feb 23 14:47:24 cloudcontrol1005 wmcs-cinder-backup-manager[4118726]: cinderclient.exceptions.ClientException: Internal Server Error (HTTP 500)

So still looking

And something is happenning xd

root@cloudcontrol1005:~# wmcs-openstack volume snapshot list
Internal Server Error (HTTP 500)

I think this is mostly/all fallout from the galera sync issue (T302146). I'm cleaning up snapshots like this:

$ openstack volume snapshot delete XXXX

...and then when that doesn't work, looking for backups that are stuck and associated with the bad snapshots with

$ openstack volume backup list | grep -v available

and

$openstack volume backup delete --force <id>

And then another round of

$openstack volume snapshot delete

The final snap I was able to clean up by looking in logstash:

'Image eqiad1-cinder/volume-f0d22795-dc59-4951-b855-09691a4dad33 is dependent on the snapshot snapshot-8312e871-1b7b-4762-9f4f-ac2ab07a0340.'

I'm not sure what that dependent volume is about but I deleted it and the snap.

I'm re-running the backup jobs (minus the biggest volumes) and it seems fine. So closing this specific bug, even though cinder-backup will continue to need attention.