From https://alerts.wikimedia.org/?q=team%3Dwmcs:
Check unit status of backup_cinder_volumes
summary: CRITICAL: Status of the systemd unit backup_cinder_volumes
3 hours ago
instance: cloudcontrol1005
source: icinga
team: wmcs
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
wmcs-cinder-backup-manager: end with error if any backup failed | operations/puppet | production | +6 -0 |
Status | Subtype | Assigned | Task |
---|---|---|---|
Resolved | | dcaro | T302299 cloudcotrol1005 - Check unit status of backup_cinder_volumes |
Resolved | | dcaro | T302382 icinga alert: Check for snapshots leaked by cinder backup agent |
The alert has cleared, though the service is still silently failing:
root@cloudcontrol1005:~# systemctl status backup_cinder_volumes.service
● backup_cinder_volumes.service - Backup select cinder volumes using wmcs-cinder-backup-manager.py
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.service; static)
     Active: inactive (dead) since Wed 2022-02-23 10:34:45 UTC; 4h 3min ago
TriggeredBy: ● backup_cinder_volumes.timer
   Main PID: 3615612 (code=exited, status=0/SUCCESS)

Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self._cs_request(url, 'POST', **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 206, in _cs_request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self.request(url, method, **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 192, in request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     raise exceptions.from_response(resp, body)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-e077f045-2c08-4e80-ad08-377d3037cc82)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,630: WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,631: INFO: Purging old backups of 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:45 cloudcontrol1005 wmcs-cinder-backup-manager[3625062]: wmcs-cinder-volume-backup: 2022-02-23 10:34:45,558: INFO: Purged 0 backups
Feb 23 10:34:45 cloudcontrol1005 systemd[1]: backup_cinder_volumes.service: Succeeded.
Change 765475 had a related patch set uploaded (by David Caro; author: David Caro):
[operations/puppet@production] wmcs-cinder-backup-manager: end with error if any backup failed
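For context, the idea behind "end with error if any backup failed" can be sketched roughly as follows. This is not the actual patch: `backup_all`, `exit_code_for`, and the `backup_one` callable are hypothetical names used only for illustration. The point is that a per-volume failure is logged and the loop continues, but the failure is remembered so the process can finish with a non-zero exit status, which systemd then reports as a failed unit instead of `status=0/SUCCESS`.

```python
import logging
from typing import Callable, Iterable, List


def backup_all(volume_ids: Iterable[str], backup_one: Callable[[str], None]) -> List[str]:
    """Try to back up every volume and return the IDs that failed.

    `backup_one` is a hypothetical callable standing in for whatever
    actually performs a single volume backup.
    """
    failed: List[str] = []
    for volume_id in volume_ids:
        try:
            backup_one(volume_id)
        except Exception:
            # Keep going so one bad volume does not block the others,
            # but remember the failure for the final exit status.
            logging.warning("Failed to backup volume %s", volume_id)
            failed.append(volume_id)
    return failed


def exit_code_for(failed: List[str]) -> int:
    """Return 1 if any backup failed, 0 otherwise."""
    return 1 if failed else 0
```

A script built this way would end with something like `sys.exit(exit_code_for(failed))`, so the icinga unit-status check fires whenever at least one volume could not be backed up.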
Let's try a more thorough approach: the script is still failing, and these are the volumes that fail:
root@cloudcontrol1005:~# journalctl -u backup_cinder_volumes.service -n 500 | grep 'Failed to backup' | awk '{print $14}' | sort | uniq
1c5e694b-7771-4fa2-825e-9cb8c317f31a
262bdfea-3875-48e3-a7de-a8912606caaf
39657ac0-e9bd-499b-84ee-342442b25122
3e6261a0-8a73-42bb-82ff-44038afea709
5751c82f-fc8f-4d4c-83ba-38ac37184f85
642eaa8f-1cc7-403a-8b1e-a06562acb583
7b037262-7214-4cef-a876-a55e26bc43be
87945019-b5eb-4463-827c-83d2f8eada17
8d687b46-03b8-4308-9b71-13704a664290
9cf36ce5-1812-4685-87d6-86857cfe4448
d1478efd-9fa6-4293-8389-e72459b794c0
dab15355-93ee-4baa-9a88-3f8e4f35b583
e97f4135-c19c-44f4-ae23-e2f04a047fc4
f572f3fc-0bc0-4ee7-98fa-fcd12b7d8c44
f8a19a5f-a872-494e-ba22-839192c2bffd
fc0e0b35-a27c-4d5a-a30b-2b1ada59560e
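The same extraction can be done without relying on the field position (`$14`), which would break if the log format changed. A minimal sketch, assuming the journal lines have already been read into a list of strings; `failed_volume_ids` is a hypothetical helper, not part of any existing tool:

```python
import re
from typing import Iterable, List

# Matches the "Failed to backup volume <uuid>" message seen in the journal.
FAILED_RE = re.compile(
    r"Failed to backup volume ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def failed_volume_ids(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated volume IDs that failed to back up."""
    ids = set()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            ids.add(match.group(1))
    return sorted(ids)
```

It could be fed with the output of `journalctl -u backup_cinder_volumes.service -n 500`, for example via `sys.stdin`.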
There's a run starting right now; it seems to be proceeding without failures so far. I'll monitor that one.
Change 765475 merged by David Caro:
[operations/puppet@production] wmcs-cinder-backup-manager: end with error if any backup failed
This last run worked \o/, and the patch to alert on errors is merged, I'll close this.