
cloudcontrol1005 - Check unit status of backup_cinder_volumes
Closed, Resolved
Public

Description

From https://alerts.wikimedia.org/?q=team%3Dwmcs:

Check unit status of backup_cinder_volumes
summary: CRITICAL: Status of the systemd unit backup_cinder_volumes
3 hours ago
instance: cloudcontrol1005
source: icinga
team: wmcs

Event Timeline

dcaro triaged this task as High priority. Feb 22 2022, 3:11 PM
dcaro created this task.
dcaro changed the task status from Open to In Progress. Feb 22 2022, 3:18 PM
dcaro moved this task from To refine to Doing on the User-dcaro board.

The alert has cleared, but the service is still failing silently (the unit reports success even though backups fail):

root@cloudcontrol1005:~# systemctl status backup_cinder_volumes.service
● backup_cinder_volumes.service - Backup select cinder volumes using wmcs-cinder-backup-manager.py
     Loaded: loaded (/lib/systemd/system/backup_cinder_volumes.service; static)
     Active: inactive (dead) since Wed 2022-02-23 10:34:45 UTC; 4h 3min ago
TriggeredBy: ● backup_cinder_volumes.timer
   Main PID: 3615612 (code=exited, status=0/SUCCESS)

Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self._cs_request(url, 'POST', **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 206, in _cs_request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     return self.request(url, method, **kwargs)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:   File "/usr/lib/python3/dist-packages/cinderclient/client.py", line 192, in request
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]:     raise exceptions.from_response(resp, body)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3625025]: cinderclient.exceptions.OverLimit: SnapshotLimitExceeded: Maximum number of snapshots allowed (16) exceeded (HTTP 413) (Request-ID: req-e077f045-2c08-4e80-ad08-377d3037cc82)
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,630: WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:44 cloudcontrol1005 wmcs-cinder-backup-manager[3615612]: wmcs-cinder-backup-manager: 2022-02-23 10:34:44,631: INFO: Purging old backups of 8d687b46-03b8-4308-9b71-13704a664290
Feb 23 10:34:45 cloudcontrol1005 wmcs-cinder-backup-manager[3625062]: wmcs-cinder-volume-backup: 2022-02-23 10:34:45,558: INFO: Purged 0 backups
Feb 23 10:34:45 cloudcontrol1005 systemd[1]: backup_cinder_volumes.service: Succeeded.

Change 765475 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs-cinder-backup-manager: end with error if any backup failed

https://gerrit.wikimedia.org/r/765475
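
For illustration only, the idea in the patch title (exit with an error if any backup failed) could look roughly like the sketch below. This is not the content of the actual change: `backup_all`, `exit_status`, and the `backup_one` callable are hypothetical names, and the real script's structure may differ.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("backup-sketch")


def backup_all(
    volume_ids: Iterable[str], backup_one: Callable[[str], None]
) -> List[str]:
    """Try to back up every volume and return the IDs that failed.

    `backup_one` is a hypothetical callable standing in for whatever
    performs a single volume backup; it is expected to raise on failure.
    """
    failed: List[str] = []
    for volume_id in volume_ids:
        try:
            backup_one(volume_id)
        except Exception as error:
            # Keep going so one bad volume does not block the others,
            # but remember the failure instead of swallowing it.
            logger.warning("Failed to backup volume %s: %s", volume_id, error)
            failed.append(volume_id)
    return failed


def exit_status(failed: List[str]) -> int:
    """Return 1 if at least one backup failed, otherwise 0."""
    return 1 if failed else 0
```

A caller would end with something like `sys.exit(exit_status(failed))`, so that the process exits non-zero, systemd marks the unit as failed, and a check on the unit status can notice it.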

Let's try a more thorough approach. The script is still failing, and these are the volumes that fail:

root@cloudcontrol1005:~# journalctl -u backup_cinder_volumes.service -n 500 | grep 'Failed to backup' | awk '{print $14}' | sort | uniq
1c5e694b-7771-4fa2-825e-9cb8c317f31a
262bdfea-3875-48e3-a7de-a8912606caaf
39657ac0-e9bd-499b-84ee-342442b25122
3e6261a0-8a73-42bb-82ff-44038afea709
5751c82f-fc8f-4d4c-83ba-38ac37184f85
642eaa8f-1cc7-403a-8b1e-a06562acb583
7b037262-7214-4cef-a876-a55e26bc43be
87945019-b5eb-4463-827c-83d2f8eada17
8d687b46-03b8-4308-9b71-13704a664290
9cf36ce5-1812-4685-87d6-86857cfe4448
d1478efd-9fa6-4293-8389-e72459b794c0
dab15355-93ee-4baa-9a88-3f8e4f35b583
e97f4135-c19c-44f4-ae23-e2f04a047fc4
f572f3fc-0bc0-4ee7-98fa-fcd12b7d8c44
f8a19a5f-a872-494e-ba22-839192c2bffd
fc0e0b35-a27c-4d5a-a30b-2b1ada59560e
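
The `awk '{print $14}'` step above depends on the volume ID being the 14th whitespace-separated field of each journal line. As a hedged alternative, a small Python sketch can match the "Failed to backup volume" message directly; the function name `failed_volume_ids` is hypothetical, and it assumes the log wording stays as shown in the journal output earlier in this task.

```python
import re
from typing import Iterable, List

# Matches the warning message seen in the journal, capturing the volume UUID.
FAILED_RE = re.compile(
    r"Failed to backup volume "
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def failed_volume_ids(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated volume IDs that failed to back up."""
    ids = set()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            ids.add(match.group(1))
    return sorted(ids)
```

It could be fed from the journal with, for example, `journalctl -u backup_cinder_volumes.service -n 500` piped into a short wrapper that passes `sys.stdin` to `failed_volume_ids`.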

A run is starting right now and seems to be proceeding without failures; I will monitor it.

Change 765475 merged by David Caro:

[operations/puppet@production] wmcs-cinder-backup-manager: end with error if any backup failed

https://gerrit.wikimedia.org/r/765475

This last run worked \o/ and the patch to alert on errors is merged, so I'll close this.