
cloudcontrol1003/4/5: Check for snapshots leaked by cinder backup agent
Closed, Resolved (Public)

Description

From https://alerts.wikimedia.org/?q=team%3Dwmcs:

Check for snapshots leaked by cinder backup agent

42 minutes ago
instance: cloudcontrol1004
42 minutes ago
instance: cloudcontrol1005
42 minutes ago
instance: cloudcontrol1003

summary: 4 snaps in the admin project
source: icinga
team: wmcs

Event Timeline

dcaro triaged this task as High priority. Feb 28 2022, 3:04 PM
dcaro created this task.

It seems that backup_cinder_volumes.service is failing to back up the maps volume 8d687b46-03b8-4308-9b71-13704a664290;
the backup has been timing out for the last three days:

dcaro@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service | grep 8d687b46-03b8-4308-9b71-13704a664290
Feb 26 17:13:22 cloudcontrol1005 wmcs-cinder-backup-manager[22131]: wmcs-cinder-backup-manager: 2022-02-26 17:13:22,858: INFO: Backing up ['8d687b46-03b8-4308-9b71-13704a664290'] in project maps
Feb 26 17:13:23 cloudcontrol1005 wmcs-cinder-backup-manager[741621]: wmcs-cinder-volume-backup: 2022-02-26 17:13:23,867: INFO: Backup up volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 27 12:30:26 cloudcontrol1005 wmcs-cinder-backup-manager[741621]: wmcs-cinder-volume-backup: 2022-02-27 12:30:26,189: WARNING: Timed out during backup of volume 8d687b46-03b8-4308-9b71-13704a664290 (maps) cleaing up...
Feb 27 12:30:36 cloudcontrol1005 wmcs-cinder-backup-manager[22131]: wmcs-cinder-backup-manager: 2022-02-27 12:30:36,643: WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 27 12:30:36 cloudcontrol1005 wmcs-cinder-backup-manager[22131]: wmcs-cinder-backup-manager: 2022-02-27 12:30:36,643: INFO: Purging old backups of 8d687b46-03b8-4308-9b71-13704a664290
Feb 27 18:29:15 cloudcontrol1005 wmcs-cinder-backup-manager[3019058]: wmcs-cinder-backup-manager: 2022-02-27 18:29:15,882: INFO: Backing up ['8d687b46-03b8-4308-9b71-13704a664290'] in project maps
Feb 27 18:29:16 cloudcontrol1005 wmcs-cinder-backup-manager[3732871]: wmcs-cinder-volume-backup: 2022-02-27 18:29:16,818: INFO: Backup up volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 28 13:44:26 cloudcontrol1005 wmcs-cinder-backup-manager[3732871]: wmcs-cinder-volume-backup: 2022-02-28 13:44:26,304: WARNING: Timed out during backup of volume 8d687b46-03b8-4308-9b71-13704a664290 (maps) cleaing up...
Feb 28 13:44:36 cloudcontrol1005 wmcs-cinder-backup-manager[3019058]: wmcs-cinder-backup-manager: 2022-02-28 13:44:36,740: WARNING: Failed to backup volume 8d687b46-03b8-4308-9b71-13704a664290
Feb 28 13:44:36 cloudcontrol1005 wmcs-cinder-backup-manager[3019058]: wmcs-cinder-backup-manager: 2022-02-28 13:44:36,740: INFO: Purging old backups of 8d687b46-03b8-4308-9b71-13704a664290
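
How long each attempt ran before timing out can be read off the log above. A minimal sketch, assuming the message format stays as shown; the lines are copied from the output above with the syslog prefix trimmed, and the names JOURNAL, STAMP and attempt_durations are made up for illustration:

```python
import re
from datetime import datetime

# Journal lines copied from the output above (syslog prefix trimmed for brevity).
JOURNAL = """\
wmcs-cinder-volume-backup: 2022-02-26 17:13:23,867: INFO: Backup up volume 8d687b46-03b8-4308-9b71-13704a664290
wmcs-cinder-volume-backup: 2022-02-27 12:30:26,189: WARNING: Timed out during backup of volume 8d687b46-03b8-4308-9b71-13704a664290 (maps) cleaing up...
wmcs-cinder-volume-backup: 2022-02-27 18:29:16,818: INFO: Backup up volume 8d687b46-03b8-4308-9b71-13704a664290
wmcs-cinder-volume-backup: 2022-02-28 13:44:26,304: WARNING: Timed out during backup of volume 8d687b46-03b8-4308-9b71-13704a664290 (maps) cleaing up...
"""

# Timestamp written by the script itself, then the log level, then the message.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}: (\w+): (.*)")


def attempt_durations(journal):
    """Pair each 'Backup up volume' line with the next 'Timed out' line."""
    durations = []
    started = None
    for line in journal.splitlines():
        match = STAMP.search(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(3)
        if message.startswith("Backup up volume"):
            started = when
        elif message.startswith("Timed out") and started is not None:
            durations.append(when - started)
            started = None
    return durations


for duration in attempt_durations(JOURNAL):
    print(duration)  # 19:17:03, then 19:15:10
```

Both attempts ran for a bit over 19 hours before being cut off.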

Now, I'm not sure why we did not get notified about the service itself failing.

Oh, we did not get notified because the backup run takes >24h, so the timer re-triggers the service right after it fails,
not giving nrpe a chance to notice that it failed.

dcaro@cloudcontrol1005:~$ grep OnCalendar /lib/systemd/system/backup_cinder_volumes.timer
OnCalendar=*-*-* 10:30:00
dcaro@cloudcontrol1005:~$ sudo journalctl -u backup_cinder_volumes.service | grep -B 1 -i started
Feb 27 12:30:37 cloudcontrol1005 systemd[1]: backup_cinder_volumes.service: Failed with result 'exit-code'.
Feb 27 12:30:37 cloudcontrol1005 systemd[1]: Started Backup select cinder volumes using wmcs-cinder-backup-manager.py.
--
Feb 28 13:44:37 cloudcontrol1005 systemd[1]: backup_cinder_volumes.service: Failed with result 'exit-code'.
Feb 28 13:44:37 cloudcontrol1005 systemd[1]: Started Backup select cinder volumes using wmcs-cinder-backup-manager.py.

^ the start time is slipping by >1h every day.
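
The arithmetic behind that, as a rough sketch: it assumes the two "Started" lines above mark back-to-back runs (so their difference is the length of one full run) and that the year is 2022; the helper name slack_hours is made up for illustration.

```python
from datetime import datetime, timedelta

# "Started" timestamps from the journalctl output above (year assumed: 2022).
starts = [
    datetime(2022, 2, 27, 12, 30, 37),
    datetime(2022, 2, 28, 13, 44, 37),
]

# Runs are back to back, so the gap between two starts is one full run.
run_length = starts[1] - starts[0]


def slack_hours(run_length, period):
    """Hours left between the end of a run and the next scheduled start.

    A negative value means the run overruns the timer period.
    """
    return (period - run_length).total_seconds() / 3600


print(run_length)                                           # 1 day, 1:14:00
print(f"{slack_hours(run_length, timedelta(hours=24)):.2f}")  # -1.23
print(f"{slack_hours(run_length, timedelta(hours=48)):.2f}")  # 22.77
```

Note that this run was cut short by the timeout on the maps volume, so a run in which every backup completes could take longer than the figure used here.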

Mentioned in SAL (#wikimedia-cloud) [2022-02-28T15:30:09Z] <dcaro> cleaning up leftover snapshots from failed backups of the maps volume (T302720)

Some possible fixes:

  • Run the timer every 48 hours and increase the single backup timeout
  • Improve the speed of the backups
    • Maybe parallelizing them (that still means increasing the single backup timeout)
    • Maybe figuring out why maps is so slow and fixing the bottleneck
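
The parallelizing option could look roughly like the sketch below. This is not how wmcs-cinder-backup-manager works today; backup_all and the backup_one callable are hypothetical stand-ins, and threads are assumed to be adequate because the work is mostly waiting on the backup service.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def backup_all(volume_ids, backup_one, max_workers=2):
    """Run backup_one(volume_id) for every volume, a few at a time.

    backup_one is a hypothetical callable that backs up a single volume and
    raises on failure or timeout. Returns the sorted ids of failed volumes.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its volume id so failures can be reported.
        futures = {pool.submit(backup_one, vol): vol for vol in volume_ids}
        for future in as_completed(futures):
            if future.exception() is not None:
                failed.append(futures[future])
    return sorted(failed)


def fake_backup(volume_id):
    # Stand-in used only to exercise the sketch: one volume always times out.
    if volume_id == "slow-volume":
        raise TimeoutError(volume_id)


print(backup_all(["vol-a", "slow-volume", "vol-b"], fake_backup))
```

With this shape a slow volume no longer delays the others, but its own backup still needs a timeout long enough to finish.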

Some improvements:

  • Make the timer run depending on the last time the service unit ran instead of at a specified time (that would allow detecting the failed service unit state)

Change 766816 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs-cinder-backups: Increase timeout and decrease frequency

https://gerrit.wikimedia.org/r/766816

Change 766816 merged by David Caro:

[operations/puppet@production] wmcs-cinder-backups: Increase timeout and decrease frequency

https://gerrit.wikimedia.org/r/766816

Cleaned up the snapshots, changed the timer to run ~every 48 hours, and increased the timeout for a single backup to 22h. That's the estimated time the longest backup will take; it would be nice to properly estimate the full run, but I have to move on to other things now.