
[wmcs-backup] Race condition between backup and cleanup timers
Open, Low | Public | BUG REPORT

Description

In cloudbackup2002 we run the following timers every 24 hours:

backup_cinder_volumes.timer
remove_dangling_cinder_snapshots.timer

They used to start at exactly the same time. I have now moved the cleanup timer to start a few hours earlier in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006066

A race condition is still possible because the backup job takes many hours to complete. With enough data to back up, it could take more than 24 hours, in which case it would still be running when the next cleanup starts.

It would be better to change the timers so that the cleanup job cannot start while the previous day's backup job is still running. One easy solution would be to run a single timer that runs the "cleanup" command first (it's relatively quick) and then the "backup" command.
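
As a rough illustration of the single-job idea, here is a minimal Python sketch of a wrapper that a single timer could invoke. It is not the existing wmcs-backup code: `CLEANUP_CMD`, `BACKUP_CMD`, `LOCK_PATH`, and `run_sequence` are hypothetical names, and the real command lines would need to be substituted. It takes a non-blocking exclusive lock so that a new run is skipped if the previous one is still in progress, then runs cleanup followed by backup.

```python
import fcntl
import subprocess
import sys
from typing import Sequence

# Hypothetical placeholders: the real cleanup and backup command lines are
# not given in this task, so substitute the actual ones.
CLEANUP_CMD = ["echo", "cleanup placeholder"]
BACKUP_CMD = ["echo", "backup placeholder"]
LOCK_PATH = "/tmp/wmcs-backup-example.lock"  # assumed location


def run_sequence(commands: Sequence[Sequence[str]], lock_path: str) -> bool:
    """Run the commands in order while holding an exclusive lock.

    Returns False without running anything if another run holds the lock.
    Raises subprocess.CalledProcessError if a command exits non-zero,
    which stops the sequence at that command.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        for cmd in commands:
            subprocess.run(cmd, check=True)
        # The lock is released when lock_file is closed.
        return True


if __name__ == "__main__":
    if not run_sequence([CLEANUP_CMD, BACKUP_CMD], LOCK_PATH):
        print("previous run still in progress, skipping", file=sys.stderr)
```

With `check=True`, a failed cleanup prevents the backup from starting. Whether that is desirable is a design choice; if the backup should run regardless, the cleanup failure could be logged and ignored instead.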