NFS backups should be done continuously from eqiad to codfw for all the volumes (Tools, Maps, Others).
Description
Details
Project | Branch | Lines +/- | Subject |
---|---|---|---|
operations/puppet | production | +31 -0 | labstore: add timers for backups |
operations/puppet | production | +146 -0 | Add cleanup-snapshots script |
Status | Assigned | Task |
---|---|---|
Resolved | yuvipanda | T105720 Labs team reliability goal for Q1 2015/16 |
Resolved | coren | T106474 Make continuous backups of NFS data to codfw |
Invalid | None | T106871 paramiko (python SSH implementation) needs older hashes for host authentication |
Event Timeline
Backup recovery steps are being documented at https://wikitech.wikimedia.org/wiki/NFS_Backups
@yuvipanda: We now have working on-demand backups. Once we have a script to manage cleanup of snapshots, we can automate this entirely. Do you have a preference for the retention policy? I was considering:
- clean any snapshot getting too full (as they will become worthless anyways)
- clean the oldest snapshots remaining until there is enough space for a full set.
If we do daily backups (the original plan) then the process is trivial; this simply needs to be done once before the next set of backups is started.
If we go with your idea of doing backups in a loop, then we'll need to be a little fancier about space management, as the smaller filesystems will generate several snapshots per day. That may mean variably sized snapshots and resizing, since we can't do terabyte-sized snapshots dozens of times per day.
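The retention policy proposed above could be sketched roughly as follows. This is only an illustration of the two rules, not the script from the patch: the `Snapshot` fields, the `snapshots_to_remove` name, and the 90% fullness threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    # All fields are hypothetical; a real script would read them from the
    # volume manager.
    name: str
    created: float        # creation time, seconds since the epoch
    size_bytes: int       # space allocated to the snapshot
    used_fraction: float  # how full the snapshot is, 0.0 to 1.0


def snapshots_to_remove(snapshots: List[Snapshot], free_bytes: int,
                        needed_bytes: int,
                        full_threshold: float = 0.9) -> List[Snapshot]:
    """Pick snapshots to delete before the next set of backups starts."""
    # Rule 1: drop any snapshot that is getting too full.
    doomed = [s for s in snapshots if s.used_fraction >= full_threshold]
    available = free_bytes + sum(s.size_bytes for s in doomed)

    # Rule 2: drop the oldest remaining snapshots until a full set fits.
    survivors = sorted(
        (s for s in snapshots if s.used_fraction < full_threshold),
        key=lambda s: s.created,
    )
    for snap in survivors:
        if available >= needed_bytes:
            break
        doomed.append(snap)
        available += snap.size_bytes
    return doomed
```

With daily backups this would run once before each backup set; the looped variant would need to call it per filesystem with a per-filesystem `needed_bytes`.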
Change 227462 had a related patch set uploaded (by coren):
Add manage-snapshots script
So remaining steps are:
- Find a way to monitor script failure
- Find a way to detect that the script hasn't run in X hours
- Make sure that the previous two work (by having them fail)
- Add systemd timers to run the scripts at schedules.
@coren says we can find out whether the script failed or succeeded, and when it last ran, from systemd itself. Now to write an nrpe check for it...
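A minimal sketch of what such a check might look like, assuming it is fed the output of `systemctl show -p Result -p ExecMainExitTimestamp <unit>` for the backup service. The function names and the age limit are hypothetical, and the timestamp parsing assumes the host reports times in UTC in the form `Tue 2015-08-11 12:00:01 UTC`.

```python
from datetime import datetime, timedelta

# Standard Nagios/NRPE exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_show(text):
    """Turn KEY=VALUE lines from `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_backup(show_output, now, max_age_hours):
    """Return (exit_code, message) for one backup service."""
    props = parse_show(show_output)
    result = props.get("Result")
    if result is None:
        return UNKNOWN, "UNKNOWN: no Result property in systemctl output"
    if result != "success":
        return CRITICAL, "CRITICAL: last run failed (%s)" % result

    # Assumed layout: weekday, date, time, timezone. Empty if never run.
    parts = props.get("ExecMainExitTimestamp", "").split()
    if len(parts) < 3:
        return CRITICAL, "CRITICAL: backup has never completed"
    finished = datetime.strptime(parts[1] + " " + parts[2],
                                 "%Y-%m-%d %H:%M:%S")
    if now - finished > timedelta(hours=max_age_hours):
        return CRITICAL, "CRITICAL: last run finished at %s" % finished
    return OK, "OK: last run finished at %s" % finished
```

A wrapper would run `systemctl show` via `subprocess`, print the message, and exit with the returned code. Keeping the decision logic separate from the `systemctl` call makes it easy to exercise the failure cases, which matches the step above about making the checks fail on purpose.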
Change 230569 had a related patch set uploaded (by coren):
labstore: add timers for backups