Make continuous backups of NFS data to codfw
Closed, Resolved (Public)

Description

Backups should run continuously from eqiad to codfw for all the volumes (Tools, Maps, Others).

yuvipanda updated the task description.
yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: coren, mark, Ricordisamoa and 2 others.

Backup recovery steps are being documented at https://wikitech.wikimedia.org/wiki/NFS_Backups

coren added a comment. Jul 27 2015, 2:50 PM

@yuvipanda: We now have working on-demand backups; pending a script to manage cleanup of snapshots, we could now automate this entirely. Do you have a preference for the retention policy? I was considering doing:

  • clean any snapshot getting too full (as they will become worthless anyways)
  • clean the oldest snapshots remaining until there is enough space for a full set.

If we do daily backups (the original plan) then the process is trivial; this simply needs to be done once before the next set of backups is started.

If we go with your idea of doing backups in a loop, then we'll need to be a little fancier about space management, as the smaller filesystems will generate several snapshots per day. That possibly includes having variably sized snapshots and resizing them, since we can't do terabyte-sized snapshots dozens of times per day.
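The retention policy proposed above can be sketched roughly as follows. This is only an illustration, not the merged script: the `Snapshot` record, the `plan_cleanup` function, and the 80% fill threshold are all hypothetical, and it assumes each snapshot has a fixed allocated size that is returned to the pool when the snapshot is removed.

```python
from collections import namedtuple

# Hypothetical record; field names are illustrative, not from the real script.
Snapshot = namedtuple("Snapshot", "name created size_bytes fill_percent")


def plan_cleanup(snapshots, free_bytes, needed_bytes, max_fill_percent=80.0):
    """Return (names_to_remove, resulting_free_bytes).

    Step 1: drop every snapshot that is too full to remain useful.
    Step 2: drop the oldest survivors until a full new set fits.
    The 80% default threshold is an assumption for this sketch.
    """
    to_remove = []
    survivors = []
    for snap in snapshots:
        if snap.fill_percent >= max_fill_percent:
            to_remove.append(snap.name)
            free_bytes += snap.size_bytes
        else:
            survivors.append(snap)

    # Oldest first; stop as soon as there is room for the next full set.
    for snap in sorted(survivors, key=lambda s: s.created):
        if free_bytes >= needed_bytes:
            break
        to_remove.append(snap.name)
        free_bytes += snap.size_bytes

    return to_remove, free_bytes
```

With daily backups this would run once before each new set. The looped variant would call it before every snapshot, with `needed_bytes` set per filesystem.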

coren moved this task from To Do to Doing on the Labs-Sprint-107 board. Jul 27 2015, 5:34 PM

Change 227462 had a related patch set uploaded (by coren):
Add manage-snapshots script

https://gerrit.wikimedia.org/r/227462

coren claimed this task. Jul 28 2015, 3:49 PM
coren moved this task from Doing to Code Review / Blocked on the Labs-Sprint-107 board.

Change 227462 merged by coren:
Add cleanup-snapshots script

https://gerrit.wikimedia.org/r/227462

So remaining steps are:

  • Find a way to detect that a script has failed
  • Find a way to detect that a script hasn't run in X hours
  • Make sure that the previous two work (by having them fail)
  • Add systemd timers to run the scripts on a schedule.

@coren says we can find out if the script failed or succeeded and the time from systemd itself. Now to write an nrpe check for it...
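A check along those lines might look like the sketch below. It is not the check that was deployed. It assumes the backup runs as a systemd service unit whose `Result` and `ExecMainExitTimestamp` properties can be read with `systemctl show` (worth verifying on the target systemd version), that the host clock is in UTC, and that the unit name passed to the hypothetical `run_check` helper is supplied by the caller.

```python
import subprocess
from datetime import datetime, timedelta, timezone

# Standard Nagios/NRPE plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_show_output(text):
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def parse_timestamp(stamp):
    """Parse e.g. 'Mon 2015-08-03 17:37:00 UTC'.

    Weekday and zone name are ignored; the host is assumed to run in UTC.
    """
    parts = stamp.split()
    return datetime.strptime(" ".join(parts[1:3]), "%Y-%m-%d %H:%M:%S")


def evaluate(props, now, max_age_hours):
    """Return (exit_code, message) for one unit's properties."""
    result = props.get("Result", "")
    if result != "success":
        return CRITICAL, "last run result: %s" % (result or "unknown")
    stamp = props.get("ExecMainExitTimestamp", "")
    if stamp in ("", "n/a"):
        return UNKNOWN, "no completed run recorded"
    try:
        age = now - parse_timestamp(stamp)
    except (ValueError, IndexError):
        return UNKNOWN, "unparseable timestamp: %s" % stamp
    if age > timedelta(hours=max_age_hours):
        return CRITICAL, "last run finished %s ago" % age
    return OK, "last run succeeded %s ago" % age


def run_check(unit, max_age_hours):
    """Query systemd for `unit` (name supplied by the caller) and evaluate it."""
    out = subprocess.check_output(
        ["systemctl", "show", unit,
         "--property=Result,ExecMainExitTimestamp"],
        universal_newlines=True,
    )
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    return evaluate(parse_show_output(out), now, max_age_hours)
```

An NRPE command would call `run_check`, print the message, and exit with the returned code. Keeping `evaluate` separate from the `systemctl` call makes the third step (forcing both failure modes) easy to exercise with canned property values.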

coren closed this task as Resolved. Aug 3 2015, 5:37 PM

Considered resolved since the reinstall is the validation (T107574)

coren reopened this task as Open. Aug 3 2015, 5:38 PM

Blah. Confused two tickets.

coren moved this task from To Do to Code Review / Blocked on the Labs-Sprint-108 board.

Change 230569 had a related patch set uploaded (by coren):
labstore: add timers for backups

https://gerrit.wikimedia.org/r/230569

Change 230569 merged by coren:
labstore: add timers for backups

https://gerrit.wikimedia.org/r/230569

coren closed this task as Resolved. Aug 12 2015, 1:19 PM

The backups, they are run.