
Make continuous backups of NFS data to codfw
Closed, Resolved (Public)

Description

Backups should run continuously from eqiad to codfw for all the volumes (Tools, Maps, Others).

Event Timeline

yuvipanda set the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: coren, mark, Ricordisamoa and 2 others.

@yuvipanda: We now have working on-demand backups. Once a script to manage cleanup of snapshots is in place, we can automate this entirely. Do you have a preference for the retention policy? I was considering:

  • clean up any snapshot that is getting too full (it will become worthless anyway)
  • clean up the oldest remaining snapshots until there is enough space for a full set.

If we do daily backups (the original plan) then the process is trivial; this simply needs to be done once before the next set of backups is started.

If we go with your idea of doing backups in a loop, then we'll need to be a little fancier about space management, as the smaller filesystems will generate several snapshots per day. That may include variably-sized snapshots and resizing, since we can't do terabyte-sized snapshots dozens of times per day.
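The two-step retention policy proposed above could be sketched roughly as follows. This is only an illustration and not the script from the Gerrit change: the `Snapshot` fields, the `full_threshold` default, and the assumption that deleting a snapshot frees exactly its allocated size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    name: str
    created: float        # creation time, seconds since the epoch
    size: int             # space allocated to the snapshot, in bytes
    used_fraction: float  # how full the snapshot is, 0.0 to 1.0


def snapshots_to_remove(snapshots, free_space, needed_space, full_threshold=0.8):
    """Return the names of snapshots to delete, in deletion order.

    free_space is the space currently available; needed_space is what a
    full set of new snapshots requires. Both are hypothetical inputs.
    """
    doomed = []
    survivors = []

    # Step 1: drop any snapshot that is getting too full.
    for snap in snapshots:
        if snap.used_fraction >= full_threshold:
            doomed.append(snap)
            free_space += snap.size  # assumption: deletion frees the allocation
        else:
            survivors.append(snap)

    # Step 2: drop the oldest remaining snapshots until a full set fits.
    for snap in sorted(survivors, key=lambda s: s.created):
        if free_space >= needed_space:
            break
        doomed.append(snap)
        free_space += snap.size

    return [s.name for s in doomed]
```

With daily backups, a function like this would only need to be called once before each new set of backups starts. The looped variant would need additional logic for snapshot sizing, which this sketch does not attempt.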

Change 227462 had a related patch set uploaded (by coren):
Add manage-snapshots script

https://gerrit.wikimedia.org/r/227462

Change 227462 merged by coren:
Add cleanup-snapshots script

https://gerrit.wikimedia.org/r/227462

So remaining steps are:

  • Find a way to monitor script failure
  • Find a way to detect that the script hasn't run in X hours
  • Make sure that the previous two work (by having them fail)
  • Add systemd timers to run the scripts on a schedule.

@coren says we can find out whether the script failed or succeeded, and when it last ran, from systemd itself. Now to write an nrpe check for it...
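A check along those lines might look like the sketch below. It is not the check that was eventually deployed. It assumes the backup runs as a systemd service unit, reads the `Result` and `ExecMainExitTimestamp` properties from `systemctl show`, and maps them to the usual Nagios exit codes. It also assumes the host reports timestamps in UTC; the function names and the age threshold are made up for illustration.

```python
import subprocess
import sys
from datetime import datetime, timezone

# Conventional Nagios/NRPE plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_show_output(text):
    """Turn the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def parse_timestamp(value):
    """Parse e.g. 'Wed 2015-08-05 03:00:12 UTC'.

    Returns None if the value is empty or not in UTC (assumption: UTC host).
    """
    parts = value.split()
    if len(parts) != 4 or parts[3] != "UTC":
        return None
    try:
        stamp = datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return stamp.replace(tzinfo=timezone.utc)


def evaluate(props, now, max_age_hours):
    """Return (exit_code, message) for the unit properties in props."""
    if props.get("Result") != "success":
        return CRITICAL, "last run failed (Result=%s)" % props.get("Result")
    finished = parse_timestamp(props.get("ExecMainExitTimestamp", ""))
    if finished is None:
        return UNKNOWN, "no completed run recorded"
    age_hours = (now - finished).total_seconds() / 3600
    if age_hours > max_age_hours:
        return CRITICAL, "last run finished %.1f hours ago" % age_hours
    return OK, "last run succeeded %.1f hours ago" % age_hours


def main(unit, max_age_hours):
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=Result,ExecMainExitTimestamp"],
        capture_output=True, text=True, check=False,
    ).stdout
    code, message = evaluate(
        parse_show_output(out), datetime.now(timezone.utc), max_age_hours
    )
    print(message)
    return code


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: check.py <unit name> <max age in hours>
    sys.exit(main(sys.argv[1], float(sys.argv[2])))
```

Keeping `evaluate` separate from the `systemctl` call makes it easy to exercise the failure and staleness paths with canned input, which is one way to "make sure they work by having them fail" without breaking a real backup.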

Considered resolved since the reinstall is the validation (T107574)

Blah. Confused two tickets.

Change 230569 had a related patch set uploaded (by coren):
labstore: add timers for backups

https://gerrit.wikimedia.org/r/230569

Change 230569 merged by coren:
labstore: add timers for backups

https://gerrit.wikimedia.org/r/230569

The backups, they are run.