
Process for user backups
Closed, Declined (Public)

Description

The basic requirement is to allow Labs end users to recover past versions of files and databases, for both disaster recovery and error recovery.

This used to work with timetravel snapshots, but that feature was disabled in late 2013 when it was suspected of causing instabilities, and it has remained disabled since. Now that the true culprit has been found, reenabling it should be the first layer of redundancy.

In addition, replication of hourly snapshots between the DCs will provide cross-site backups.

As this covers only the filesystems, a mechanism needs to be devised by which users can also make safety backups of databases. In practice, a simple mechanism for scheduling dumps of a database to the filesystem should suffice, since the filesystem snapshots and replication then handle the redundancy.
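As a rough illustration of what such a scheduled dump could look like (a minimal sketch, not a decided design): the script below assumes a MySQL-compatible database dumped with `mysqldump`, credentials supplied through the user's own option file, and a hypothetical backup directory on the snapshotted filesystem. The function names and the retention count are made up for the example.

```python
import datetime
import pathlib
import subprocess


def dump_path(backup_dir, database, now):
    """Return a timestamped target such as <backup_dir>/<database>-20150102T030405.sql."""
    stamp = now.strftime("%Y%m%dT%H%M%S")
    return pathlib.Path(backup_dir) / f"{database}-{stamp}.sql"


def dump_command(database, target):
    """Build the mysqldump invocation (assumes credentials come from an option file)."""
    # --single-transaction takes a consistent dump of InnoDB tables without locking them.
    return ["mysqldump", "--single-transaction", f"--result-file={target}", database]


def stale_dumps(backup_dir, database, keep):
    """Return the dumps to delete, oldest first, leaving the `keep` most recent."""
    # Timestamped names sort chronologically, so a plain sort is enough.
    dumps = sorted(pathlib.Path(backup_dir).glob(f"{database}-*.sql"))
    return dumps[:-keep] if keep > 0 else dumps


def run_backup(backup_dir, database, keep=7):
    """Dump the database, then prune old dumps. Intended to be run from cron."""
    target = dump_path(backup_dir, database, datetime.datetime.now())
    subprocess.run(dump_command(database, target), check=True)
    for old in stale_dumps(backup_dir, database, keep):
        old.unlink()
    return target
```

Run from a user's crontab, something like this would leave a handful of recent dumps on the filesystem, where the snapshots and cross-DC replication described above would pick them up.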

Event Timeline

coren created this task. Dec 31 2014, 2:50 PM
coren raised the priority of this task from to Needs Triage.
coren updated the task description.
coren added a project: Cloud-Services.
coren added a subscriber: coren.
coren updated the task description. Dec 31 2014, 3:32 PM
coren set Security to None.
coren triaged this task as Normal priority. Feb 10 2015, 9:34 PM
coren moved this task from Triage to Stalled on the Cloud-Services board.
coren added a comment. Mar 2 2015, 3:31 PM

Bugs in LVM2 make thin snapshots unreliable; I'm backporting lvm2 to Precise to fix this.
(WIP at https://launchpad.net/~marc-u/+archive/ubuntu/wmf/+packages)

coren added a comment. Mar 24 2015, 2:10 AM

Jessie saves. Snapshots are back and working, but not yet user-accessible (design work will be needed; perhaps automount?). At the very least, once we turn the feature on, admins can recover user files.

coren moved this task from Stalled to In Progress on the Cloud-Services board. Mar 24 2015, 2:10 AM

This is now working on the new (not live) filesystem. We are pending only the switch.

Change 199267 had a related patch set uploaded (by coren):
WIP: Proper labs_storage class

https://gerrit.wikimedia.org/r/199267

coren moved this task from Backlog to Doing on the Labs-Q4-Sprint-1 board.
coren removed a project: Labs-Q4-Sprint-1.

Change 199267 abandoned by coren:
WIP: Proper labs_storage class

Reason:
Superseded by https://gerrit.wikimedia.org/r/220618

https://gerrit.wikimedia.org/r/199267

yuvipanda closed this task as Declined. Jul 24 2015, 3:35 AM
yuvipanda claimed this task.
yuvipanda added a subscriber: yuvipanda.

I do not think we should offer time travel backups that are user recoverable, at least for the time being. Let's work on getting actual DR backups running and improving the performance of the system first. For now, users can always recover files by asking an admin (who can look in the latest snapshot if we have continuous backups running).

We can revisit this at a later point if desired.