Backup systems (tracking)
Closed, Resolved (Public)

Description

Author: elubarsky

Description:
I think there should be a bug to track updates to the docs and eventual implementations.

IMHO we need to make sure text & images are protected from hardware faults, accidental software errors (which can replicate..), and even malicious deletion/corruption by someone gaining access through a security breach.


Version: unspecified
Severity: major
URL: http://wikitech.wikimedia.org/view/Backup_procedures

Details

Reference
bz18255
bzimport set Reference to bz18255.
bzimport added a subscriber: Unknown Object (MLST).
brion added a comment. Apr 22 2009, 4:48 PM

Assigning fun sysadmin task to Fred, adding CCs for Rob and Tomasz.

The backup procedures/status chart at http://wikitech.wikimedia.org/view/Backup_procedures needs to be audited; it hasn't been updated in several months. Some services have been moved to new servers, and automatic backup jobs might not have been updated; others were last seen still pending full offsite backups, or had only ad-hoc procedures.

We want to make sure all the empty or red spots are filled in and cleaned up, and that we have a good idea how to recover from loss of any one of these services.

Tfinc added a comment. Apr 22 2009, 5:09 PM

I'd love to have a status page for the backups that's updated regularly. Should be easy enough with a call to the MediaWiki write API upon success or failure.
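A minimal sketch of what such a call could look like, assuming a backup job that reports its own outcome. The endpoint URL, the page title, and the function names are hypothetical; `action=edit` with `section=new` is the standard MediaWiki edit module, and logging in and fetching the edit token are left out here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://wiki.example.org/w/api.php"  # hypothetical endpoint
STATUS_PAGE = "Backup status"                   # hypothetical page title


def build_status_edit(job, succeeded, finished_at, token):
    """Return the POST fields for an action=edit call recording one backup run."""
    outcome = "OK" if succeeded else "FAILED"
    return {
        "action": "edit",
        "format": "json",
        "title": STATUS_PAGE,
        "section": "new",  # append a new section instead of replacing the page
        "sectiontitle": f"{job}: {outcome}",
        "text": f"Job {job} finished at {finished_at} with status {outcome}.",
        "summary": f"backup {job}: {outcome}",
        "token": token,  # edit token obtained after logging in (not shown)
    }


def post_status(fields, opener=urllib.request.urlopen):
    """POST the fields to the API and return the decoded JSON reply."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(API_URL, data=body)
    with opener(request) as reply:
        return json.load(reply)
```

A job wrapper would call `build_status_edit(...)` once at the end of each run, on success or failure, and pass the result to `post_status`.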

brion added a comment. Apr 22 2009, 5:22 PM

Amen, brother!

(Should this also be integrated to Nagios monitoring/alerts?)
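For the Nagios side, a rough sketch of a check plugin, assuming each backup job writes the epoch time of its last successful run to a stamp file. The stamp-file convention and the thresholds are assumptions; the exit codes 0, 1 and 2 are the usual Nagios plugin states OK, WARNING and CRITICAL.

```python
import sys
import time

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def check_backup_age(last_success, now, warn_hours=26, crit_hours=50):
    """Map the age of the last good backup to a Nagios state and message."""
    age_hours = (now - last_success) / 3600
    if age_hours >= crit_hours:
        return CRITICAL, f"BACKUP CRITICAL - last success {age_hours:.1f}h ago"
    if age_hours >= warn_hours:
        return WARNING, f"BACKUP WARNING - last success {age_hours:.1f}h ago"
    return OK, f"BACKUP OK - last success {age_hours:.1f}h ago"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Assumption: argv[1] is a stamp file holding the epoch time of the last success.
    with open(sys.argv[1]) as stamp:
        state, message = check_backup_age(float(stamp.read()), time.time())
    print(message)
    sys.exit(state)
```

The default thresholds assume a daily job: a warning after one missed run, critical after two.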

fvassard wrote:

I am currently evaluating a free solution called Amanda to properly take care of backups.
I believe this solution would also offer easy notification of successes / failures and enable us to have the dashboard Tomasz mentioned.
Also 'waiting' on storage to be available, which should theoretically arrive once I have had enough time to work through Amanda's configuration setup.
More to come!

--Fred.

elubarsky wrote:

Considering how valuable and high-quality this data is (and the massive effort the community has put into it), I'm rather concerned that there haven't been many concrete updates to this bug and pretty much none at all to http://wikitech.wikimedia.org/view/Backup_procedures. From that page, images and OTRS data seem to be quite vulnerable even though they are obviously very important to the project. Could we at least have some reassurance?

Adding 'tracking' keyword.

Giving half of Fred's old bugs to Ashar since I trust him to get it done or reassign if he doesn't have time.

hexmode>
Tasks coming to mind:

  • which items need to be backed up (this bug report provides some)
  • identify for each item:
    • a backup solution (rsync, ftp, NAS, ..), incremental, snapshots, both?
    • off-site, off-foundation requirements
    • backup frequency (daily, weekly, monthly).
    • retention duration (month, year ...)
  • make sure the backup tasks are documented and monitored
  • make sure operations know where the documents are and train them
  • actually test backup procedures and data from time to time on a test server. This is to ensure the tools and procedures are up to date. There is nothing worse than a bad backup (data filled with zeroes, for example) or wasting 2 hours looking for the documentation / hacking scripts around.
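As one small example of the last point, here is a sketch of a sanity check that catches the "data filled with zeroes" case. The function name is hypothetical, and it only covers this one failure mode; it is no substitute for an actual restore test.

```python
def looks_zero_filled(path, chunk_size=1 << 20):
    """Return True if the file is empty or contains only NUL bytes."""
    with open(path, "rb") as backup:
        while True:
            chunk = backup.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            if chunk.strip(b"\x00"):
                # Something other than NUL bytes is present.
                return False
```

Run against each freshly written dump, this would flag an obviously bad file before anyone relies on it.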

I believe this should be raised to CT Woo and assigned to an operations program manager.

Wikitech has some documentation:
https://wikitech.wikimedia.org/view/Backup_procedures

content hidden as private in Bugzilla

Resetting this bug to the default assignee. There is not really anything I could do; this is an operations team issue.

Not sure why this report was marked as "tracking" if nothing has been ever tracked here (no dependency bug reports), plus scope unclear: What would be needed to get this fixed? Sounds rather like a continuous task (not fixable).

(In reply to comment #11)

Not sure why this report was marked as "tracking" if nothing has been ever
tracked here (no dependency bug reports),

Probably, because we never found someone bothering to create/track the relevant subtasks.

The backup system is no longer tracked in Bugzilla. The ops team uses RT and has a different set of tools to plan its backup strategy, so this bug report is rather useless and I am closing it.

elubarsky wrote:

Hi Antoine, could you post a link to where this is now tracked?

This report was meant as a "tracking" report, but random unrelated requests were added to it instead of being filed separately and marked as dependencies. So to me it's already unclear what you'd actually like to "track".

FYI, RT is located at https://rt.wikimedia.org/ (refer to bug 30413 for potential meta-discussions about RT itself).

elubarsky wrote:

A few years ago the situation at https://wikitech.wikimedia.org/wiki/Backup_procedures wasn't nearly as good as it is now. In particular there weren't proper disconnected off-site backups for text & especially images. So I started this report "to make sure text & images are protected from hardware faults, accidental software errors (which can replicate..), and even malicious deletion/corruption by someone gaining access through a security breach."

It's great to see that much progress has been made (thanks to everyone at Wikimedia for that!), but would you say that the risks in my original concern are now minimised as much as possible? I think matters relating to the progress of backup systems should be kept public since they are obviously of concern to Wikipedia editors who'd like their considerable efforts to not get lost.
