
Decision request - Doing VM backups
Closed, Resolved · Public

Description

Problem

Since we migrated the VMs to Ceph storage, we have been doing backups of the VM disks in case something went wrong with the new technology.

Since then, Ceph has proved stable and reliable; it keeps 3 copies of the data distributed across 3 different hosts.

There have not been many occasions (maybe a couple) on which we have used these backups, and the current setup needs a bit of work (<5 days) to get to a stable state.

Currently these backups are running on a few of the cloudvirts that ended up with lots of spare space after moving to Ceph, but we have a couple of new bare metal machines that were ordered to be dedicated to these (and potentially other) backups.
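For context, a VM disk backup on Ceph typically follows a snapshot-then-export pattern. The sketch below is hypothetical: the pool name, image name, destination path and the DRY_RUN guard are illustrative assumptions, not the actual WMCS scripts.

```shell
#!/bin/sh
# Hypothetical sketch of snapshot-then-export VM disk backup on Ceph.
# Pool/image names and the DRY_RUN guard are assumptions, not the real tooling.
set -eu

POOL="eqiad1-compute"          # assumed pool name
IMAGE="${1:-demo-vm_disk}"     # hypothetical VM disk image
SNAP="backup-$(date +%Y%m%d)"
DEST="/srv/backups/${IMAGE}-${SNAP}.img"
DRY_RUN="${DRY_RUN:-1}"        # default: only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Freeze a point-in-time view of the disk.
run rbd snap create "${POOL}/${IMAGE}@${SNAP}"
# 2. Copy that view off-ceph onto local storage.
run rbd export "${POOL}/${IMAGE}@${SNAP}" "${DEST}"
# 3. Drop the snapshot so it cannot leak and keep consuming space.
run rbd snap rm "${POOL}/${IMAGE}@${SNAP}"
```

Step 3 is the part that, when skipped on failure, produces the leaked snapshots mentioned below.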

Constraints and risks

  • Doing nothing will end up with the backups misbehaving (using extra space, filling up disks, leaking Ceph snapshots, ...), so that's the least preferred option.
  • Not doing any backups (without an alternative) leaves us no way of restoring our users' VMs or our own in cases of:
    • User mistake (deleting files, etc.)
    • Ceph issues (disk corruption, cluster mishap, etc.)
    • Disk corruption at the VM level (migration issue, OS issue, etc.)

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T304058_doing_VM_backups

Options

Option 1

Do nothing

Pros:

  • No new investment needed
  • No new hardware needed

Cons:

  • Maintenance:
    • the current scripts have issues: they might fill up the disk (roughly once every 6 months) and require manual attention (running one command, plus some broken backups)
    • the current scripts don't clean up leaked Ceph snapshots (no estimation, as it has not yet become an issue)
  • There's some overhead on a few of the cloudvirts (not noticeable so far)
  • We are not testing the backups, so there's a chance they might not work in the future (so far, when we tried recovering, they worked).
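The missing snapshot cleanup could look something like the sketch below: list the snapshots on an image and remove those matching the backup naming convention. The pool/image names, the "backup-" prefix and the stubbed listing are assumptions so the sketch runs without a cluster; the real scripts differ.

```shell
#!/bin/sh
# Hypothetical sketch of pruning leaked backup snapshots on a Ceph RBD image.
# Names, the "backup-" convention and the stubbed listing are assumptions.
set -eu

POOL="eqiad1-compute"      # assumed pool name
IMAGE="${1:-demo-vm_disk}" # hypothetical VM disk image
PREFIX="backup-"           # assumed snapshot naming convention
DRY_RUN="${DRY_RUN:-1}"

# Stub of: rbd snap ls "${POOL}/${IMAGE}" --format json | jq -r '.[].name'
# so the sketch is runnable without Ceph.
list_snaps() {
    printf '%s\n' "backup-20220101" "backup-20220301" "manual-snap"
}

prune_snaps() {
    list_snaps | while read -r snap; do
        case "$snap" in
            "${PREFIX}"*)
                if [ "$DRY_RUN" = "1" ]; then
                    echo "would remove ${POOL}/${IMAGE}@${snap}"
                else
                    rbd snap rm "${POOL}/${IMAGE}@${snap}"
                fi
                ;;
        esac
    done
}

prune_snaps
```

Matching only a known prefix keeps the pass from touching snapshots created manually or by other tools.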

Option 2

Improve the current scripts and move them to the backup bare metal machines.

Pros:

  • We still have VM backups off-ceph in case of disaster, for us and most of our users.

Cons:

  • We have to make an initial investment (<5 days estimated)
  • Maintenance: we have to make an ongoing investment (maintaining/testing the backups, <1h/month estimated)
  • We are using the hardware (cloudbackups) for this and not other things (nothing else planned afaik, so just future potential)

Option 3

Stop doing VM backups

Pros:

  • No maintenance effort needed
  • Free hardware (cloudbackups)
  • Free space in ceph

Cons:

  • We have to make an initial investment (<5 days estimated) to remove what we have
  • There's no safety net in case of a disaster on Ceph, user error (deleting files), OS error (filesystem corruption) or otherwise, for us or our users. That means we might not be able to restore systems like toolforge in case of a disaster.
    • An alternative would require far more effort and time, though it might be more future-proof (we would likely still need to back up things like databases, etcd, redis and similar data stores for critical components).

Event Timeline

dcaro renamed this task from Decision request template - doing VM backups to Decision request - Doing VM backups.Mar 17 2022, 10:57 AM
dcaro updated the task description.

Good backups are good. As noted above, good backups take continuous effort. I would be concerned with telling the community that we are backing up their VPS projects, as that would invoke an image of "if I delete everything, WMCS can put it all back", which is not necessarily true. The more nuanced promise of "we back things up, but don't really pay attention to it, so if you ever need it back, realize there is some unknown chance that it really is not backed up" is not much better.

I feel that we should be more in the category of a web service provider: we offer hardware and a platform to work with, but we don't, officially, support what a community member does with it. I feel like this manages expectations a lot better than offering services that we're not likely to have the time to fully support. It would leave a sour taste in my mouth as a community member if I was told there was a service, and found that, indeed, there was not when I needed it.

In the case of backups in particular, I feel we should avoid them internally, in most cases, as well. Which is to say, we should not need to restore a system from backup, because that system should be defined in code. I'll trust github and the local gitlab to have good backups. While I realize a lot of community members won't do this by default, we should still lead by example. There are a few projects that are data heavy and do need backups; in those cases the project should back up locally to itself, and we can grant them the disk they need in their VPS project.

I would be concerned with telling the community that we are backing up their VPS project

This is not under discussion (in this task); we are not offering this service and users should not expect us to.

In the case of backups in particular, I feel we should avoid them internally, in most cases, as well. Which is to say, we should not need to restore a system from backup, because that system should be defined in code. I'll trust github and the local gitlab to have good backups. While I realize a lot of community members won't do this by default, we should still lead by example. There are a few projects that are data heavy and do need backups; in those cases the project should back up locally to itself, and we can grant them the disk they need in their VPS project.

We still need to do backups of our own data in any case; this solves that problem at the VM level, instead of having to back up the specifics of each datastore.
If we still don't want to have VM backups, we would have to back up that sensitive data and restore it after recreating the environment (that data can not go into the code of "infrastructure as code"; afaik, we can't dump databases into gerrit private repos).
I completely agree that we should avoid them as much as possible, but I think the "as possible" is where we are at this point, at least until we have the automation.

Let me know if you want me to add the option of "Automating the recreation of the environment and adding backup/restore facilities for the specific data in them", but I considered that to be a project in itself and a long-term effort (not doable in the next year with the current goals).

Let me know if you want me to add the option of "Automating the recreation of the environment and adding backup/restore facilities for the specific data in them", but I considered that to be a project in itself and a long-term effort (not doable in the next year with the current goals).

Btw, this was my position when I started; that's when I started building the toolforge* cookbooks and pushing the automation (https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Operational_Automation).

How much data do we have internally that benefits from backups? Or perhaps, what is that data?

How much data do we have internally that benefits from backups? Or perhaps, what is that data?

I don't have an exhaustive answer to this, given the "loose" definition of internally, but here are some things we currently do backups of:

  • Tools/toolsbeta:
    • redis (not sure if this needs backups though, I don't remember what's there)
    • docker registry
    • elastic
    • puppetdb
    • prometheus
    • sge/grid masters -> No idea if these can be rebuilt from scratch with just nfs, though I highly doubt it.
  • cloudinfra
    • db03/04
  • cloudstore
    • No idea what this is
  • paws
    • etcd
    • puppetdb
    • nfs (not sure what's in here)
  • quarry
    • db
    • redis (probably just cache though)
    • nfs (not sure what's here either)
  • metricsinfra
    • prometheus

Some things we do *not* do backups of:

  • admin: this is an empty project
  • clouddb-services: this contains toolsdb
  • cloudinfra-nfs
  • wmflabsdotorg: no idea what it is
  • tools:
    • etcd

Doing something more exhaustive will take some time and effort (though it's probably worth doing).