
Shall we drop the backy2 backup jobs
Closed, Declined · Public

Description

When we started doing backy2 backups, the goal was recovering VMs that were hard to replace, and our glance images, in the event of a ceph cluster disaster (we had caused one ourselves by involving the wrong VLANs during a network renumbering).

Over time, backy2 has proven that it likes to occasionally explode, and it has exposed that some of our openstack APIs occasionally return random errors or lose auth (which is a separate problem). Meanwhile, we've now introduced cinder (T269511: Attachable block storage in cloud-vps), which is intended to be more persistent storage than VM disks, without backing it up, and we have had to remove VMs from the backup system over time. Ceph is now quite stable, but recovering it would require a proper mirror at this point anyway, which we have no plans to implement so far.

I'm suggesting we should stop supporting the backy2 backups since they are becoming increasingly unhelpful, and everything we support is extra work we could be putting elsewhere. It was just meant as a stopgap until we were sure we'd stabilized ceph, anyway.

Event Timeline

Bstorm renamed this task from "Shall we drop the backy2 backup jobs from ceph?" to "Shall we drop the backy2 backup jobs". · Aug 19 2021, 5:49 PM
Bstorm moved this task from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
nskaggs triaged this task as Medium priority. · Aug 19 2021, 8:01 PM

Though I agree that the current setup is not as useful as it could be, I do think that we should find a replacement first. This might require some refining; here are some random ideas, feel free to ignore them:

Goals (as I understand them):

  1. Cloud VPS/toolforge (and maybe quarry/paws?) disaster recovery due to ceph breakage (deemed unlikely, so only for very critical VMs/volumes)
  2. Prevent disasters on the user side for any VPS project (ex. bad upgrade, filesystem corruption, bad puppet patch, rm -rf *, ...)
  3. Prevent disasters on the admin side for any VPS project (ex. removing a VM, bad puppet patch, ...)

Backup targets (so far; might expand in the future with swift/radosgw/s3 stuff):
a) critical/service VM images
b) non-critical/service VM images
c) critical/service cinder volumes (we have none right now afaik)
d) non-critical/service cinder volumes
e) glance images

Amount of infra needed:
S) small
M) medium
L) large

Amount of one-time effort needed (rough estimate):
S) small
M) medium
L) large
XL) extra large

Amount of continuous effort needed (rough estimate):
S) small
M) medium
L) large

Similar approach to the current one (a minimal sketch of the two variants follows this list):
a) Back them up out of ceph and as snapshots. (addresses 1/2/3, infra L, one-time L, continuous M)
b) Back them up as ceph snapshots. (addresses 2/3, infra M, one-time M, continuous M)
c) Back them up out of ceph and as snapshots. (addresses 1/2/3, infra L, one-time L, continuous M)
d) Back them up as ceph snapshots. (addresses 2/3, infra M, one-time M, continuous M)
e) Back them up out of ceph and as snapshots. (addresses 1/2/3, infra L, one-time L, continuous M)
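For concreteness, a minimal sketch of what the two variants boil down to, using the librbd Python bindings (python3-rbd); the pool, image, and snapshot names here are made up, not our real ones:

```
import rados
import rbd

CHUNK = 4 * 1024 * 1024  # stream the image out in 4 MiB reads

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('compute')  # hypothetical pool name

# "As ceph snapshots": cheap and instant, but the copy lives (and dies)
# with the cluster, so it only covers goals 2/3.
with rbd.Image(ioctx, 'some-vm-disk') as image:
    image.create_snap('backup-20210819')

# "Out of ceph": stream the snapshot's contents to storage outside the
# cluster, which is what additionally covers goal 1 (whole-cluster disaster).
with rbd.Image(ioctx, 'some-vm-disk', snapshot='backup-20210819') as image:
    size = image.size()
    with open('/srv/backups/some-vm-disk.img', 'wb') as out:
        for offset in range(0, size, CHUNK):
            out.write(image.read(offset, min(CHUNK, size - offset)))

ioctx.close()
cluster.shutdown()
```

The "out of ceph" variants are roughly what backy2 already does for us (with deduplication on top), which is where the larger infra and one-time costs come from.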

Rebuild approach:
a) Make sure to have an up-to-date and robust way of rebuilding the VMs (including no SPOF). (addresses 1/2/3, infra M, one-time XL, continuous M) -> this might be tricky and hard to achieve, or not ¯\_(ツ)_/¯
b) Back them up as ceph snapshots. (addresses 2/3, infra M, one-time M, continuous M)
c) Back them up out of ceph and as snapshots. (addresses 1/2/3, infra L, one-time L, continuous M)
d) Back them up as ceph snapshots. (addresses 2/3, infra M, one-time M, continuous M)
e) Back them up out of ceph and as snapshots. (addresses 1/2/3, infra L, one-time L, continuous M)

No backups approach:
No assurance that we can recover any VM/volume/image, so increased downtime risk and impact, but no work or infra needed for it at all (infra S, one-time S, continuous S).

And of course, there are most probably many more options, especially if you get a better/different idea of the goals and backup targets.
Sorry for the long post.

I generally think that it's good to have backups. I'd rather we moved forward towards backing up cinder volumes as well as or instead of VMs.

Another path forward is to use swift and/or snapshots to allow users to effectively back up their own things as needed. I'm not sure I like that better than automatic universal backups though.
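To illustrate the self-serve angle, a user could take their own point-in-time copy with something like the openstacksdk snippet below; the cloud and volume names are hypothetical:

```
import openstack

# Credentials come from the user's clouds.yaml; 'cloudvps' is a placeholder.
conn = openstack.connect(cloud='cloudvps')

volume = conn.block_storage.find_volume('my-tool-data')
# force=True allows snapshotting a volume that is attached to a running VM;
# the result is crash-consistent, like pulling the power cord.
conn.block_storage.create_snapshot(
    volume_id=volume.id,
    name='my-tool-data-pre-upgrade',
    force=True,
)
```

The nice part is that quotas and multitenancy come for free, since it all happens within the user's own project.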

My concern here is that we never planned this as a viable backup service in the first place. It was a stopgap solution, selected quickly in order to have a buffer in case our ceph layout was somehow inherently unstable, and we are now convinced it is not. Back then we had no cinder storage, only ephemeral storage and necessarily-poorly-designed VMs. We have never announced this or provided it as a direct service to users, except in one incidental case where we deleted a VM because it was not intended to exist. We did not plan staffing or hardware resourcing for this as a long-term backup solution. We chose backy2 because it was the option that was least firmly warned against.

Maintaining systems that were never intended to be permanent or deliberately planned is part of why the cloud burns people out so easily, and that describes more of our setup than I'd like to think. If we set out to design this as a real service, we must also figure out how to staff it and resource the hardware properly. We also should probably not be thinking of it in terms of VMs: clouds where people are actually paying for the VMs don't provide VM backups.

The overall design intent of Openstack is that if it's not on cinder or swift, it's temporary (it lives and dies with the VM). By backing up the VMs we are upending that, and in many ways duplicating the glance images, so I don't think it's a good idea; it encourages poor behavior to begin with. Our old build of Openstack necessitated bad practices, but we've been working in the right direction since, and I'd like to build on that.

For user data, we have aimed for the past 3 years to provide a self-serve solution, since we would otherwise need to provide a staffed service (which the NFS-based one currently is) (T209530). We bought the hardware for the previous vision of it, but now that we have ceph, it seems a lot smarter to use snapshots, cinder, and swift, where we get quotas, multitenancy, and self-serve out of the box. To complicate this, we are also users of this setup, and we should design backups for our own stuff that's in the cloud, but we should do that in ways that the users can use too. When we just work around things with root, we make more work for ourselves and limit the help we can get from the community.

So that's part of my context for this. Backups are good, but these aren't the backups we are looking for. To my mind this is a question not of "should we have backups" but "should we keep trying to prevent this specific kind of disaster at this point".

Though I agree that the current setup is not as useful as it could be, I do think that we should find a replacement first. This might require some refining; here are some random ideas, feel free to ignore them:

<snip>
This is good thinking. I think we need to go through a planning process like this before we actually deploy a backup system, and that's part of why I kind of want to kill off this one. I don't think we should view it as needing a replacement, because it was really just a hack that won't scale as-is and was never meant to be a real backup solution, despite how much you and @Andrew managed to make it work like one. We've got software, but we don't have an actual plan or design for backups, because we weren't really intending to do that. I'm interested in whether you think we should replace it first, in the context of my rant above, or whether we are stable enough now to drop it so it stops absorbing disk space, work, alerts, etc., and put that time into the replacement instead.

Goals (as I understand them):

  1. Cloud VPS/toolforge (and maybe quarry/paws?) disaster recovery due to ceph breakage (deemed unlikely, so only for very critical VMs/volumes)
  2. Prevent disasters on the user side for any VPS project (ex. bad upgrade, filesystem corruption, bad puppet patch, rm -rf *, ...)
  3. Prevent disasters on the admin side for any VPS project (ex. removing a VM, bad puppet patch, ...)

The goals for this system were so much less ambitious. It was really just "have a back-out plan". It sort of became a lot more than that, but that's why I'm trying to stop and think about it here, especially with it causing space issues and with how much troubleshooting time it has absorbed.

My version of your above goals are:

  1. Establish a service continuity plan for Cloud VPS itself. This has not been done; it was almost funded, but then it wasn't. I think a ceph mirror of some kind would ideally be part of that, which appears to be what most other openstack service providers are doing.
  2. Provide self-serve tools for Cloud VPS users to keep their data safe and prevent VMs from becoming pets instead of cattle that are easily replaced.
  3. Use those tools for WMCS-maintained VPS projects to set up service continuity plans for those projects.

I feel like none of those are well served by this setup, but I am eager to prioritize doing them with a new setup. We could at least start the task, since that doesn't change anyone's existing goals or plans, no?

I generally think that it's good to have backups. I'd rather we moved forward towards backing up cinder volumes as well as or instead of VMs.

What if we *only* backed up cinder volumes? I'm not sure I like that all that much either, at least using this system, since we aren't taking into account how the databases and such actually work (our backups would not be in a consistent state), but it would be closer to the practice of treating cinder as the valuable data, and it would make for a better story for our users. I still have my concerns about this as a WMCS-provided service, though.
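A sketch of what the cinder-only variant could look like with openstacksdk, assuming we deployed the cinder-backup service (the cloud name is hypothetical):

```
import openstack

conn = openstack.connect(cloud='cloudvps')

# Back up every volume in the project. Note these are only crash-consistent:
# a database that is mid-write when the backup is taken may need recovery on
# restore, which is exactly the consistency concern above.
for volume in conn.block_storage.volumes():
    conn.block_storage.create_backup(
        volume_id=volume.id,
        name=f'{volume.name}-backup',
        force=True,  # allow backing up attached, in-use volumes
    )
```

cinder-backup can also do incremental backups (incremental=True) once a full one exists, which would help with the continuous-cost column above.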

Another path forward is to use swift and/or snapshots to allow users to effectively back up their own things as needed. I'm not sure I like that better than automatic universal backups though.

I actually view the latter as the better option per my rants above. This means ceph backs up to ceph, of course....

Boldly closing! Lots of interesting discussion here, but the answer to the main question is 'no'.