
Investigate/prototype ceph backup options
Open, High, Public

Description

In the near-term we're only going to put truly disposable 'cattle' instances on ceph. In the meantime, though, we should come up with some sort of backup/restore process.

It's true that we currently have no backups for VMs, but our current failure case is losing one hypervisor's worth of VMs, whereas with Ceph we now run the risk of losing the whole cloud if Ceph freaks out.

Quick summary of most recent conversation:

  • We probably want to use Backy2 for this (a rough sketch of its snapshot/diff backup flow follows this list). We might also use Benji, which has fancier compression but is a younger project.
  • For proof-of-concept (and possibly near-term production) we'll use cloudstore1008/9.
    • For full-scale backups we probably need new hardware, but will learn more about storage needs as we go.
  • Some users (e.g. https://www.reddit.com/r/ceph/comments/61nmfv/how_is_anyone_doing_backups_on_cephrbd/) have had trouble with Ceph freezing when capturing snapshots for backup.
    • For starters we're going to hope that that isn't a problem for us; if it is then we'll have to consider creating a mirrored cluster just for backup purposes.
      • Possibly that mirror could have only one replica rather than three, which might make it affordable.
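For reference, here is a minimal sketch (Python driving the rbd and backy2 CLIs) of the snapshot-plus-hints cycle that the backy2 documentation describes for Ceph RBD. The pool, image and snapshot names are made up, and any eventual production script may differ in the details:

```
#!/usr/bin/env python3
"""Rough sketch of the snapshot + diff cycle that backy2's documentation
describes for Ceph RBD.  Pool, image and snapshot names are made up; the
real backup script may well differ in detail."""

import subprocess
from typing import Optional


def backup_rbd_image(pool: str, image: str, snap: str,
                     from_snap: Optional[str] = None,
                     from_version_uid: Optional[str] = None) -> None:
    """Snapshot an RBD image, dump its block map as a hints file, and hand
    both to backy2 (differential if a previous snapshot/version is given)."""
    spec = f"{pool}/{image}@{snap}"
    subprocess.run(["rbd", "snap", "create", spec], check=True)

    # The hints file tells backy2 which blocks to read at all: everything
    # allocated for a full backup, only what changed for a differential one.
    diff_cmd = ["rbd", "diff", "--whole-object", spec, "--format=json"]
    if from_snap:
        diff_cmd += ["--from-snap", from_snap]
    hints = subprocess.run(diff_cmd, check=True, capture_output=True,
                           text=True).stdout
    hints_file = f"/tmp/{image}@{snap}.diff.json"
    with open(hints_file, "w") as f:
        f.write(hints)

    backup_cmd = ["backy2", "backup", "-s", snap, "-r", hints_file,
                  f"rbd://{spec}", image]
    if from_version_uid:
        # Base the new version on an existing one so unchanged blocks are
        # deduplicated rather than re-read.
        backup_cmd += ["-f", from_version_uid]
    subprocess.run(backup_cmd, check=True)

    if from_snap:
        # The old snapshot is no longer needed once the diff has been taken.
        subprocess.run(["rbd", "snap", "rm", f"{pool}/{image}@{from_snap}"],
                       check=True)


# Initial full backup, then a differential one the next day:
# backup_rbd_image("eqiad1-compute", "i-0000abcd_disk", "backup-20200807")
# backup_rbd_image("eqiad1-compute", "i-0000abcd_disk", "backup-20200808",
#                  from_snap="backup-20200807",
#                  from_version_uid="<uid from `backy2 ls`>")
```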

For the first round of tests/experiments, I'd like to answer these questions:

  • Does the upstream backy .deb install on Buster?
  • Can we do this using local storage on cloudstores, or do we need it on NFS?
  • What are some rough numbers for how big a backup image is, relative to the initial VM size? (see the measurement sketch after this list)
    • Same question for incremental backups
  • Does Ceph misbehave for our users during the backup process?
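One rough way to get those size numbers is to compare what `rbd du` reports for an image against how much the backy2 data directory grows across a single backup run. A sketch of that follows; the data-backend path, pool and image names are assumptions rather than the real values on the test hosts:

```
#!/usr/bin/env python3
"""Back-of-the-envelope helper for the size questions above: compare what an
RBD image reports via `rbd du` with how much the backy2 data directory grows
across one backup run.  Paths and names are placeholders."""

import json
import subprocess

BACKY_DATA_DIR = "/srv/backy2/data"   # assumption: file data backend location


def rbd_usage(pool: str, image: str) -> dict:
    """Provisioned and actually-used bytes for a single RBD image."""
    out = subprocess.run(["rbd", "du", "--format", "json", f"{pool}/{image}"],
                         check=True, capture_output=True, text=True).stdout
    info = json.loads(out)["images"][0]
    return {"provisioned": info["provisioned_size"], "used": info["used_size"]}


def backup_dir_bytes(path: str = BACKY_DATA_DIR) -> int:
    """Bytes consumed under the backy2 data backend directory."""
    out = subprocess.run(["du", "-sb", path], check=True, capture_output=True,
                         text=True).stdout
    return int(out.split()[0])


if __name__ == "__main__":
    before = backup_dir_bytes()
    # ... run one full or differential backup here ...
    after = backup_dir_bytes()
    usage = rbd_usage("eqiad1-compute", "i-0000abcd_disk")  # hypothetical names
    print(f"image uses {usage['used']} of {usage['provisioned']} bytes; "
          f"the backup added {after - before} bytes")
```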

Event Timeline

Bstorm added a subscriber: Bstorm. (Wed, Jul 29, 8:40 PM)

This is the slide deck from OVH at FOSDEM about how they ended up with Ceph backing up to Ceph: https://archive.fosdem.org/2018/schedule/event/backup_ceph_at_scale/attachments/slides/2671/export/events/attachments/backup_ceph_at_scale/slides/2671/slides.pdf

It's good for reference because it describes their successes and failures in multiple backup system attempts.

There are some (largely unhelpful) recent discussions about that talk, as well as a link to the video, here: https://www.reddit.com/r/ceph/comments/cznqoz/ceph_whole_cluster_backuprestore/

The long and short of it is that some people are saving a bit by backing up to radosgw, which doesn't give you as fast a hot-swap cluster, but there it is. It also emphasizes the not-that-great state of backups in Ceph.

Also, if any solution requires us to run our own backup daemon, and we don't want to just write a Python service, we could do something like what Bacula does here: http://wiki.bacula.org/doku.php?id=application_specific_backups and use Bacula, since that's already a thing at the foundation, right?

Andrew triaged this task as High priority. (Thu, Jul 30, 3:23 PM)

Change 617841 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Add Backy2 module and profile

https://gerrit.wikimedia.org/r/617841

Sorry for just dropping the message, but I thought it might be interesting.
In that reddit thread they also point out some other talks at Cephalocon that might be interesting too. This one compares several ways of doing backups: https://static.sched.com/hosted_files/cephalocon2019/58/ceph2ceph-presentation169.pdf
From: https://ceph.io/cephalocon/barcelona-2019/

Change 617841 merged by Andrew Bogott:
[operations/puppet@production] Add Backy2 module and profile

https://gerrit.wikimedia.org/r/617841

Change 618842 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] backy2: fix up some dependency issues in install

https://gerrit.wikimedia.org/r/618842

Change 618842 merged by Andrew Bogott:
[operations/puppet@production] backy2: fix up some dependency issues in install

https://gerrit.wikimedia.org/r/618842

Change 618849 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618849

Change 618853 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618853

Change 618854 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts

https://gerrit.wikimedia.org/r/618854

Change 618849 abandoned by Andrew Bogott:
[operations/puppet@production] Introduce role::wmcs::ceph::backup

Reason:

https://gerrit.wikimedia.org/r/618849

Change 618853 merged by Andrew Bogott:
[operations/puppet@production] Introduce role::wmcs::ceph::backup

https://gerrit.wikimedia.org/r/618853

Change 618854 merged by Andrew Bogott:
[operations/puppet@production] Retool cloudvirt1004 and cloudvirt1006 as ceph/backy2 test hosts

https://gerrit.wikimedia.org/r/618854

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008062058_andrew_22313.log.

Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1004.eqiad.wmnet', 'cloudvirt1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008062120_andrew_10703.log.

Completed auto-reimage of hosts:

['cloudvirt1006.eqiad.wmnet', 'cloudvirt1004.eqiad.wmnet']

Of which those FAILED:

['cloudvirt1006.eqiad.wmnet', 'cloudvirt1004.eqiad.wmnet']

Change 618875 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config

https://gerrit.wikimedia.org/r/618875

Change 618875 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add a bunch of keys needed for the ceph client config

https://gerrit.wikimedia.org/r/618875

Change 618876 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph: split out rbd client profiles

https://gerrit.wikimedia.org/r/618876

Change 618876 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph: split out rbd client profiles

https://gerrit.wikimedia.org/r/618876

Change 618878 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: remove reference to 'nova'

https://gerrit.wikimedia.org/r/618878

Change 618878 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: remove reference to 'nova'

https://gerrit.wikimedia.org/r/618878

Change 618879 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backup: remove another nova-specific ref

https://gerrit.wikimedia.org/r/618879

Change 618879 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backup: remove another nova-specific ref

https://gerrit.wikimedia.org/r/618879

Change 618995 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/dns@master] Added ipv6 addresses for cloudvirt1004 and cloudvirt1006

https://gerrit.wikimedia.org/r/618995

Change 618995 merged by Andrew Bogott:
[operations/dns@master] Added ipv6 addresses for cloudvirt1004 and cloudvirt1006

https://gerrit.wikimedia.org/r/618995

Andrew added a comment. (Fri, Aug 7, 3:01 PM)

In order to stand up the initial MySQL DB, we need to apply this by hand before running initdb:

https://github.com/wamdam/backy2/pull/32/commits/589baa5d24abe0f88a8c430d66513386d83f4b13

Bstorm added a comment. (Fri, Aug 7, 3:45 PM)

In order to stand up the initial MySQL DB, we need to apply this by hand before running initdb:

https://github.com/wamdam/backy2/pull/32/commits/589baa5d24abe0f88a8c430d66513386d83f4b13

Good grief. At least python doesn't need a recompile?
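For reference, a minimal sketch of that initdb step once the patched package is in place. The config layout in the comments follows backy2's sample backy.cfg, and the MySQL URL and credentials are placeholders rather than what is actually on the test hosts:

```
#!/usr/bin/env python3
"""Minimal sketch of the initdb step discussed above, once the upstream patch
has been applied by hand.  The MetaBackend layout and the MySQL URL are
assumptions based on backy2's sample config, not the real values in use."""

import subprocess

# /etc/backy.cfg points the metadata backend at MySQL instead of the default
# sqlite file, roughly:
#
#   [MetaBackend]
#   type: backy2.meta_backends.sql
#   engine: mysql://backy2:SECRET@localhost/backy2
#
# With the patched package installed, initdb creates the schema and a plain
# `backy2 ls` confirms the database is reachable.
subprocess.run(["backy2", "initdb"], check=True)
subprocess.run(["backy2", "ls"], check=True)
```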

Change 619011 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/backy2/ceph: add admin keyring so backy can access things

https://gerrit.wikimedia.org/r/619011

Change 619011 merged by Andrew Bogott:
[operations/puppet@production] wmcs/backy2/ceph: add admin keyring so backy can access things

https://gerrit.wikimedia.org/r/619011

Change 619350 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances

https://gerrit.wikimedia.org/r/619350

Change 619350 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy: add basic backup script, wmcs-backup-instances

https://gerrit.wikimedia.org/r/619350

Change 619486 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] wmcs/ceph/backy fix name of backup script

https://gerrit.wikimedia.org/r/619486

Change 619486 merged by Andrew Bogott:
[operations/puppet@production] wmcs/ceph/backy fix name of backup script

https://gerrit.wikimedia.org/r/619486