
Disaster recovery for k8s upgrade
Closed, Resolved · Public · 5 Estimated Story Points

Description

We want to upgrade k8s, and we want to do our best to ensure we don't clobber the environments running on patchdemo. We should make a disaster recovery plan.

Plan on this task should include:

  • What data needs to be backed up? (snapshots of volumes?)
    • Do the backup
  • Write up how we will use the backups if needed

Details

Related Changes in GitLab:
  • back up K3s cluster data every night at 3:30 UTC (repos/test-platform/catalyst/catalyst-tofu!37, author jnuche, T419580 → main)
  • Preliminary steps for the creation of K3s data volumes backups (repos/test-platform/catalyst/catalyst-tofu!36, author jnuche, T419580 → main)

Event Timeline

We know from T405224 that we can recover the entire cluster from the K3s data volumes. Additionally, the folks over at Cloud have told us in the past that it's not possible to automate/schedule the creation of OpenStack snapshots natively.

My proposal is:

  • Increase disk quota capacity to allow us to create backup volumes for all data volumes
  • Scheduled backups: Create a systemd timer that rsyncs from the data volumes to the backup volumes every night
  • Recovery: Rsync back from backup volumes to data volumes

As usual, the timer will have a separate service unit that does the actual work. That service can also be run as a one-shot to create a backup on demand, e.g. right before the K8s cluster upgrade we're planning.
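A minimal sketch of what the unit pair could look like. Unit names, mount points, and rsync flags here are illustrative assumptions; the real definitions live in the catalyst-tofu MRs listed above:

```
# k3s-backup.service -- hypothetical one-shot unit doing the actual copy
[Unit]
Description=Back up the K3s data volume to the backup volume

[Service]
Type=oneshot
# -a archive mode; -H/-A/-X preserve hard links, ACLs and xattrs;
# --delete keeps the backup an exact mirror of the data volume
ExecStart=/usr/bin/rsync -aHAX --delete /mnt/k3s-data/ /mnt/k3s-backup/

# k3s-backup.timer -- fires nightly at 03:30 UTC
[Unit]
Description=Nightly K3s data volume backup

[Timer]
OnCalendar=*-*-* 03:30:00 UTC
Persistent=true

[Install]
WantedBy=timers.target
```

With this in place, an on-demand backup before the upgrade is just a manual `systemctl start k3s-backup.service`.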

Looks like we'll need an extra 350G of volume space to do this, plus the tofu work to create the systemd timers. Let's double-check the volume space and request the change when we're ready.

jnuche set the point value for this task to 5.
jnuche moved this task from Backlog to In progress on the Catalyst (Luka Ijo Pimeja Jan) board.

Disk quotas for both catalyst and catalyst-dev will need to be raised by 320GB, from the current 1200GB to a total of 1520GB.
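For reference, a sketch of the equivalent OpenStack CLI calls. These are illustrative: on Cloud VPS the increase is normally a quota request to the Cloud team rather than something we run ourselves, and the set command needs admin rights:

```
# Check the current block-storage quota and usage for a project
openstack quota show catalyst

# Raise the total volume capacity to 1520GB (admin only)
openstack quota set --gigabytes 1520 catalyst
```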

jnuche moved this task from In progress to Done on the Catalyst (Luka Ijo Pimeja Jan) board.

Tested on catalyst-dev:

  • Created a new pod "A"
  • Used the new service to create backups on all hosts
  • Created a new pod "B"
  • Messed up the cluster by deleting several cluster directories on two of the nodes, including /mnt/k3s-data/k3s/data and /mnt/k3s-data/k3s/server on the primary host
  • Stopped all K3s systemd services across the cluster
  • Rsync'd back from the backups on all hosts
  • Restarted systemd services
  • Cluster is healthy: pod "A" is back and pod "B" is gone, as expected, since the backup was taken before "B" was created
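For the record, a sketch of the restore procedure used in the test above. Service names are the K3s defaults and the data volume mount point matches the test; the backup mount point is an assumption:

```
# On every node: stop K3s before touching the data volume
sudo systemctl stop k3s          # on the server node
sudo systemctl stop k3s-agent    # on the agent nodes

# On every node: restore the data volume from its backup volume
sudo rsync -aHAX --delete /mnt/k3s-backup/ /mnt/k3s-data/

# Restart, server first so the agents can rejoin
sudo systemctl start k3s         # server node
sudo systemctl start k3s-agent   # agent nodes
```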