
Set up periodic snapshots of mysql replica disk
Closed, Resolved · Public

Description

For wbstack.com this was done in https://github.com/wbstack/deploy/tree/main/gce/snapshots. Initially we are planning to mostly copy this approach; however, rather than using these scripts, we should do it using terraform.

Those initial snapshots were only retained for a few days. We should take a snapshot daily and retain each one for 7 days. The snapshots should also be kept outside of the geographical region the cluster is running in.

A/C:

  • snapshots of the data in the running mysql replica (secondary) pod
  • defined in terraform
  • copy daily and retain them for 7 days
  • kept outside of the geographical region the cluster is running in (see the terraform sketch after this list)
  • add the snapshot configuration for staging
  • add the snapshot configuration for production
  • document disk restore procedure (n.b. it is not necessary to fully execute these steps to practice them as part of this task)
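
A minimal sketch of what such a schedule policy could look like in terraform, using the google provider's google_compute_resource_policy; the name, region, and storage location below are illustrative placeholders, not values from the actual deploy repo:

resource "google_compute_resource_policy" "sql-replica-snapshots" {
  name   = "sql-replica-daily-snapshots"  # placeholder name
  region = "europe-west3"                 # placeholder region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1        # one snapshot per day
        start_time    = "03:00"  # UTC, must be on the hour
      }
    }
    retention_policy {
      max_retention_days    = 7  # keep each snapshot for 7 days
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
    snapshot_properties {
      # store snapshots in a multi-region outside the cluster's own region
      storage_locations = ["us"]
    }
  }
}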

Event Timeline

Way to manually generate a snapshot:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: data-sql-mariadb-primary-0
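
Assuming the manifest above is saved to a file, it can be applied with kubectl apply -f <file>, and the snapshot's readiness can be checked with kubectl get volumesnapshot. Note that this example points at the primary PVC; a snapshot of the replica would reference the replica's claim instead.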
Tarrow removed Tarrow as the assignee of this task. Feb 9 2022, 2:12 PM
Tarrow subscribed.
Deniz_WMDE moved this task from Doing to Review on the Wikibase Cloud (Launch Sprint 6 (2022)) board.
Deniz_WMDE updated the task description.
Deniz_WMDE subscribed.

PR for production: https://github.com/wmde/wbaas-deploy/pull/190

Removed the "test a restore from the snapshot" AC again because there is some unclear complexity involved.

So, having looked at and talked about this for a couple of days, my general understanding is that there are still some rough edges to this approach.

Generally, I think the module and the code look great, and the work has so far taught us a lot about how to work with these google resources and terraform. I will try to summarize some thoughts, problems, and differences with the previous implementation.

  1. Deploying this to production seems to be blocked by pre-existing disks needing to be manually recreated, as pointed out by @Deniz_WMDE here. For this to be an accepted solution we should probably still figure out why this is happening and how to work around it, and possibly document it (a simplified sketch of the resource involved follows this list).
	module.wbaas-k8s-secrets.kubernetes_secret.smtp-credentials: Refreshing state... [id=default/smtp-credentials]
	╷
	│ Error: ForceNew: No changes for spec.0.persistent_volume_source.0.csi.0.volume_attributes.storage.kubernetes.io/csiProvisionerIdentity
	│ 
	│   with module.wbaas3-disks.kubernetes_persistent_volume.sql-replica,
	│   on ../../modules/disks/disk-sql-replica.tf line 13, in resource "kubernetes_persistent_volume" "sql-replica":
	│   13: resource "kubernetes_persistent_volume" "sql-replica" {
	│ 
	╵
  1. The "terraformation" of this feature seems to have put us in a place where the recovery process is unclear, which is not ideal. Are we supposed to use terraform for that too or do we use the previously proven to work instructions on the wbstack deploy repo? In the previous snapshot implementation setting up snapshots was based on running two bash scripts or manually using the ui to configure them once, and it seems that the recovery process is to some extent based on the same tooling (and some abuse of the replicaCount from the mariadb chart to provision new storage). With the provisioning of disks from terraform would it make this process even more complex?
  3. It is still unclear to me (and a worry made stronger by point 1) whether or not this would work in the scenario where we need to re-create the whole cluster, or if we would hit some bumps in the road that again require manually swapping disks or something similar. We haven't tested that, and would probably not find out until we try it, because the dependency on google_compute_disk prevents this from being tested locally. I could be wrong, but I think one of the strongest arguments for moving configuration to terraform is to reduce the need for manual steps. This could, however, to some extent be tested locally: provision PVs using terraform the same way it's now done on staging and confirm that this works with setting up the database; hooking up the google layer with google_compute_disks and snapshot policies could then be something we only do on staging/production. Then again, there should be nothing stopping us from just dropping staging and recreating it from scratch if we think this is an important aspect to test.
  4. This implementation has made the dependency on google even stronger in code. Seeing that google is our service provider, this on its own shouldn't be a reason not to do it, but it will make it harder for anyone to re-use the code on another k8s cluster. The nice thing about the previous implementation is that it was more or less disconnected from the k8s/terraform configuration and could be set up and restored from using mostly google tools. This dependency in terraform also has the drawback of drifting the minikube setup and the live environments further apart.
  5. On several occasions during development we've either accidentally deleted the disk or been prompted that the disk would be deleted by a single deploy command. This is a new risk that the previous implementation didn't have; AFAIK there is nothing destructive about manually setting up snapshots of google_compute_disks. Then again, we aren't taking snapshots of the primary storage, so losing the secondary during development or once during initial setup is more a hassle than a risk of losing actual data.
  6. The recovery process hasn't been tested in its current form and is also not documented in the wbaas-deploy repo.
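
For context, the kubernetes_persistent_volume that the error in point 1 refers to is roughly of this shape (a simplified sketch, not the actual module code; names and capacity are illustrative):

resource "kubernetes_persistent_volume" "sql-replica" {
  metadata {
    name = "sql-replica"
  }
  spec {
    capacity = {
      storage = "10Gi"  # illustrative size
    }
    access_modes = ["ReadWriteOnce"]
    persistent_volume_source {
      csi {
        driver = "pd.csi.storage.gke.io"
        # hands a pre-provisioned google_compute_disk (assumed to be defined
        # elsewhere in the module) to the GCE PD CSI driver
        volume_handle = google_compute_disk.sql-replica.id
      }
    }
  }
}

The csiProvisionerIdentity attribute named in the error is set on the volume at runtime by the CSI provisioner, which is presumably why terraform sees a diff it can only reconcile by recreating the volume.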

Thanks for the summary @toan. To complete this task in a useful way which doesn't add too much complexity and uncertainty, I propose going back to using the terraform data objects again (lookups instead of resource definitions), which should address most of these points. In the future I think we could evaluate again whether using k8s snapshots would be a nicer approach than the cloud provider snapshots, to achieve less dependency and keep the local setup more in line with staging/prod.
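
For illustration, the difference boils down to referencing the existing disk via a data source rather than managing it as a resource; the disk name and zone here are placeholders:

data "google_compute_disk" "sql-replica" {
  # look up the already-provisioned disk instead of owning its lifecycle
  name = "wbaas-sql-replica"  # placeholder: the actual persistent disk name
  zone = "europe-west3-a"     # placeholder zone
}

Because terraform then only reads the disk, a plan/apply can no longer schedule it for deletion, which addresses the risk described in point 5 above.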

The previously suggested approach was also quite the rabbit hole; therefore there is now just the snapshot policy in terraform and a bash script to attach it to the right disk in this PR: https://github.com/wmde/wbaas-deploy/pull/194

Disk snapshots are now set up for the SQL replica in staging and production. The schedule policy lives in terraform and, for now, the attachment happens manually, but there is a helper script which tells you what to run: https://github.com/wmde/wbaas-deploy/blob/main/doc/disk-snapshots.md
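
For reference, the manual attachment step the helper script prints is the standard gcloud call, along the lines of: gcloud compute disks add-resource-policies <disk-name> --resource-policies=<policy-name> --zone=<zone>, with the placeholders depending on the environment.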

toan updated the task description.
WMDE-leszek claimed this task.