
Set up periodic snapshots of mysql replica disk
Closed, Resolved · Public

Description

For wbstack.com this was done in https://github.com/wbstack/deploy/tree/main/gce/snapshots. Initially we are planning to mostly copy this approach; however, rather than using these scripts, we should do it using terraform.

Those initial snapshots were only retained for a few days. We should take a snapshot daily and retain each one for 7 days. The snapshots should also be kept outside of the geographical region the cluster is running in.

A/C:

  • snapshots of the data in the running mysql replica (secondary) pod
  • defined in terraform
  • copy daily and retain them for 7 days
  • kept outside of the geographical region the cluster is running in (see the terraform sketch after this list)
  • add the snapshot configuration for staging
  • add the snapshot configuration for production
  • document disk restore procedure (n.b. it is not necessary to fully execute these steps to practice them as part of this task)
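
A minimal sketch of what such a schedule policy could look like in terraform, using the google provider's google_compute_resource_policy; the name, region, and storage location below are illustrative placeholders, not values from the actual deploy repo:

resource "google_compute_resource_policy" "sql-replica-snapshots" {
  name   = "sql-replica-daily-snapshots"  # placeholder name
  region = "europe-west3"                 # placeholder region

  snapshot_schedule_policy {
    schedule {
      daily_schedule {
        days_in_cycle = 1        # one snapshot per day
        start_time    = "03:00"  # UTC, must be on the hour
      }
    }
    retention_policy {
      max_retention_days    = 7  # keep each snapshot for 7 days
      on_source_disk_delete = "KEEP_AUTO_SNAPSHOTS"
    }
    snapshot_properties {
      # store snapshots in a multi-region outside the cluster's own region
      storage_locations = ["us"]
    }
  }
}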

Event Timeline

Way to manually generate a snapshot:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: data-sql-mariadb-primary-0
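
Assuming the manifest above is saved to a file, it can be applied with kubectl apply -f <file>, and the snapshot's readiness can be checked with kubectl get volumesnapshot. Note that this example points at the primary PVC; a snapshot of the replica would reference the replica's claim instead.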
Tarrow removed Tarrow as the assignee of this task. Feb 9 2022, 2:12 PM
Tarrow subscribed.
Deniz_WMDE moved this task from Doing to Review on the Wikibase Cloud (Launch Sprint 6 (2022)) board.
Deniz_WMDE updated the task description.
Deniz_WMDE subscribed.

PR for production: https://github.com/wmde/wbaas-deploy/pull/190

Removed the "test a restore from the snapshot" AC again because there is some unclear complexity involved.

So, having looked at and talked about this for a couple of days, my general understanding is that there are still some rough edges to this approach.

Generally, I think the module and the code look great, and the work has so far taught us a lot about how to work with these google resources and terraform. I will try to summarize some thoughts, problems, and differences with the previous implementation.

  1. Deploying this to production seems to be blocked by pre-existing disks needing to be manually recreated, as pointed out by @Deniz_WMDE here. For this to be an accepted solution we should probably still figure out why this is happening and how to work around it, and possibly document it (a simplified sketch of the resource involved follows this list).
	module.wbaas-k8s-secrets.kubernetes_secret.smtp-credentials: Refreshing state... [id=default/smtp-credentials]
	╷
	│ Error: ForceNew: No changes for spec.0.persistent_volume_source.0.csi.0.volume_attributes.storage.kubernetes.io/csiProvisionerIdentity
	│ 
	│   with module.wbaas3-disks.kubernetes_persistent_volume.sql-replica,
	│   on ../../modules/disks/disk-sql-replica.tf line 13, in resource "kubernetes_persistent_volume" "sql-replica":
	│   13: resource "kubernetes_persistent_volume" "sql-replica" {
	│ 
	╵
  1. The "terraformation" of this feature seems to have put us in a place where the recovery process is unclear, which is not ideal. Are we supposed to use terraform for that too or do we use the previously proven to work instructions on the wbstack deploy repo? In the previous snapshot implementation setting up snapshots was based on running two bash scripts or manually using the ui to configure them once, and it seems that the recovery process is to some extent based on the same tooling (and some abuse of the replicaCount from the mariadb chart to provision new storage). With the provisioning of disks from terraform would it make this process even more complex?
  3. It is still unclear to me (and a worry made stronger by point 1) whether or not this would work in the scenario where we need to re-create the whole cluster, or if we would hit some bumps in the road that again require manually swapping disks or something similar. We haven't tested that, and would probably not find out until we try it, because the dependency on google_compute_disk prevents this from being tested locally. I could be wrong, but I think one of the strongest arguments for moving configuration to terraform is to reduce the need for manual steps. This could, however, to some extent be tested locally: provision PVs using terraform the same way it's now done on staging and confirm that this works with setting up the database; hooking up the google layer with google_compute_disks and snapshot policies could then be something we only do on staging/production. Then again, there should be nothing stopping us from just dropping staging and recreating it from scratch if we think this is an important aspect to test.
  4. This implementation has made the dependency on google even stronger in code. Seeing that google is our service provider, this on its own shouldn't be a reason not to do it, but it will make it harder for anyone to re-use the code on another k8s cluster. The nice thing about the previous implementation is that it was more or less disconnected from the k8s/terraform configuration and could be set up and restored from using mostly google tools. This dependency in terraform also has the drawback of drifting the minikube setup and the live environments further apart.
  5. On several occasions during development we've either accidentally deleted the disk or been prompted that the disk would be deleted by a single deploy command. This is a new risk that the previous implementation didn't have; AFAIK there is nothing destructive about manually setting up snapshots of google_compute_disks. Then again, we aren't taking snapshots of the primary storage, so losing the secondary during development or once during initial setup is more a hassle than a risk of losing actual data.
  6. The recovery process hasn't been tested in its current form and is also not documented in the wbaas-deploy repo.
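
For context, the kubernetes_persistent_volume that the error in point 1 refers to is roughly of this shape (a simplified sketch, not the actual module code; names and capacity are illustrative):

resource "kubernetes_persistent_volume" "sql-replica" {
  metadata {
    name = "sql-replica"
  }
  spec {
    capacity = {
      storage = "10Gi"  # illustrative size
    }
    access_modes = ["ReadWriteOnce"]
    persistent_volume_source {
      csi {
        driver = "pd.csi.storage.gke.io"
        # hands a pre-provisioned google_compute_disk (assumed to be defined
        # elsewhere in the module) to the GCE PD CSI driver
        volume_handle = google_compute_disk.sql-replica.id
      }
    }
  }
}

The csiProvisionerIdentity attribute named in the error is set on the volume at runtime by the CSI provisioner, which is presumably why terraform sees a diff it can only reconcile by recreating the volume.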

Thanks for the summary @toan. To complete this task in a useful way which doesn't add too much complexity and uncertainty, I propose going back to using the terraform data objects again (lookups instead of resource definitions), which should address most of these points. In the future I think we could evaluate again whether using k8s snapshots would be a nicer approach than the cloud provider snapshots, to achieve less dependency and keep the local setup more in line with staging/prod.
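
For illustration, the difference boils down to referencing the existing disk via a data source rather than managing it as a resource; the disk name and zone here are placeholders:

data "google_compute_disk" "sql-replica" {
  # look up the already-provisioned disk instead of owning its lifecycle
  name = "wbaas-sql-replica"  # placeholder: the actual persistent disk name
  zone = "europe-west3-a"     # placeholder zone
}

Because terraform then only reads the disk, a plan/apply can no longer schedule it for deletion, which addresses the risk described in point 5 above.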

The previously suggested approach was also quite the rabbit hole; therefore there is now just the snapshot policy in terraform and a bash script to attach it to the right disk in this PR: https://github.com/wmde/wbaas-deploy/pull/194

Disk snapshots are now set up for the SQL replica in staging and production. The schedule policy lives in terraform and, for now, the attachment happens manually, but there is a helper script which tells you what to run: https://github.com/wmde/wbaas-deploy/blob/main/doc/disk-snapshots.md
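
For reference, the manual attachment step the helper script prints is the standard gcloud call, along the lines of: gcloud compute disks add-resource-policies <disk-name> --resource-policies=<policy-name> --zone=<zone>, with the placeholders depending on the environment.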

toan updated the task description.
WMDE-leszek claimed this task.