
🟦️ Fix logical backup cronjob
Closed, Resolved · Public

Description

Currently the logical backup cronjob fails to run. This needs investigation.

We noticed that the pod doing these backups probably requests a lot more resources than needed, so maybe this PR could help: https://github.com/wbstack/charts/pull/80

It fails on staging:
https://console.cloud.google.com/kubernetes/cronjob/europe-west3-a/wbaas-2/default/sql-logic-backup/details?project=wikibase-cloud

For production it seems to run, but the backups are empty? https://console.cloud.google.com/storage/browser/wikibase-cloud-sql-backup;tab=objects?forceOnBucketsSortingFiltering=false&project=wikibase-cloud&prefix=&forceOnObjectsSortingFiltering=false

ACs:

  • logical backups run again daily on production
  • logical backups run again daily on staging

Event Timeline

For production the pod seems to get evicted because ephemeral-storage is not specified in the resources we request while the job is running.
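A minimal sketch of what adding ephemeral-storage to the container resources could look like; the container name, image, and sizes here are assumptions for illustration, not the actual values in the wbstack/charts templates:

```
# Hypothetical container spec for the backup CronJob (names and sizes are illustrative only).
containers:
  - name: sql-logic-backup
    image: example/wbaas-backup:latest   # placeholder image
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
        ephemeral-storage: "20Gi"   # without a request here, heavy scratch usage can get the pod evicted
      limits:
        cpu: "1"
        memory: "1Gi"
        ephemeral-storage: "25Gi"
```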

The site now has some data, so it's already up to 18 GB it seems.

On staging they seem to run fine.

This morning they failed with `0/4 nodes are available: 4 Insufficient cpu.` Seems we should alter this too.

Updated https://github.com/wbstack/charts/pull/84 to specify a request, as the docs say:

> If you specify a limit for a resource, but do not specify any request, and no admission-time mechanism has applied a default request for that resource, then Kubernetes copies the limit you specified and uses it as the requested value for the resource.
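For example, an explicit CPU request that is lower than the limit might look like the sketch below; the numbers are illustrative, not the values in the PR:

```
# Illustrative only: an explicit cpu request smaller than the limit,
# so the scheduler only needs a node with 500m free rather than 2 full CPUs.
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2"
```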

Back to the drawing board. This is now not scheduled because there is no ephemeral storage around, yay.

Probably need to set up some volume claim as described here:

https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/
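A generic ephemeral volume roughly as described on that page could look like the sketch below; the volume name, mount path, and size are assumptions for illustration:

```
# Sketch of a generic ephemeral volume used as scratch space (names/sizes assumed).
spec:
  containers:
    - name: sql-logic-backup
      volumeMounts:
        - name: backup-scratch
          mountPath: /backups
  volumes:
    - name: backup-scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi   # sized for the ~18 GB dump plus headroom
```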

So, a few new patches:

  • https://github.com/wbstack/charts/pull/87 mounts a volume to be used as scratch space for taking backups
  • https://github.com/wmde/wbaas-backup/pull/15 moves the temp work dir and output to be under /backups

This is however based on the validation added in https://phabricator.wikimedia.org/T306493, which should get reviewed first.

OK, to summarize what the problem was and what is proposed:

As the migration started, disk usage increased and the ephemeral storage provided by the nodes wasn't enough for a long-term solution.

Therefore we need to either give it more space or reduce the disk usage of the logical backups (which could be done by writing straight to the bucket, but the existing validation tasks are going in the opposite direction: T306493: 🟦️ Add validation of backups to wbaas-backup, validate existence of temp files etc.).

Some initial attempts were made using the ephemeral storage provided by the nodes, but this produced different behavior depending on whether the restorePod was running or the CronJob was using the same configuration. For the restore pod a new PV was created for the time with the requested size; for the CronJob this was not the case and the size was limited to what the nodes could offer. Relying on the storage of the nodes felt like a less than ideal solution overall, as we could be a bit more explicit about where this storage comes from.

The proposed solution is:

  1. Move all temporary writing to happen under /backups and add a cleanup script that removes temporary files after a backup is taken. https://github.com/wmde/wbaas-backup/pull/15
  2. Add a Terraform-managed compute disk that gets mounted by the job/restore-pod under /backups, and do all the temp storage there. https://github.com/wmde/wbaas-deploy/pull/283
  3. Cut a new chart for wbaas-backup that uses this new version: https://github.com/wbstack/charts/pull/87 (this still requires an image bump + a new image tag)
  4. Use the new chart on staging https://github.com/wmde/wbaas-deploy/pull/284
Tarrow subscribed.

So, after review, @Tarrow fortunately wanted to give the ephemeral storage approach another go and found that this difference in behavior seems to come from the way manual jobs are scheduled. It seems it might work after all, but it depends on running the manual jobs from kubectl rather than the Google UI.

image.png (855×1 px, 172 KB)

The left-most YAML is created by running a manual job from within the Google UI, and this seems to just ignore the ephemeral storage part and instead use the emptyDir type.

The right-most YAML is created when running a manual job using kubectl: `kubectl create job --from=cronjob/sql-logic-backup sql-logic-backup-manual-01`
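In other words, the difference is roughly the following; this is a simplified sketch of the two volume stanzas, not the exact YAML from the screenshot:

```
# Simplified sketch, not the exact YAML from the screenshot.
# Job created from the Google Cloud console UI: the ephemeral volume
# definition appears to be replaced by a plain emptyDir on the node.
volumes:
  - name: backup-scratch
    emptyDir: {}
---
# Job created with `kubectl create job --from=cronjob/sql-logic-backup ...`:
# the generic ephemeral volume from the CronJob template is kept.
volumes:
  - name: backup-scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```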

So, let's skip the Terraform-managed disk and use this, plus update the docs to not use the UI. It might also be worth confirming that actual CronJobs in GKE also get created correctly when scheduled.

> might also be worth confirming that actual CronJobs in GKE also gets created correctly when scheduled.

CronJobs triggered by the schedule also seem to work correctly; this behavior only comes from the UI. Boo!

The proposed solution is now:

  1. Move all temporary writing to happen under /backups and add a cleanup script that removes temporary files after a backup is taken. Merged!
  2. Cut a new chart for wbaas-backup that uses this new version with a generic ephemeral volume. Merged!
  3. Use the new chart on staging/local https://github.com/wmde/wbaas-deploy/pull/284
WMDE-leszek renamed this task from Fix logical backup cronjob to 🟦️ Fix logical backup cronjob. May 17 2022, 7:17 PM

This was just deployed to production, and a manual backup was taken. The output can be validated by looking in the wikibase-cloud-sql-backup bucket under Cloud Storage.

image.png (220×1 px, 30 KB)

We might need to size up the scratch disk space again pretty soon though, depending on how big the data from the next migration batches will be.