
🟦️ Fix logical backup cronjob
Closed, Resolved · Public

Description

Currently the logical backup cronjob fails to run. This needs investigation.

We noticed that the pod doing these backups probably requests a lot more resources than needed, so maybe this PR could help: https://github.com/wbstack/charts/pull/80

It fails on staging:
https://console.cloud.google.com/kubernetes/cronjob/europe-west3-a/wbaas-2/default/sql-logic-backup/details?project=wikibase-cloud

For production it seems to run, but the backups are empty? https://console.cloud.google.com/storage/browser/wikibase-cloud-sql-backup;tab=objects?forceOnBucketsSortingFiltering=false&project=wikibase-cloud&prefix=&forceOnObjectsSortingFiltering=false

ACs:

  • logical backups run again daily on production
  • logical backups run again daily on staging

Event Timeline

For production the pod seems to get evicted because ephemeral-storage is not specified in the resources we request while the job is running.
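A minimal sketch of what adding ephemeral-storage to the container resources could look like; the container name, image, and sizes here are assumptions for illustration, not the actual values in the wbstack/charts templates:

```
# Hypothetical container spec for the backup CronJob (names and sizes are illustrative only).
containers:
  - name: sql-logic-backup
    image: example/wbaas-backup:latest   # placeholder image
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
        ephemeral-storage: "20Gi"   # without a request here, heavy scratch usage can get the pod evicted
      limits:
        cpu: "1"
        memory: "1Gi"
        ephemeral-storage: "25Gi"
```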

The site now has some data, so it's already up to 18 GB it seems.

On staging they seem to run fine.

This morning they failed with `0/4 nodes are available: 4 Insufficient cpu.` Seems we should alter this too.

Updated https://github.com/wbstack/charts/pull/84 to specify a request, as the docs say:

> If you specify a limit for a resource, but do not specify any request, and no admission-time mechanism has applied a default request for that resource, then Kubernetes copies the limit you specified and uses it as the requested value for the resource.
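For example, an explicit CPU request that is lower than the limit might look like the sketch below; the numbers are illustrative, not the values in the PR:

```
# Illustrative only: an explicit cpu request smaller than the limit,
# so the scheduler only needs a node with 500m free rather than 2 full CPUs.
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2"
```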

Back to the drawing board. This is now not scheduled because there is no ephemeral storage around, yay.

Probably need to set up some volume claim as described here:

https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/
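A generic ephemeral volume roughly as described on that page could look like the sketch below; the volume name, mount path, and size are assumptions for illustration:

```
# Sketch of a generic ephemeral volume used as scratch space (names/sizes assumed).
spec:
  containers:
    - name: sql-logic-backup
      volumeMounts:
        - name: backup-scratch
          mountPath: /backups
  volumes:
    - name: backup-scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 50Gi   # sized for the ~18 GB dump plus headroom
```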

So, a few new patches:

  • https://github.com/wbstack/charts/pull/87 mounts a volume to be used as scratch space for taking backups
  • https://github.com/wmde/wbaas-backup/pull/15 moves the temp work dir and output to be under /backups

This is however based on the validation added in https://phabricator.wikimedia.org/T306493, which should get reviewed first.

OK, to summarize what the problem was and what is proposed:

As the migration started, disk usage increased and the ephemeral storage provided by the nodes wasn't enough for a long-term solution.

Therefore we need to either give it more space or reduce the disk usage of the logical backups (which could be done by writing straight to the bucket, but the existing validation tasks are going in the opposite direction: T306493: 🟦️ Add validation of backups to wbaas-backup, validate existence of temp files etc.).

Some initial attempts were made using the ephemeral storage provided by the nodes, but this produced different behavior depending on whether the restorePod was running or the CronJob was using the same configuration. For the restore pod a new PV was created for the time with the requested size; for the CronJob this was not the case and the size was limited to what the nodes could offer. Relying on the storage of the nodes felt like a less than ideal solution overall, as we could be a bit more explicit about where this storage comes from.

The proposed solution is:

  1. Move all temporary writing to happen under /backups and add a cleanup script that removes temporary files after a backup is taken. https://github.com/wmde/wbaas-backup/pull/15
  2. Add a Terraform-managed compute disk that gets mounted by the job/restore-pod under /backups, and do all the temp storage there. https://github.com/wmde/wbaas-deploy/pull/283
  3. Cut a new chart for wbaas-backup that uses this new version: https://github.com/wbstack/charts/pull/87 (this still requires an image bump + a new image tag)
  4. Use the new chart on staging https://github.com/wmde/wbaas-deploy/pull/284
Tarrow subscribed.

So, after review, @Tarrow fortunately wanted to give the ephemeral storage approach another go and found that this difference in behavior seems to come from the way manual jobs are scheduled. It seems it might work after all, but it depends on running the manual jobs from kubectl rather than the Google UI.

image.png (855×1 px, 172 KB)

The left-most YAML is created by running a manual job from within the Google UI, and this seems to just ignore the ephemeral storage part and instead use the emptyDir type.

The right-most YAML is created when running a manual job using kubectl: `kubectl create job --from=cronjob/sql-logic-backup sql-logic-backup-manual-01`
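In other words, the difference is roughly the following; this is a simplified sketch of the two volume stanzas, not the exact YAML from the screenshot:

```
# Simplified sketch, not the exact YAML from the screenshot.
# Job created from the Google Cloud console UI: the ephemeral volume
# definition appears to be replaced by a plain emptyDir on the node.
volumes:
  - name: backup-scratch
    emptyDir: {}
---
# Job created with `kubectl create job --from=cronjob/sql-logic-backup ...`:
# the generic ephemeral volume from the CronJob template is kept.
volumes:
  - name: backup-scratch
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```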

So, let's skip the Terraform-managed disk and use this, plus update the docs to not use the UI. It might also be worth confirming that actual CronJobs in GKE also get created correctly when scheduled.

> might also be worth confirming that actual CronJobs in GKE also gets created correctly when scheduled.

CronJobs triggered by the schedule also seem to work correctly; this behavior only comes from the UI. Boo!

The proposed solution is now:

  1. Move all temporary writing to happen under /backups and add a cleanup script that removes temporary files after a backup is taken. Merged!
  2. Cut a new chart for wbaas-backup that uses this new version with a generic ephemeral volume. Merged!
  3. Use the new chart on staging/local https://github.com/wmde/wbaas-deploy/pull/284
WMDE-leszek renamed this task from Fix logical backup cronjob to 🟦️ Fix logical backup cronjob. May 17 2022, 7:17 PM

This was just deployed to production, and a manual backup was taken. The output can be validated by looking in the wikibase-cloud-sql-backup bucket under Cloud Storage.

image.png (220×1 px, 30 KB)

We might need to size up the scratch disk space again pretty soon though, depending on how big the data from the next migration batches will be.