
Logical backups keep failing
Closed, Resolved · Public

Assigned To
Authored By
Deniz_WMDE
Oct 4 2023, 2:21 PM

Description

Today I noticed that the logical backups keep failing again on production; the backup buckets are empty because of the 7-day retention policy.

  • The last sql-logic-backup pods that completed and are still around on the prod cluster are from Sep 19th/20th
  • The oldest failing one that is still around is from Oct 3rd, the most recent one from today (Oct 4th)

I suspect that the currently available ephemeral storage is yet again too little, since the backup keeps growing in size (currently 26.78 GB out of 63.16 GB are allocatable).

ACs:

  • Logical SQL backups work again
    • E-Mail alerting for failed logical backups is in place
    • E-Mail alerting for logical backup disk space below threshold

PRs:

Event Timeline

Logs of the failing pods look like this:

kubectl logs sql-logic-backup-28271520-sk5df
/backups/tmp/backup-2023-10-03_000018/metadata exists!
/backups/tmp/backup-2023-10-03_000018/apidb.users.sql exists!
/backups/tmp/backup-2023-10-03_000018/apidb.users-schema.sql exists!
/backups/tmp/backup-2023-10-03_000018/apidb.wikis.sql exists!
/backups/tmp/backup-2023-10-03_000018/apidb.wikis-schema.sql exists!
/backups/tmp/backup-2023-10-03_000018/mysql.db.sql exists!
/backups/tmp/backup-2023-10-03_000018/mysql.db-schema.sql exists!
error writing output file
tar: -: Cannot write: Broken pipe
tar: Child returned status 141
tar: Error is not recoverable: exiting now

The third-to-last line sometimes differs, showing a message like tar: -: Wrote only 4096 of 10240 bytes (sometimes 2048 or 8192 instead of 4096).

I wrote an ADR to address the situation we found ourselves in again, where we don't have any logical SQL backups because each file gets deleted after 7 days, no matter what. We could easily get around that with some minor changes.

Scratched this, since it would allow corrupt backups to pile up in the versioning. We could think about increasing the backup retention period, though.

Increasing the scratch disk size worked out: https://github.com/wmde/wbaas-deploy/pull/1181

I looked at the current disk space that is used: 41 GB of SQL files get compressed into a 25 GB encrypted archive (a compression ratio of roughly 60%). Since the raw dump and the encrypted archive sit on the scratch disk at the same time, this requires at least 66 GB, which explains why 64 GB was not enough.

For some reason no incidents and thus no alerts were triggered, even though we can see here that all conditions were met. (link)

image.png (1×1 px, 93 KB)

I tested the current alert configuration by briefly setting the duration of the condition_absent in terraform to 3600s (1 hour) - this immediately triggered several incidents and alert mails. This leaves me completely puzzled as to why this did not work for production in the last few weeks.
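
For reference, this is roughly the shape of the terraform involved — a minimal sketch only, with illustrative resource names, filter, and aggregation settings, not the exact production config from wbaas-deploy:

resource "google_monitoring_alert_policy" "sql_logic_backup_missing" {
  display_name = "sql-logic-backup missing"   # illustrative name
  combiner     = "OR"

  conditions {
    display_name = "No data from sql-logic-backup pods"

    condition_absent {
      # Illustrative filter: backup pods report this metric while they run,
      # so an absent series means no backup ran within `duration`.
      filter   = "metric.type=\"kubernetes.io/pod/volume/utilization\" resource.type=\"k8s_pod\" resource.label.\"cluster_name\"=\"wbaas-2\""
      duration = "3600s"   # 1 hour, the test value described above

      aggregations {
        alignment_period   = "600s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = []   # e-mail channels omitted in this sketch
}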

This PR introduces a new alert policy to give us a heads-up in case the backup scratch volume runs low on disk space: https://github.com/wmde/wbaas-deploy/pull/1185

As of right now it can't be applied via terraform because of one of the filter rules used: resource.label."pod_name" = starts_with("sql-logic-backup-"). In the GCP Metrics Explorer it looks like it's looking for a pod with the literal name starts_with("sql-logic-backup-"), which is of course not the intention (note that this filter was taken from the docs).
Building the filter manually in the Metrics Explorer with a regex equals comparison (=~) and switching over to the MQL view, we see it reads resource.pod_name =~ 'sql-logic-backup-.*'. But trying to apply this via terraform throws an error in the GCP API, which is not happy with the =~ operator:

│ Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "metric.type="kubernetes.io/pod/volume/utilization"
│ resource.type="k8s_pod"
│ resource.label."cluster_name"="wbaas-2"
│ resource.pod_name=~'sql-logic-backup-.*'
│ ": Could not parse filter "metric.type=\"kubernetes.io/pod/volume/utilization\"\nresource.type=\"k8s_pod\"\nresource.label.\"cluster_name\"=\"wbaas-2\"\nresource.pod_name=~\'sql-logic-backup-.*\'\n"; syntax error at line 4, column 18, token '=~'

Today I managed to get the new alert working. Part of my struggles can only be explained by oversights and the complexities of testing these policies. In retrospect, even this commit from two days ago seems to have worked.


About the missing incidents for the "failing backups" alerting policy, I think I could reproduce the issue by suspending the cronjob yesterday (I resumed it today & manually triggered a backup), as can be seen here:

image.png (1×2 px, 104 KB)

Notice the last two data points: a scheduled backup that finished yesterday morning at 3 AM vs. the one I triggered manually, which finished at 12 PM. The time difference between those two exceeds 24h 30m (or 88200s), but no incident was created.

When looking at this policy in the GCP UI, we can see that the trigger absence time is set to 5 minutes. Editing this field and trying to input 88200 seconds brings up a warning telling us that the maximum value is 1 day:

image.png (693×826 px, 80 KB)

This is still confusing, because with a 5-minute absence window the alerting should have triggered more than ever, but it didn't. I'll set this to 1 day and repeat the test described above.

Deniz_WMDE updated the task description.

Setting the duration to 86400s (or 24h) worked, it triggered an alert and incident in Google Cloud Monitoring.

https://github.com/wmde/wbaas-deploy/pull/1185

Deniz_WMDE unsubscribed.
Deniz_WMDE subscribed.
Evelien_WMDE claimed this task.