
Logical backups keep failing
Closed, Resolved · Public

Assigned To
Authored By
Deniz_WMDE
Oct 4 2023, 2:21 PM

Description

Today I noticed that the logical backups keep failing again on production; the backup buckets are empty because of the 7-day retention policy.

  • The last sql-logic-backup pods that completed and are still around on the prod cluster are from Sep 19th/20th
  • The oldest failing one that is still around is from Oct 3rd, the most recent one from today (Oct 4th)

I suspect that the currently available ephemeral storage is yet again too little, since the backup keeps growing in size (currently 26.78 GB out of 63.16 GB are allocatable).

ACs:

  • Logical SQL backups work again
    • E-Mail alerting for failed logical backups is in place
    • E-Mail alerting for logical backup disk space below threshold

PRs:

Event Timeline

Logs of the failing pods look like this:

kubectl logs sql-logic-backup-28271520-sk5df
/backups/tmp/backup-2023-10-03_000018/metadata exists!
/backups/tmp/backup-2023-10-03_000018/apidb.users.sql exists!
/backups/tmp/backup-2023-10-03_000018/apidb.users-schema.sql exists!
/backups/tmp/backup-2023-10-03_000018/apidb.wikis.sql exists!
/backups/tmp/backup-2023-10-03_000018/apidb.wikis-schema.sql exists!
/backups/tmp/backup-2023-10-03_000018/mysql.db.sql exists!
/backups/tmp/backup-2023-10-03_000018/mysql.db-schema.sql exists!
error writing output file
tar: -: Cannot write: Broken pipe
tar: Child returned status 141
tar: Error is not recoverable: exiting now

The third-to-last line sometimes differs, showing a message like tar: -: Wrote only 4096 of 10240 bytes (sometimes 2048 or 8192 instead of 4096).

I wrote an ADR to address the situation we found ourselves in again, where we don't have any logical SQL backups because each file gets deleted after 7 days, no matter what. We could easily get around that with some minor changes.

Scratched this, since it would allow corrupt backups to pile up in the versioning. We could think about increasing the backup retention period, though.

Increasing the scratch disk size worked out: https://github.com/wmde/wbaas-deploy/pull/1181

I looked at the current disk space that is used: 41 GB of SQL files get compressed into a 25 GB encrypted archive (a compression ratio of roughly 60%). Since the raw dump and the encrypted archive sit on the scratch disk at the same time, this requires at least 66 GB, which explains why 64 GB was not enough.

For some reason no incidents and thus no alerts were triggered, even though we can see here that all conditions were met. (link)

image.png (1×1 px, 93 KB)

I tested the current alert configuration by briefly setting the duration of the condition_absent in terraform to 3600s (1 hour) - this immediately triggered several incidents and alert mails. This leaves me completely puzzled as to why this did not work for production in the last few weeks.
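
For reference, this is roughly the shape of the terraform involved — a minimal sketch only, with illustrative resource names, filter, and aggregation settings, not the exact production config from wbaas-deploy:

resource "google_monitoring_alert_policy" "sql_logic_backup_missing" {
  display_name = "sql-logic-backup missing"   # illustrative name
  combiner     = "OR"

  conditions {
    display_name = "No data from sql-logic-backup pods"

    condition_absent {
      # Illustrative filter: backup pods report this metric while they run,
      # so an absent series means no backup ran within `duration`.
      filter   = "metric.type=\"kubernetes.io/pod/volume/utilization\" resource.type=\"k8s_pod\" resource.label.\"cluster_name\"=\"wbaas-2\""
      duration = "3600s"   # 1 hour, the test value described above

      aggregations {
        alignment_period   = "600s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = []   # e-mail channels omitted in this sketch
}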

This PR introduces a new alert policy to give us a heads-up in case the backup scratch volume runs low on disk space: https://github.com/wmde/wbaas-deploy/pull/1185

As of right now it can't be applied via terraform because of one of the filter rules used: resource.label."pod_name" = starts_with("sql-logic-backup-"). In the GCP Metrics Explorer it looks like it's looking for a pod with the literal name starts_with("sql-logic-backup-"), which is of course not the intention (note that this filter was taken from the docs).
Building the filter manually in the Metrics Explorer with a regex equals comparison (=~) and switching over to the MQL view, we see it reads resource.pod_name =~ 'sql-logic-backup-.*'. But trying to apply this via terraform throws an error in the GCP API, which is not happy with the =~ operator:

│ Error: Error creating AlertPolicy: googleapi: Error 400: Field alert_policy.conditions[0].condition_threshold.filter had an invalid value of "metric.type="kubernetes.io/pod/volume/utilization"
│ resource.type="k8s_pod"
│ resource.label."cluster_name"="wbaas-2"
│ resource.pod_name=~'sql-logic-backup-.*'
│ ": Could not parse filter "metric.type=\"kubernetes.io/pod/volume/utilization\"\nresource.type=\"k8s_pod\"\nresource.label.\"cluster_name\"=\"wbaas-2\"\nresource.pod_name=~\'sql-logic-backup-.*\'\n"; syntax error at line 4, column 18, token '=~'

Today I managed to get the new alert working. Part of my struggles can only be explained by oversights and the complexities of testing these policies. In retrospect, even this commit from two days ago seems to have worked.


About the missing incidents for the "failing backups" alerting policy, I think I could reproduce the issue by suspending the cronjob yesterday (I resumed it today & manually triggered a backup), as can be seen here:

image.png (1×2 px, 104 KB)

Notice the last two data points: a scheduled backup that finished yesterday morning at 3 AM vs. the one I triggered manually, which finished at 12 PM. The time difference between those two exceeds 24h 30m (or 88200s), but no incident was created.

When looking at this policy in the GCP UI, we can see that the trigger absence time is set to 5 minutes. Editing this field and trying to input 88200 seconds brings up a warning telling us that the maximum value is 1 day:

image.png (693×826 px, 80 KB)

This is still confusing, because with a 5-minute absence window the alerting should have triggered more than ever, but it didn't. I'll set this to 1 day and repeat the test described above.

Deniz_WMDE updated the task description.

Setting the duration to 86400s (or 24h) worked, it triggered an alert and incident in Google Cloud Monitoring.

https://github.com/wmde/wbaas-deploy/pull/1185

Deniz_WMDE unsubscribed.
Deniz_WMDE subscribed.
Evelien_WMDE claimed this task.