
Configure scheduled backups and WAL archiving to use our S3 endpoint
Closed, ResolvedPublic

Description

Once we have configured our Ceph cluster to expose buckets implementing the S3 API, we can configure regular backups as well as WAL archiving to use an S3 bucket in Ceph.

We'll need to determine the backup frequency as well, but we could start with daily backups.

Event Timeline

Gehel triaged this task as High priority. Aug 14 2024, 8:32 AM

Given that we have decided to use Ceph/S3 for the scheduled backup and WAL archive, I have added T330152: Deploy ceph radosgw processes to data-engineering cluster and T330153: Configure Anycast load-balancing ceph radosgw services on the data-engineering cluster as parent tickets of this one, then put them into our new milestone.

The original deck says this about PostgreSQL backups:

Frequent backups to external storage. These may be streaming or snapshots, or both.

The only concern that I have about using our Ceph cluster's S3 interface for the database WAL archiving and backups is that it doesn't necessarily count as external storage.

If the Ceph cluster is down, then both the CSI interface providing PostgreSQL's block storage and the S3 interface hosting the backups will be unavailable.

On the other hand, using our own Ceph cluster for the WAL streaming and backups also makes a lot of sense, since it is optimised for this combined role.

So my feeling is that we should use Ceph/S3, but that we should have an additional backup mechanism that regularly exports our PostgreSQL database backups from the S3 bucket to local disk, from where they can be backed up again to Bacula, giving us disaster recovery potential.
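
A minimal sketch of what such an export job could look like (the bucket path, local directory, and the use of s3cmd are assumptions for illustration; the actual mechanism is still to be decided):

```python
import subprocess


def build_export_cmd(bucket_path: str, local_dir: str) -> list:
    """Build the command mirroring the PG backups from S3 to local
    disk, from where Bacula can pick them up (hypothetical sketch)."""
    # s3cmd sync only transfers new/changed objects, so running this
    # regularly (e.g. from a systemd timer) stays cheap.
    return ["s3cmd", "sync", bucket_path, local_dir]


# Example (hypothetical paths):
# subprocess.run(build_export_cmd("s3://postgresql/postgresql-test/",
#                                 "/srv/postgresql-backups/"), check=True)
```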

Ok, this one took a lot of work to figure out, so I'm going to dump all the details here before I forget.

First off, I encountered a bug in which calling barman-cloud-backup erred with

FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/']

I checked, and indeed, each of these directories is read-only, due to the controller hardcoding readOnlyFilesystem: true for the postgresql pods.

However, the controller mounts an ephemeral volume to both /run and /controller (the same volume, actually), so we can tell Python (the barman tools are written in Python) to use /run as its temporary directory when calling tempfile.mkdtemp, by setting the TMPDIR environment variable to /run, as seen here.
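
To illustrate the mechanism (a standalone sketch, not the controller code): Python's tempfile module caches its choice of temporary directory, and consults TMPDIR when making that choice.

```python
import os
import tempfile


def temp_dir_with_tmpdir(path: str) -> str:
    """Return the directory tempfile will use once TMPDIR points at
    `path` (assuming `path` exists and is writable)."""
    os.environ["TMPDIR"] = path
    # tempfile caches its choice in tempfile.tempdir; clear it so the
    # new TMPDIR value is taken into account.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

In the pod, setting TMPDIR=/run before the barman tools start has the same effect, since /run is backed by the writable ephemeral volume.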

At this point, the backup tools didn't crash, but the postgresql container did crash after a while when the barman tools were invoked. I figured that it was because the postgresql pods were given so few resources that invoking barman-cloud-backup caused the healthcheck to fail and the pod to be re-created. I then upped the PG pod resources to

resources:
    limits:
      cpu: 2000m
      memory: 8Gi
    requests:
      cpu: 2000m
      memory: 8Gi

and this time, the pods didn't crash anymore.

The barman commands hung forever though, which smelled of a NetworkPolicy issue. Relying on the Calico NetworkPolicy generated from the external_services data didn't work here, because the generated selector was app == 'cloudnative-pg-cluster' && release == 'test', which didn't match the PG pod labels.

In the end, I simply copied the network policy template into the cloudnative-pg-cluster chart, with a hardcoded selector:

{{- if $.Values.external_services }}
{{- range $serviceType, $serviceNames := $.Values.external_services }}
{{- if $serviceNames }}
---
apiVersion: crd.projectcalico.org/v1
kind: NetworkPolicy
metadata:
  name: {{ include "cluster.fullname" $ }}-egress-external-services-{{ $serviceType }}
spec:
  selector: "cnpg.io/podRole == 'instance'"
  types:
  - Egress
  egress:
    {{- range $serviceName := $serviceNames }}
    - action: Allow
      destination:
        services:
          name: {{ $serviceType }}-{{ $serviceName }}
          namespace: external-services
    {{- end }}

{{- end }}
{{- end }}
{{- end }}

At that point, the pods could talk to the Rados gateway / s3 endpoint:

brouberol@dse-k8s-worker1008:~$ host rgw.eqiad.dpe.anycast.wmnet
rgw.eqiad.dpe.anycast.wmnet has address 10.3.0.8
rgw.eqiad.dpe.anycast.wmnet has IPv6 address 2a02:ec80:ff00:101::8
brouberol@dse-k8s-worker1008:~$ nsenter-container postgresql-test-3 postgres telnet 10.3.0.8 443
Trying 10.3.0.8...
Connected to 10.3.0.8.
Escape character is '^]'.
^]

telnet> quit
Connection closed.
brouberol@dse-k8s-worker1008:~$ nsenter-container postgresql-test-3 postgres telnet 2a02:ec80:ff00:101::8 443
Trying 2a02:ec80:ff00:101::8...
Connected to 2a02:ec80:ff00:101::8.
Escape character is '^]'.
^]

telnet> quit
Connection closed.

At that point, the barman tools managed to reach the S3 endpoint, but failed with an SSL certificate validation error. I found a way to render the certificates we deploy on all hosts to /etc/ssl/certs/wmf-ca-certificates.crt into a Secret, which is realized on disk at /controller/certificates/backup-barman-ca.crt. When a backup is performed, the controller injects the AWS_CA_BUNDLE=/controller/certificates/backup-barman-ca.crt environment variable, allowing the boto client to validate the S3 endpoint certificate.
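
Under the hood this is plain TLS trust configuration: botocore uses the file named by AWS_CA_BUNDLE as its trust root when verifying the endpoint certificate, roughly equivalent to the following sketch (not the actual botocore code):

```python
import ssl
from typing import Optional


def context_for_ca_bundle(ca_bundle: Optional[str]) -> ssl.SSLContext:
    """Build an SSL context trusting the given CA bundle file (or the
    system trust store when None), which is effectively what botocore
    does with the file named by AWS_CA_BUNDLE."""
    return ssl.create_default_context(cafile=ca_bundle)


# In the pod, AWS_CA_BUNDLE=/controller/certificates/backup-barman-ca.crt
# makes boto trust the WMF CA that signed the radosgw certificate.
```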

At this point, the barman tools could access the s3://postgresql bucket, but any PUT request failed with

ERROR: Barman cloud WAL archiver exception: An error occurred (InvalidRequest) when calling the PutObject operation: None

Example:

postgres@postgresql-test-2:/srv/app$ export AWS_ACCESS_KEY_ID=[REDACTED]
postgres@postgresql-test-2:/srv/app$ export AWS_SECRET_ACCESS_KEY=[REDACTED]
postgres@postgresql-test-2:/srv/app$ export AWS_CA_BUNDLE=/controller/certificates/backup-barman-ca.crt # this is the wmf-ca-certificate.crt file
postgres@postgresql-test-2:/srv/app$ barman-cloud-backup --user postgres --name backup-20240911142505 --gzip --encryption AES256 --jobs 2 --endpoint-url https://rgw.eqiad.dpe.anycast.wmnet --cloud-provider aws-s3  --verbose s3://postgresql/postgresql-test/ postgresql-test
2024-09-12 07:04:01,382 [560] INFO: Found credentials in environment variables.
2024-09-12 07:04:01,461 [560] INFO: Starting backup '20240912T070401'
2024-09-12 07:04:05,938 [560] INFO: Uploading 'pgdata' directory '/var/lib/postgresql/data/pgdata' as 'data.tar.gz'
2024-09-12 07:04:09,613 [560] INFO: Uploading 'pg_control' file from '/var/lib/postgresql/data/pgdata/global/pg_control' to 'data.tar.gz' with path 'global/pg_control'
2024-09-12 07:04:09,614 [560] INFO: Stopping backup '20240912T070401'
2024-09-12 07:04:09,616 [560] INFO: Uploading 'backup_label' file to 'data.tar.gz' with path 'backup_label'
2024-09-12 07:04:09,616 [560] INFO: Marking all the uploaded archives as 'completed'
2024-09-12 07:04:09,624 [560] ERROR: Backup failed uploading data (An error occurred (InvalidRequest) when calling the CreateMultipartUpload operation: None)
2024-09-12 07:04:09,624 [560] INFO: Uploading 'postgresql-test/postgresql-test/base/20240912T070401/backup.info'
2024-09-12 07:04:09,672 [560] ERROR: Backup failed uploading backup.info file (An error occurred (InvalidRequest) when calling the PutObject operation: None)

I tried to assess whether the credentials were wrong or not:

Python 3.11.2 (main, May  2 2024, 11:59:08) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import boto3
>>> client = boto3.client(service_name="s3", endpoint_url="https://rgw.eqiad.dpe.anycast.wmnet")
>>> paginator = client.get_paginator("list_objects_v2")
>>> list(paginator.paginate(Bucket="postgresql"))
[{'ResponseMetadata': {'RequestId': 'tx00000c3b3ea2022d3667a-0066e297ec-9ab5f8-eqiad', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-request-id': 'tx00000c3b3ea2022d3667a-0066e297ec-9ab5f8-eqiad', 'content-type': 'application/xml', 'date': 'Thu, 12 Sep 2024 07:27:40 GMT', 'x-envoy-upstream-service-time': '4', 'server': 'envoy', 'transfer-encoding': 'chunked'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Name': 'postgresql', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url', 'KeyCount': 0}]
>>> client.put_object(Bucket='postgresql', Key='/tmpfile.txt', Body=open('/run/tmpfile.txt', 'rb'))
{'ResponseMetadata': {'RequestId': 'tx0000039eceb65faf1fc32-0066e2992b-9ab5f8-eqiad', 'HostId': '', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-length': '0', 'etag': '"86fb269d190d2c85f6e0468ceca42a20"', 'accept-ranges': 'bytes', 'x-amz-request-id': 'tx0000039eceb65faf1fc32-0066e2992b-9ab5f8-eqiad', 'date': 'Thu, 12 Sep 2024 07:32:59 GMT', 'x-envoy-upstream-service-time': '11', 'server': 'envoy'}, 'RetryAttempts': 0}, 'ETag': '"86fb269d190d2c85f6e0468ceca42a20"'}
>>>

So, clearly, the credentials worked, but uploading the backups somehow failed. It turns out that the issue had to do with our requesting that the WALs and backups be encrypted, when the S3 bucket was not configured for encryption.
The encryption does not happen at the client level, but at the S3 bucket level, so when barman asked for the data to be encrypted, the S3 server responded "403 no way jose".
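
Concretely, barman-cloud's --encryption AES256 flag is passed through as the ServerSideEncryption parameter on the PutObject/CreateMultipartUpload calls, which becomes the x-amz-server-side-encryption: AES256 request header. A sketch of that argument mapping (the actual boto3 call is shown in a comment, not executed here):

```python
from typing import Optional


def put_object_args(bucket: str, key: str, body: bytes,
                    encryption: Optional[str] = None) -> dict:
    """Arguments for an S3 PutObject call, with optional server-side
    encryption. Our radosgw rejected requests carrying the
    ServerSideEncryption parameter (InvalidRequest), since SSE was
    not configured on the gateway."""
    args = {"Bucket": bucket, "Key": key, "Body": body}
    if encryption:
        # Translates to the x-amz-server-side-encryption header.
        args["ServerSideEncryption"] = encryption
    return args


# e.g. client.put_object(**put_object_args("postgresql", "probe", b"x",
#                                          encryption="AES256"))
```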

We could reproduce the issue with the CLI:

# with encryption: boo
postgres@postgresql-test-1:/var/lib/postgresql/data/pgdata$ barman-cloud-wal-archive --gzip -e AES256 --endpoint-url https://rgw.eqiad.dpe.anycast.wmnet --verbose --cloud-provider aws-s3 s3://postgresql/postgresql-test/ postgresql-test pg_wal/000000010000000000000001
2024-09-12 08:12:30,927 [610] INFO: Found credentials in environment variables.
2024-09-12 08:12:32,862 [610] ERROR: Barman cloud WAL archiver exception: An error occurred (InvalidRequest) when calling the PutObject operation: None

# without encryption: woohoo
postgres@postgresql-test-1:/var/lib/postgresql/data/pgdata$ barman-cloud-wal-archive --gzip  --endpoint-url https://rgw.eqiad.dpe.anycast.wmnet --verbose --cloud-provider aws-s3 s3://postgresql/postgresql-test/ postgresql-test pg_wal/000000010000000000000001
2024-09-12 08:14:46,798 [864] INFO: Found credentials in environment variables.
postgres@postgresql-test-1:/var/lib/postgresql/data/pgdata$

We disabled WAL and backup encryption for now, and finally, we started seeing data flowing into S3!

brouberol@stat1008:~$ s3cmd ls s3://postgresql/postgresql-test/base/20240912T085149/
2024-09-12 08:51         1308  s3://postgresql/postgresql-test/base/20240912T085149/backup.info
2024-09-12 08:51      4112514  s3://postgresql/postgresql-test/base/20240912T085149/data.tar.gz
brouberol@stat1008:~$ s3cmd ls s3://postgresql/postgresql-test/wals/
                          DIR  s3://postgresql/postgresql-test/wals/0000000100000000/
brouberol@stat1008:~$ s3cmd ls s3://postgresql/postgresql-test/wals/0000000100000000/
2024-09-12 08:51      2366741  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000001.gz
2024-09-12 08:51        29422  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000002.gz
2024-09-12 08:51          215  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000003.00000028.backup.gz
2024-09-12 08:51        16421  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000003.gz
2024-09-12 08:51          200  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000004.000000D8.backup.gz
2024-09-12 08:51        16521  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000004.gz
2024-09-12 08:51        16413  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000005.gz
2024-09-12 08:52        95748  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000006.gz
2024-09-12 08:52          199  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000007.00000028.backup.gz
2024-09-12 08:52        16489  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000007.gz
2024-09-12 08:57        65518  s3://postgresql/postgresql-test/wals/0000000100000000/000000010000000000000008.gz

Which we could see on our dashboard:
{F57502368}

I still don't understand how anyone avoids bumping their head against the TMPDIR env var issue, especially because the upstream chart does not offer an easy way to inject custom env vars into the PG pods. It is possible, mind you, just not exposed by the chart by default.

Change #1072504 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg-cluster: enable wal upload / backups to s3 by default

https://gerrit.wikimedia.org/r/1072504

Change #1072546 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored

https://gerrit.wikimedia.org/r/1072546

Change #1072504 merged by jenkins-bot:

[operations/deployment-charts@master] cloudnative-pg-cluster: enable wal upload / backups to s3 by default

https://gerrit.wikimedia.org/r/1072504

Change #1072546 merged by jenkins-bot:

[operations/deployment-charts@master] cloudnative-pg-cluster: setup good defaults allowing a cluster to be restored

https://gerrit.wikimedia.org/r/1072546

Change #1073445 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: ensure each airflow release store logs to a unique s3 bucket

https://gerrit.wikimedia.org/r/1073445

Change #1073445 merged by Brouberol:

[operations/deployment-charts@master] airflow: ensure each airflow release store logs to a unique s3 bucket

https://gerrit.wikimedia.org/r/1073445