airflow: Restrict the rights for airflow deployers to destroy postgresql clusters
Closed, Resolved, Public

Description

We recently had an incident in which an engineer accidentally deleted the PostgreSQL cluster supporting an Airflow instance while using helmfile.

For context, each Airflow instance comprises two deployments into the same namespace:

  • Airflow, which is stateless
  • A PostgreSQL cluster, which is stateful

Currently, members of the deployers group have the rights to deploy both of these components to the namespace and to delete them from it.

However, once the PostgreSQL cluster has been deleted, there are only two ways to bring it back:

  • starting again with a freshly created, empty database
  • using the backup and recovery system to restore the last known base backup and replay the WAL

With our current setup, this could mean up to 5 minutes of data loss, since that is the interval of our current WAL backup schedule.
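To make the five-minute figure concrete: anything committed after the last archived WAL segment cannot be replayed, so the loss window is the time between that archive and the deletion. The following is a minimal sketch, assuming a fixed 5-minute archive interval as described above. The function name and the timestamps are hypothetical and do not describe the actual incident.

```python
from datetime import datetime, timedelta

# Assumption: WAL is archived at least every 5 minutes, per the description above.
WAL_ARCHIVE_INTERVAL = timedelta(minutes=5)


def data_loss_window(last_wal_archived_at: datetime, deleted_at: datetime) -> timedelta:
    """Return the span of writes that WAL replay cannot recover.

    Everything committed after the last archived WAL segment is lost
    when the cluster is deleted.
    """
    if deleted_at < last_wal_archived_at:
        raise ValueError("deletion cannot precede the last archived WAL segment")
    return deleted_at - last_wal_archived_at


# Hypothetical timestamps, for illustration only.
last_archive = datetime(2025, 1, 1, 12, 0, 0)
deletion = datetime(2025, 1, 1, 12, 3, 30)

window = data_loss_window(last_archive, deletion)
print(window)  # 0:03:30
print(window <= WAL_ARCHIVE_INTERVAL)  # True
```

If archiving keeps to its schedule, the window never exceeds the archive interval. A stalled archiver would make it larger, which is one reason to monitor the age of the last archived segment.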

This ticket asks whether we have struck the right balance between convenience and security.

  • Should we limit the operations on the PostgreSQL clusters to SREs?
  • If so, how? What mechanisms would be at our disposal to achieve a level of database availability with which we are comfortable?
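One possible mechanism is Kubernetes RBAC: give deployers a role whose rules do not grant the delete verb on the PostgreSQL cluster resource, and keep that verb for SREs. The sketch below models the allow-only rule check in plain Python rather than querying a real cluster. The API group, resource name, and role contents are assumptions for illustration, not our actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A simplified RBAC rule: verbs allowed on resources within an API group."""
    api_group: str
    resources: frozenset
    verbs: frozenset


def is_allowed(rules, api_group, resource, verb):
    """Return True if any rule grants `verb` on `resource`.

    RBAC is allow-only: a request is denied unless some rule matches it.
    """
    return any(
        rule.api_group in (api_group, "*")
        and (resource in rule.resources or "*" in rule.resources)
        and (verb in rule.verbs or "*" in rule.verbs)
        for rule in rules
    )


# Hypothetical API group and resource name for the PostgreSQL cluster object.
PG_GROUP, PG_RESOURCE = "postgresql.example.org", "clusters"

# Hypothetical deployer role: full control of Deployments, read-only on PG clusters.
deployer_rules = [
    Rule("apps", frozenset({"deployments"}), frozenset({"*"})),
    Rule(PG_GROUP, frozenset({PG_RESOURCE}), frozenset({"get", "list", "watch"})),
]

# Hypothetical SRE role: every verb on every resource.
sre_rules = [Rule("*", frozenset({"*"}), frozenset({"*"}))]

print(is_allowed(deployer_rules, "apps", "deployments", "delete"))   # True
print(is_allowed(deployer_rules, PG_GROUP, PG_RESOURCE, "delete"))   # False
print(is_allowed(sre_rules, PG_GROUP, PG_RESOURCE, "delete"))        # True
```

With roles shaped like this, deployers can still redeploy or remove the stateless Airflow component, while deleting the stateful cluster requires SRE credentials. The patches in the timeline below take a related route: they provision separate kubeconfig files for the PostgreSQL deployments and split the Airflow and PostgreSQL helmfiles.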

Are there any other recommendations that we can make to improve the disaster recovery preparedness for PostgreSQL clusters?

Event Timeline

Gehel triaged this task as High priority. Apr 8 2025, 2:08 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Change #1138748 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] deployment_server: provision separate kubeconfig files for the airflow PG DBs

https://gerrit.wikimedia.org/r/1138748

brouberol changed the task status from Open to In Progress. Apr 24 2025, 1:08 PM

Change #1138748 merged by Brouberol:

[operations/puppet@production] deployment_server: provision separate kubeconfig files for the airflow PG DBs

https://gerrit.wikimedia.org/r/1138748

Change #1138827 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytics-test: split the airflow and postgresql deployments

https://gerrit.wikimedia.org/r/1138827

Change #1138827 merged by Brouberol:

[operations/deployment-charts@master] airflow-analytics-test: split the airflow and postgresql deployments

https://gerrit.wikimedia.org/r/1138827

Change #1139657 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: separate postgresql and airflow helmfiles

https://gerrit.wikimedia.org/r/1139657

Change #1139659 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] deployment_server: provision dedicated kubeconfigs for airflow PGs

https://gerrit.wikimedia.org/r/1139659

Change #1139659 merged by Brouberol:

[operations/puppet@production] deployment_server: provision dedicated kubeconfigs for airflow PGs

https://gerrit.wikimedia.org/r/1139659

Change #1139657 merged by Brouberol:

[operations/deployment-charts@master] airflow: separate postgresql and airflow helmfiles

https://gerrit.wikimedia.org/r/1139657

BTullis renamed this task from "airflow: Consider restricting the rights for airflow deployers to destroy postgresql clusters" to "airflow: Restrict the rights for airflow deployers to destroy postgresql clusters". May 1 2025, 3:49 PM