We recently had an incident whereby an engineer accidentally deleted the postgresql cluster supporting an airflow instance by using helmfile.
For context, each airflow instance comprises two deployments into the same namespace:
- Airflow, which is stateless
- A PostgreSQL cluster, which is stateful
Currently, members of the deployers group have the rights to deploy and delete both of these components to the namespace.
However, deleting the PostgreSQL cluster will require either of the following situations in order to bring it back:
- a freshly created, empty, database
- the use of the backup and recovery system to restore the last known base backup and WAL replay
With our current setup, this could mean data loss of up to 5 minutes, since that is our current WAL backup schedule.
This ticket is about asking the question about whether or not we have the balance correct between convenience and security.
- Should we limit the operations on the postgresql clusters to SREs?
- If so, how? What mechanisms would be at our disposal to achieve a level of database availability with which we are comfortable?
Are there any other recommendations that we can make to improve the disaster recovery preparedness for PostgreSQL clusters?