Context:
The main Airflow Postgres database is growing large due to:
- Accumulation of old DAG runs, task instances, logs, XComs, etc.
- All workloads sharing a single Airflow instance
A large, monolithic DB negatively affects performance (larger indexes, slower vacuum/analyze, longer backups) and makes debugging and maintenance harder.
We could improve the situation by:
- Reduce the size of the existing Airflow DB by cleaning up old and non-essential data. This could be done with a dedicated cleanup DAG, e.g. https://github.com/teamclairvoyant/airflow-maintenance-dags/blob/master/db-cleanup/airflow-db-cleanup.py, with a dag.yaml/config file defining retention policies per DAG and criticality.
- Evaluate and, if beneficial, implement a split of the main Airflow instance into smaller, workload-focused instances.
  - We could regroup the dumps 1.0 and file exporter DAGs into an independent Airflow instance.
  - Likewise for all Cassandra/Druid exporter DAGs.
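The cleanup approach above could be sketched as a small retention-policy table that generates per-table DELETE statements. This is a minimal illustration, not the linked maintenance DAG: the retention values and the policy dict are hypothetical, while the table and timestamp column names (`dag_run.execution_date`, `task_instance.start_date`, `xcom.timestamp`, `log.dttm`) are assumed to match the Airflow 2.x metadata schema and should be verified against the deployed version.

```python
from datetime import datetime, timedelta

# Hypothetical retention policy: days of history to keep per metadata table.
# Real values would come from the dag.yaml/config file mentioned above.
RETENTION_DAYS = {
    "dag_run": 90,
    "task_instance": 90,
    "xcom": 30,
    "log": 30,
}

# Timestamp column used as the age cutoff for each table (assumed Airflow 2.x schema).
AGE_COLUMN = {
    "dag_run": "execution_date",
    "task_instance": "start_date",
    "xcom": "timestamp",
    "log": "dttm",
}

def build_cleanup_statements(now: datetime) -> list:
    """Return one DELETE statement per table, removing rows older than the policy allows."""
    statements = []
    for table, days in RETENTION_DAYS.items():
        cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%d")
        col = AGE_COLUMN[table]
        statements.append(f"DELETE FROM {table} WHERE {col} < '{cutoff}'")
    return statements
```

In a real cleanup DAG these statements would be executed in batches (e.g. `DELETE ... LIMIT n` in a loop) to avoid long-running transactions and lock contention on the live scheduler tables.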
Risks:
- Over-aggressive cleanup may delete data needed for analysis, backfilling, or compliance.
- More instances to maintain (upgrades, monitoring, alerting); some SRE work is needed for automation.
- Possible cross-instance orchestration complexity (dependencies, shared resources).
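On the cross-instance orchestration risk: once DAGs are split across instances, in-instance dependency mechanisms (e.g. `ExternalTaskSensor`) no longer apply, so one instance typically triggers a DAG in another via Airflow's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`). A minimal sketch of building such a request; the base URL, DAG id, and run id below are hypothetical placeholders:

```python
import json
from urllib.parse import quote

def build_dagrun_trigger_request(base_url, dag_id, run_id, conf=None):
    """Build the (url, json_body) pair for Airflow's stable REST API
    endpoint POST /api/v1/dags/{dag_id}/dagRuns, used to trigger a DAG
    hosted on a different Airflow instance."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{quote(dag_id)}/dagRuns"
    body = json.dumps({"dag_run_id": run_id, "conf": conf or {}})
    return url, body
```

The returned URL and body would then be sent with an authenticated HTTP client (e.g. `requests.post` with the instance's auth scheme); this piece is the operational complexity the risk above refers to, since auth, retries, and failure alerting must now be handled between instances rather than inside one scheduler.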