
Reduce main Airflow DB size and consider splitting heavy workloads into separate instances
Open, Needs Triage, Public

Description

Context:
The main Airflow Postgres database is growing large due to:

  • Accumulation of old DAG runs, task instances, logs, XComs, etc.
  • All workloads sharing a single Airflow instance

A large, monolithic DB negatively affects performance (larger indexes, slower vacuum/analyze, longer backups) and makes debugging and maintenance harder.

We could improve the situation by:

  • Reduce the size of the existing Airflow DB by cleaning up old/non-essential data. This could be done with a dedicated cleanup DAG, e.g. https://github.com/teamclairvoyant/airflow-maintenance-dags/blob/master/db-cleanup/airflow-db-cleanup.py, with a dag.yaml/config file defining retention policies by DAG/criticality.
  • Evaluate and, if beneficial, implement a split of the main Airflow instance into smaller, workload-focused instances.
    • We could move the dumps 1.0 and file exporter DAGs into an independent Airflow instance.
    • The same could be done for all Cassandra/Druid exporter DAGs.

Risks:

  • Over-aggressive cleanup may delete data needed for analysis, backfilling, or compliance.
  • More instances to maintain (upgrades, monitoring, alerting). Some SRE work is needed for automation.
  • Possible cross-instance orchestration complexity (dependencies, shared resources).

Event Timeline

We're also investigating the possibility of the Airflow metadata DB being implicated in: T412003: Airflow-main scheduler loop sometimes slows down markedly

For reference, we also have an Airflow DB maintenance DAG already tested and ready.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/test_k8s/dags/cleanup_airflow_db_dag.py

So far, however, it has only been deployed to the test_k8s airflow instance.
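Under the hood, such a maintenance DAG typically shells out to the `airflow db clean` command (available since Airflow 2.3). A minimal, hedged sketch of building that invocation, assuming the hypothetical helper name below (the flags are the real CLI options; everything else is illustrative):

```python
import shlex
from datetime import datetime

def build_db_clean_command(cutoff: datetime, tables: list[str]) -> str:
    """Build an `airflow db clean` command purging rows older than `cutoff`.

    --clean-before-timestamp: only rows older than this are touched
    --tables:                 restrict cleanup to the listed metadata tables
    --skip-archive:           delete outright instead of copying rows into
                              archive tables (drop this flag to keep the
                              safety net)
    --yes:                    skip the interactive confirmation prompt
    """
    return " ".join([
        "airflow", "db", "clean",
        "--clean-before-timestamp", shlex.quote(cutoff.isoformat()),
        "--tables", shlex.quote(",".join(tables)),
        "--skip-archive",
        "--yes",
    ])
```

In a DAG this string would be run via a BashOperator (or the equivalent), one task per table group.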

Antoine_Quhen added a subscriber: brouberol.

Deployment of the db_cleaner DAG on the Airflow instances went mostly well.

The exception was the search instance. Its DB was too resource-constrained for the airflow db clean script. The latter moves the rows to be deleted into a temporary table, which is removed later. When cleaning the log table, this process generated an 11M-row temporary table, crashing the DB.

Thanks to @brouberol, the cluster was brought back up and the temporary tables were manually deleted.

Change #1228445 had a related patch set uploaded (by Joal; author: Joal):

[operations/deployment-charts@master] Grow walStorage on dse-k8s pg_airflow_search

https://gerrit.wikimedia.org/r/1228445

Change #1228445 merged by Brouberol:

[operations/deployment-charts@master] Grow storage on dse-k8s pg_airflow_search

https://gerrit.wikimedia.org/r/1228445

Closing. A possible next optimization is splitting the file_exporters job out of the main instance; that should be done in another ticket.

The search and analytics_test instances are running on small DBs with no free headroom.
During deletes on large tables, Airflow’s current strategy stages deleted rows into a temporary table before removing them. That briefly requires ~2x table storage, which causes the database to run out of space and crash.

This failed again in the 2026-02-03 run. Since the same script already ran in January, this looks like a recurring capacity/strategy issue rather than a one-time first-run effect.

I’ll monitor the 2026-03-03 run with SRE.

In parallel, I prepared a fix to switch to a direct delete strategy (no temp-table copy). The tradeoff is custom logic that is coupled to the current Airflow version.
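The direct delete strategy mentioned above can be sketched as batched deletes, so the database never materializes all the doomed rows in a temporary table at once. This is an illustrative sketch against SQLite (stdlib, for runnability); the actual fix would target Postgres with the same idea, and the `dttm` column and `log` table names mirror Airflow's metadata schema but are used here only as examples:

```python
import sqlite3

def batched_delete(conn, table: str, cutoff: str, batch_size: int = 10_000) -> int:
    """Delete rows older than `cutoff` directly, in small batches.

    Each iteration deletes at most `batch_size` rows and commits, keeping
    per-transaction storage overhead bounded instead of staging an 11M-row
    copy of the victims. Returns the total number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f" SELECT rowid FROM {table} WHERE dttm < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch_size:  # last, partial batch: we are done
            break
    return total
```

The tradeoff noted above applies: this bypasses Airflow's built-in archiving, so the SQL stays coupled to the metadata schema of the Airflow version in use.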