Page MenuHomePhabricator

Airflow-main scheduler loop sometimes slows down markedly
Open, HighPublicBUG REPORT

Description

We have observed sustained periods where the scheduler loop duration on the airflow-main instance dramatically increases.
https://grafana.wikimedia.org/goto/0LJ41kGvR?orgId=1

image.png (1×2 px, 721 KB)

The reason for this is currently unknown, but our investigation so far has not pinpointed a single DAG being responsible.

On the previous occasion (2025-12-01), the scheduler performance was restored by deleting the existing scheduler pod.
The behaviour has now recurred but we need to investigate the cause.

Event Timeline

BTullis triaged this task as High priority.Dec 8 2025, 10:49 AM
BTullis changed the subtype of this task from "Task" to "Bug Report".

Our working theory for this incident previously, was that it was somehow related to this: T409924: High pod latency affecting several dse-k8s-worker nodes in eqiad C/D rows

Certain dse-k8s-workers were exhibiting high pod latency, which may have been related to the migration to new network switches in the eqiad C&D racks.

However, the eveidence for this is no longer very strong. We have been slowly working through a rolling reboot of the dse-k8s-workers.
The current airflow-main scheduler pod is running on dse-k8s-worker1005, which had been restarted after the incident and is not showing any pod latency.

btullis@deploy2002:/srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-main$ kubectl get pods airflow-scheduler-6df6858576-nbnv9 -owide
NAME                                 READY   STATUS    RESTARTS   AGE     IP            NODE                             NOMINATED NODE   READINESS GATES
airflow-scheduler-6df6858576-nbnv9   1/1     Running   0          2d11h   10.67.28.52   dse-k8s-worker1005.eqiad.wmnet   <none>           <none>

We can see a reduced PostgreSQL throughput from airflow-main:

image.png (996×2 px, 724 KB)

...but this doesn't seem to be related to pressure on the database itself.

It seems more likely that the scheduler is stuck in a loop, doing something else.

I have killed the current scheduler pod at 10:44 UTC.

btullis@deploy2002:/srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-main$ kubectl delete pod airflow-scheduler-6df6858576-nbnv9
pod "airflow-scheduler-6df6858576-nbnv9" deleted

We can see if this improves the schduler loop duration, or if the issue remains.

The restarted scheduler pod appears not to have reduced the scheduler loop duration.

image.png (1×2 px, 298 KB)

The next thing that we would like to test is to clean old database records from the Airflow-main database.

We noticed that we don't have any automatic database maintenance DAG, such as cleanup_airflow_db_dag running on this instance.

I did a dry run of the airflow db clean command, to see how many records would be removed from the database if we were to retain six months' worth of all tables.

runuser@airflow-webserver-5fb8fbf959-2jxzn:/opt/airflow$ airflow db clean --dry-run --clean-before-timestamp '2025-06-08 00:00:00' --skip-archive
/home/app/.local/lib/python3.11/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
/opt/airflow/airflow_local_settings.py:4 DeprecationWarning: The `airflow.kubernetes.pod_generator.PodDefaults` class is deprecated. Please use `'airflow.kubernetes.pre_7_4_0_compatibility.pod_generator_deprecated.PodDefaults'`. The `cncf.kubernetes` provider must be >= 7.4.0 for that..
Performing dry run for db cleanup.
Data prior to 2025-06-08 00:00:00+00:00 would be purged from tables {'session', 'import_error', 'xcom', 'trigger', 'dataset_event', 'task_instance', 'celery_taskmeta', 'job', 'log', 'callback_request', 'dag_run', 'task_fail', 'dag', 'sla_miss', 'task_reschedule', 'celery_tasksetmeta', 'task_instance_history'} with the following config:

table                 | recency_column                   | keep_last | keep_last_filters            | keep_last_group_by
======================+==================================+===========+==============================+===================
callback_request      | callback_request.created_at      | False     | None                         | None              
celery_taskmeta       | celery_taskmeta.date_done        | False     | None                         | None              
celery_tasksetmeta    | celery_tasksetmeta.date_done     | False     | None                         | None              
dag                   | dag.last_parsed_time             | False     | None                         | None              
dag_run               | dag_run.start_date               | True      | ['external_trigger = false'] | ['dag_id']        
dataset_event         | dataset_event.timestamp          | False     | None                         | None              
import_error          | import_error.timestamp           | False     | None                         | None              
job                   | job.latest_heartbeat             | False     | None                         | None              
log                   | log.dttm                         | False     | None                         | None              
session               | session.expiry                   | False     | None                         | None              
sla_miss              | sla_miss.timestamp               | False     | None                         | None              
task_fail             | task_fail.start_date             | False     | None                         | None              
task_instance         | task_instance.start_date         | False     | None                         | None              
task_instance_history | task_instance_history.start_date | False     | None                         | None              
task_reschedule       | task_reschedule.start_date       | False     | None                         | None              
trigger               | trigger.created_date             | False     | None                         | None              
xcom                  | xcom.timestamp                   | False     | None                         | None              
                                                                                                                        
/home/app/.local/lib/python3.11/site-packages/airflow/utils/db.py:1090 SAWarning: Skipped unsupported reflection of expression-based index idx_ab_user_username
/home/app/.local/lib/python3.11/site-packages/airflow/utils/db.py:1090 SAWarning: Skipped unsupported reflection of expression-based index idx_ab_register_user_username

Performing dry run for table callback_request
Checking table callback_request
Found 0 rows meeting deletion criteria.
[2025-12-08T12:58:24.367+0000] {db_cleanup.py:448} WARNING - Table celery_taskmeta not found.  Skipping.
[2025-12-08T12:58:24.368+0000] {db_cleanup.py:448} WARNING - Table celery_tasksetmeta not found.  Skipping.

Performing dry run for table dag
Checking table dag
Found 0 rows meeting deletion criteria.

Performing dry run for table dag_run
Checking table dag_run
Found 44136 rows meeting deletion criteria.

Performing dry run for table dataset_event
Checking table dataset_event
Found 0 rows meeting deletion criteria.

Performing dry run for table import_error
Checking table import_error
Found 0 rows meeting deletion criteria.

Performing dry run for table job
Checking table job
Found 1781691 rows meeting deletion criteria.

Performing dry run for table log
Checking table log
Found 7714800 rows meeting deletion criteria.

Performing dry run for table session
Checking table session
Found 212372 rows meeting deletion criteria.

Performing dry run for table sla_miss
Checking table sla_miss
Found 7690 rows meeting deletion criteria.

Performing dry run for table task_fail
Checking table task_fail
Found 6951 rows meeting deletion criteria.

Performing dry run for table task_instance
Checking table task_instance
Found 1118088 rows meeting deletion criteria.

Performing dry run for table task_instance_history
Checking table task_instance_history
Found 78973 rows meeting deletion criteria.

Performing dry run for table task_reschedule
Checking table task_reschedule
Found 486739 rows meeting deletion criteria.

Performing dry run for table trigger
Checking table trigger
Found 0 rows meeting deletion criteria.

Performing dry run for table xcom
Checking table xcom
Found 16423 rows meeting deletion criteria.
runuser@airflow-webserver-5fb8fbf959-2jxzn:/opt/airflow$

The highlights are:

  • 1.8 million job records
  • 7.8 million log records
  • 1.1 million task instances
  • 500k task retry instances
  • 200k session records

I will do a one-off purge like this, then check the scheduler loop duration to see if it makes any difference.

Either way, we can then come back to setting up a proper database maintenance schedule.

Mentioned in SAL (#wikimedia-analytics) [2025-12-08T13:35:44Z] <btullis> purged airflow-main database records older than 6 months for T412003

Unexpected errors when purging the old database logs.

runuser@airflow-webserver-5fb8fbf959-2jxzn:/opt/airflow$ airflow db clean --clean-before-timestamp '2025-06-08 00:00:00' --skip-archive
/home/app/.local/lib/python3.11/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
/opt/airflow/airflow_local_settings.py:4 DeprecationWarning: The `airflow.kubernetes.pod_generator.PodDefaults` class is deprecated. Please use `'airflow.kubernetes.pre_7_4_0_compatibility.pod_generator_deprecated.PodDefaults'`. The `cncf.kubernetes` provider must be >= 7.4.0 for that..
You have requested that we purge all data prior to 2025-06-08 00:00:00+00:00 for tables ['callback_request', 'celery_taskmeta', 'celery_tasksetmeta', 'dag', 'dag_run', 'dataset_event', 'import_error', 'job', 'log', 'session', 'sla_miss', 'task_fail', 'task_instance', 'task_instance_history', 'task_reschedule', 'trigger', 'xcom'].
This is irreversible.  Consider backing up the tables first and / or doing a dry run with option --dry-run.
Enter 'delete rows' (without quotes) to proceed.
delete rows
/home/app/.local/lib/python3.11/site-packages/airflow/utils/db.py:1090 SAWarning: Skipped unsupported reflection of expression-based index idx_ab_user_username
/home/app/.local/lib/python3.11/site-packages/airflow/utils/db.py:1090 SAWarning: Skipped unsupported reflection of expression-based index idx_ab_register_user_username

Checking table callback_request
Found 0 rows meeting deletion criteria.
[2025-12-08T13:34:12.910+0000] {db_cleanup.py:448} WARNING - Table celery_taskmeta not found.  Skipping.
[2025-12-08T13:34:12.911+0000] {db_cleanup.py:448} WARNING - Table celery_tasksetmeta not found.  Skipping.

Checking table dag
Found 0 rows meeting deletion criteria.

Checking table dag_run
Found 44136 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__dag_run__20251208133413
Finished Performing Delete

Checking table dataset_event
Found 0 rows meeting deletion criteria.

Checking table import_error
Found 0 rows meeting deletion criteria.

Checking table job
Found 1781691 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__job__20251208133536
Finished Performing Delete

Checking table log
Found 7714800 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__log__20251208133552
[2025-12-08T13:41:29.470+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'log'. 

Checking table session
Found 212372 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__session__20251208134144
[2025-12-08T13:41:45.313+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'session'. 

Checking table sla_miss
Found 7690 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__sla_miss__20251208134145
Finished Performing Delete

Checking table task_fail
Found 915 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__task_fail__20251208134145
[2025-12-08T13:41:45.824+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'task_fail'. 

Checking table task_instance
Found 37976 rows meeting deletion criteria.
Performing Delete...
Moving data to table _airflow_deleted__task_instance__20251208134151
[2025-12-08T13:41:53.431+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'task_instance'. 

Checking table task_instance_history
[2025-12-08T13:41:53.447+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'task_instance_history'. 

Checking table task_reschedule
[2025-12-08T13:41:53.461+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'task_reschedule'. 

Checking table trigger
Found 0 rows meeting deletion criteria.

Checking table xcom
[2025-12-08T13:41:53.491+0000] {db_cleanup.py:359} WARNING - Encountered error when attempting to clean table 'xcom'.

image.png (423×604 px, 84 KB)

The WAL volume filled up to 100% of its capacity:

image.png (426×2 px, 189 KB)

Change #1216589 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump the size of the PostgreSQL WAL for airflow-main to 30GB

https://gerrit.wikimedia.org/r/1216589

Change #1216589 merged by jenkins-bot:

[operations/deployment-charts@master] Bump the size of the PostgreSQL WAL for airflow-main to 30GB

https://gerrit.wikimedia.org/r/1216589

We successfully deleted all records older than 6 months' from the database, but it didn't make a huge difference to the behaviour of the scheduler loop time.

For our next set of troubleshooting, we would like to look into enabling the pg_stat_statements extension to PostgreSQL, which should give us the ability to find the slowest and the most frequent queries.

Change #1217149 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Correct the values for postgresql parameters on analytics-test

https://gerrit.wikimedia.org/r/1217149

Change #1217149 merged by jenkins-bot:

[operations/deployment-charts@master] Correct the values for postgresql parameters on analytics-test

https://gerrit.wikimedia.org/r/1217149

Change #1218709 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/deployment-charts@master] postgresql-airflow-main: Increase pgbouncer pool size

https://gerrit.wikimedia.org/r/1218709

Change #1218709 merged by jenkins-bot:

[operations/deployment-charts@master] postgresql-airflow-main: Increase pgbouncer pool size

https://gerrit.wikimedia.org/r/1218709