
Airflow instance 'airflow-test-k8s' loses all XML dumps DAGs intermittently
Closed, Resolved, Public

Description

While checking progress on the enwiki XML dump run, I got this error from Airflow: DAG “mediawiki_dumps_sql_xml_large_a_to_z_full” seems to be missing from DagBag.

Screenshot 2025-08-12 at 8.50.28 AM.png (728×1 px, 143 KB)

@BTullis mentions that:

We have seen it before and I think that it is related to DAG parsing timeouts. We have even seen it on airflow-main. Should be largely harmless, as tasks are re-adopted, but it would be better if we could stop it from happening.

I was able to reproduce this just as I was creating this ticket. All XML dump-related DAGs were lost for about a minute. All other DAGs seem to be fine.

Slack thread.

Event Timeline

xcollazo triaged this task as Medium priority. Aug 21 2025, 2:33 PM

And again. :(

Bumping priority.

brouberol changed the task status from Open to In Progress. Aug 22 2025, 3:15 PM
brouberol claimed this task.

I've submitted an MR with a large rewrite of the DAG file. Instead of having a single DAG file iterate over partial/full, large/regular, all wiki buckets, and each wiki in each bucket, we now codegen one DAG file for each of the 20 SQL/XML dump DAGs we run.

IMHO this increases the maintenance burden, as the overall setup becomes more complex, but it dramatically improves how Airflow behaves, alongside several code improvements I found by profiling the generated DAG files.
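
To illustrate the idea, here is a minimal sketch of generating one DAG file per dump DAG from a template. It is not the code from the MR: the dimension lists, the template body, the function name generate_dag_files and the output layout are all assumptions made for the example. Only the dag_id pattern follows the DAG name quoted in the description.

```python
from itertools import product
from pathlib import Path
from string import Template

# Hypothetical dimensions. The real bucket names, and the exact split that
# yields the 20 DAGs, are not listed in this task.
WIKI_SIZES = ("large", "regular")
BUCKETS = ("a_to_z",)  # placeholder bucket list
DUMP_KINDS = ("full", "partial")

# Hypothetical template: each generated file declares exactly one DAG, so
# parsing it never has to loop over every bucket and every wiki.
DAG_TEMPLATE = Template('''\
# GENERATED FILE, do not edit by hand.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="$dag_id",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    pass  # the tasks for $dag_id would be declared here
''')


def generate_dag_files(output_dir: Path) -> list[Path]:
    """Write one DAG file per (size, bucket, kind) combination."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for size, bucket, kind in product(WIKI_SIZES, BUCKETS, DUMP_KINDS):
        dag_id = f"mediawiki_dumps_sql_xml_{size}_{bucket}_{kind}"
        path = output_dir / f"{dag_id}.py"
        path.write_text(DAG_TEMPLATE.substitute(dag_id=dag_id))
        written.append(path)
    return written
```

Calling generate_dag_files(Path("generated_dags")) in this sketch would write four small files, one of them named mediawiki_dumps_sql_xml_large_a_to_z_full.py. The trade-off is the one described above: an extra generation step to maintain, in exchange for DAG files that are each cheap to parse.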

Change #1181519 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade

https://gerrit.wikimedia.org/r/1181519

Change #1181519 merged by Brouberol:

[operations/deployment-charts@master] mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade

https://gerrit.wikimedia.org/r/1181519

Change #1184064 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s: increase DAG file parsing interval

https://gerrit.wikimedia.org/r/1184064

Change #1184064 merged by Brouberol:

[operations/deployment-charts@master] airflow-test-k8s: increase DAG file parsing interval

https://gerrit.wikimedia.org/r/1184064

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1625

test_k8s/dumps/sql_xml: code generate each DAG to free scheduler resources

The change had a positive impact on both scheduler CPU usage and DAG parsing time. The DAGs are no longer disappearing from the UI.

image.png (610×2 px, 293 KB)
image.png (546×2 px, 158 KB)
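
For anyone wanting to check the parsing time of a single DAG file locally, a rough sketch follows. It assumes that executing the file as a plain Python module is a fair proxy for what the Airflow DAG processor does, which is only an approximation. The helper names time_dag_file_import and profile_dag_file_import are made up for this example and are not part of Airflow.

```python
import cProfile
import importlib.util
import pstats
import time
from pathlib import Path


def time_dag_file_import(path: Path) -> float:
    """Return the seconds spent executing the module at `path` once."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # runs the top-level code of the file
    return time.perf_counter() - start


def profile_dag_file_import(path: Path, top: int = 10) -> None:
    """Print the `top` most expensive calls (by cumulative time) for one import."""
    profiler = cProfile.Profile()
    profiler.runcall(time_dag_file_import, path)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(top)
```

Running time_dag_file_import on a file before and after a change gives a quick number to compare, and profile_dag_file_import shows where the time goes. Both need the same Python environment as the scheduler so that the file's imports resolve.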

amastilovic closed https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1634

Draft: Some simple performance improvements for mediawiki_sql_xml_dumps.py