
Split Cassandra Airflow dags by dataset
Closed, Resolved | Public | 3 Estimated Story Points


We currently have one DAG per Cassandra import schedule (hourly, daily, and monthly), so three DAGs in total.

Within each DAG (except hourly, which has only one dataset), a long list of datasets is imported to Cassandra in parallel.

This is inconvenient when backfilling a single dataset.

So we may instead generate one DAG per dataset+schedule combination.
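
As a rough illustration of the proposal, the sketch below enumerates one DAG definition per dataset+schedule pair. It is not the code from the linked merge request: the dataset names, the `cassandra_load_` ID prefix, `DagSpec`, and `build_dag_specs` are all hypothetical, and the Airflow DAG construction itself is left out so the example runs without Airflow installed.

```python
from dataclasses import dataclass

# Hypothetical dataset names per schedule; the real lists live in the
# airflow-dags repository and are not reproduced here.
DATASETS_BY_SCHEDULE = {
    "hourly": ["dataset_a"],
    "daily": ["dataset_a", "dataset_b"],
    "monthly": ["dataset_b", "dataset_c"],
}


@dataclass(frozen=True)
class DagSpec:
    """Minimal description of one DAG to generate (hypothetical structure)."""
    dag_id: str
    dataset: str
    schedule: str


def build_dag_specs(datasets_by_schedule):
    """Return one DagSpec per (dataset, schedule) combination."""
    specs = []
    for schedule, datasets in datasets_by_schedule.items():
        for dataset in datasets:
            specs.append(
                DagSpec(
                    # Assumed naming scheme; any unique ID per pair would do.
                    dag_id=f"cassandra_load_{dataset}_{schedule}",
                    dataset=dataset,
                    schedule=schedule,
                )
            )
    return specs


DAG_SPECS = build_dag_specs(DATASETS_BY_SCHEDULE)

for spec in DAG_SPECS:
    print(spec.dag_id)
```

In an actual DAG file, each spec would presumably be turned into its own DAG object holding a single loading task, so that a backfill can target one dataset without touching the others.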


Reference | Source Branch | Dest Branch | Author | Title
repos/data-engineering/airflow-dags!455 | update_analytics_cassandra_loading | main | joal | Split cassandra loading jobs by datasets

Event Timeline

So if I'm understanding correctly, you want to refactor the code so that instead of generating one DAG with many tasks, we generate many DAGs with one task. Yes?

JArguello-WMF set the point value for this task to 3. (Jun 5 2023, 4:32 PM)