
Experiment with running multiple concurrent dump DAGs to keep each graph relatively low in size
Closed, ResolvedPublic

Description

We've observed a roughly linear relationship between the number of wikis handled by the dumps v1 DAG and the scheduler loop duration (mostly spent checking task dependencies).

We could experiment with having a DAG per wiki first letter, or simply a maximum of X wikis per DAG. This way, each DAG would have about 1000 tasks, instead of a single DAG with ~30000 tasks. As we're limited by the number of pool slots and overall parallelism, it might actually be less costly to shard tasks across more DAGs.
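The "maximum of X wikis per DAG" variant could be sketched as follows. This is a hypothetical illustration of the sharding logic only, not the code from the merge request: the chunk size, `dag_id` naming scheme, and `shard_wikis` helper are all assumptions for the example.

```python
def shard_wikis(wikis, max_wikis_per_dag=1000):
    """Yield (dag_id, wiki_subset) pairs, one per generated DAG.

    Splitting ~30000 wikis into chunks of 1000 yields ~30 small DAGs,
    keeping each scheduler dependency-check pass cheap.
    The dag_id pattern below is illustrative, not the actual one.
    """
    wikis = sorted(wikis)
    for i in range(0, len(wikis), max_wikis_per_dag):
        shard = wikis[i:i + max_wikis_per_dag]
        yield f"dumps_wikis_shard_{i // max_wikis_per_dag:03d}", shard


# In an Airflow dags/ module, one would then instantiate a DAG per shard,
# e.g. (pseudocode-level; real task wiring omitted):
#
#   for dag_id, shard in shard_wikis(all_wikis):
#       globals()[dag_id] = build_dumps_dag(dag_id, shard)
dags = dict(shard_wikis([f"wiki{i}" for i in range(30000)]))
```

Because the overall throughput is still capped by pool slots and `parallelism`, sharding changes the scheduler's per-loop cost without increasing actual concurrency.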

Screenshot 2025-04-11 at 15.27.28.png (548×996 px, 71 KB)

A dumps DAG run with 16 wikis (left) and 32 wikis (right)

Details

Related Changes in GitLab:
Title: test_k8s/dumpsv1: generate multiple DAGs for regular wiki dumps to keep UI loading time low
Reference: repos/data-engineering/airflow-dags!1276
Author: brouberol
Source branch: T391684
Dest branch: main

Event Timeline

brouberol triaged this task as Medium priority.
brouberol updated the task description.

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1276

test_k8s/dumpsv1: generate multiple DAGs for regular wiki dumps to keep UI loading time low