Right now, we're batching dump jobs in groups of 5 (which has led to substantial parsing, scheduling, and execution time improvements), but the batches themselves don't make a whole lot of sense. We now need to batch these jobs in a way that reflects business logic.
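To illustrate the current behaviour being replaced, here is a minimal sketch of fixed-size batching: jobs are grouped purely by count, so a batch can mix unrelated jobs. The function name and job names are illustrative, not the actual DAG code.

```python
from itertools import islice


def batch_by_count(jobs, size=5):
    """Naive batching: fixed-size groups, with no regard for job affinity.

    This is the scheme the task wants to replace with
    business-logic-driven grouping.
    """
    it = iter(jobs)
    while batch := list(islice(it, size)):
        yield batch


batches = list(batch_by_count(["job1", "job2", "job3", "job4", "job5", "job6"]))
# two batches: the first five jobs together, then the remaining one
```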
Description
Details
Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Dumps_v1: Run dumps in stages instead of batches | repos/data-engineering/airflow-dags!1372 | btullis | batch_dumps | main |
| test_k8s/dumpsv1: execute the next dump batch while syncing the current batch files | repos/data-engineering/airflow-dags!1298 | brouberol | T392915 | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T88728 Improve Wikimedia dumping infrastructure |
| Resolved | | BTullis | T352650 WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes |
| Resolved | | brouberol | T388378 Orchestrate dumps v1 from an airflow instance |
| Resolved | | BTullis | T392915 Batch dump jobs in a way that makes business sense |
Event Timeline
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1298 (test_k8s/dumpsv1: execute the next dump batch while syncing the current batch files)
btullis merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1372 (Dumps_v1: Run dumps in stages instead of batches)
This is now working.
For a regular-sized wiki, the task graph looks like this:
For a large or huge wiki, the task graph looks like this: