
Batch dump jobs in a way that makes business sense
Closed, Resolved (Public)

Assigned To
Authored By
brouberol
Apr 29 2025, 3:54 PM

Description

Right now, we're batching dump jobs in groups of 5 (which has led to some substantial parsing, scheduling and execution time improvements), but the batches themselves don't make a whole lot of sense. We now need to make sure these jobs are batched in a way that reflects business logic.
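
To illustrate the difference, here is a minimal sketch contrasting fixed-size batching with grouping by a business-level key. It is not taken from the airflow-dags repository: the job names, the stage labels and both helper functions are hypothetical, and "stage" is only an assumed example of what a meaningful grouping key could be.

```python
from collections import defaultdict
from itertools import islice


def fixed_size_batches(jobs, size=5):
    """Split jobs into consecutive batches of at most `size` items."""
    it = iter(jobs)
    while batch := list(islice(it, size)):
        yield batch


def grouped_batches(jobs, key):
    """Group jobs by a business-level key, keeping first-seen group order."""
    groups = defaultdict(list)
    for job in jobs:
        groups[key(job)].append(job)
    return dict(groups)


# Hypothetical (stage, job name) pairs; real dump jobs will differ.
jobs = [
    ("stage_1", "job_a"),
    ("stage_2", "job_b"),
    ("stage_1", "job_c"),
    ("stage_3", "job_d"),
    ("stage_2", "job_e"),
    ("stage_1", "job_f"),
    ("stage_3", "job_g"),
]

# Arbitrary batches: membership depends only on list position.
fixed = list(fixed_size_batches(jobs, size=5))

# Meaningful batches: membership depends on the (assumed) stage of each job.
by_stage = grouped_batches(jobs, key=lambda job: job[0])

for stage, members in by_stage.items():
    print(stage, [name for _, name in members])
```

With fixed-size batching, unrelated jobs end up together just because they are adjacent in the list; with a grouping key, each batch corresponds to something that can be named and reasoned about.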

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Dumps_v1: Run dumps in stages instead of batches | repos/data-engineering/airflow-dags!1372 | btullis | batch_dumps | main
test_k8s/dumpsv1: execute the next dump batch while syncing the current batch files | repos/data-engineering/airflow-dags!1298 | brouberol | T392915 | main

Event Timeline

brouberol triaged this task as Medium priority.
brouberol changed the task status from Open to In Progress. Apr 29 2025, 3:55 PM
brouberol claimed this task.

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1298

test_k8s/dumpsv1: execute the next dump batch while syncing the current batch files

brouberol reopened this task as In Progress.

This is now working.

image.png (896×1 px, 167 KB)

For a regular-sized wiki, the task graph looks like this:
image.png (409×1 px, 65 KB)

For a large or huge wiki, the task graph looks like this:
image.png (573×2 px, 116 KB)