We'd like to know whether the scheduler can process the dumps v1 DAGs faster without dynamic task mapping (we currently iterate over the list of fetched wikis and create tasks at runtime). Let's try hardcoding the list of wikis in the code, doing away with dynamic task mapping, to see whether we observe a performance improvement.
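A minimal sketch of the idea (not the actual airflow-dags code; the wiki names and task names below are illustrative). With dynamic task mapping, Airflow expands tasks at runtime from the fetched wiki list (e.g. `dump_wiki.expand(wiki=fetch_wikis())`); with a hardcoded list, every task is created at DAG-parse time:

```python
# Hypothetical, hardcoded subset of the wiki list.
WIKIS = ["enwiki", "frwiki", "dewiki"]

def build_static_dag(wikis):
    """Simulate parse-time task creation: one group of named tasks per wiki.

    In the real DAG this loop would create an Airflow TaskGroup per wiki;
    here we just return the task identifiers such a loop would produce.
    """
    tasks = {}
    for wiki in wikis:
        # Each wiki gets its own fixed, named tasks, so the scheduler
        # never has to expand anything dynamically at runtime.
        tasks[wiki] = [f"{wiki}.fetch", f"{wiki}.dump", f"{wiki}.upload"]
    return tasks

tasks = build_static_dag(WIKIS)
print(tasks["enwiki"])  # -> ['enwiki.fetch', 'enwiki.dump', 'enwiki.upload']
```

The tradeoff is where the work happens: the static variant shifts effort from the scheduling loop to DAG parsing, since the full task graph must be rebuilt on every parse.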
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T88728 Improve Wikimedia dumping infrastructure |
| Resolved | | BTullis | T352650 WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes |
| Resolved | | brouberol | T388378 Orchestrate dumps v1 from an airflow instance |
| Resolved | | brouberol | T390945 Run an experimental dump of 200 regular sized wikis |
| Resolved | | brouberol | T391483 Experiment with disabling dynamic task mapping |
| Resolved | | brouberol | T391678 Adjust the file processing interval |
Event Timeline
Disabling dynamic task mapping seems to have had quite an effect on the scheduler loop time (how long it takes the scheduler to perform a "tick" of work).
Left: with dynamic task mapping; right: without.
The tradeoff here is that the scheduler must perform more work at DAG-parsing time (as shown in the next screenshot), but I think this is entirely acceptable: once development work on that DAG settles down, we could configure the scheduler to re-parse the DAG file every, say, 10 minutes instead of the default 30 seconds.
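For reference, the re-parse interval mentioned above maps to Airflow's `min_file_process_interval` scheduler setting (defaulting to 30 seconds); a sketch of the change, assuming the value of 10 minutes discussed above:

```ini
[scheduler]
# Minimum seconds between re-parses of a DAG file; raising it from the
# default of 30 reduces DAG-processor load once the DAG is stable.
min_file_process_interval = 600
```

The same setting can be supplied via the `AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL` environment variable.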
One added bonus is that the DAG grid view is much more readable: each wiki has its own foldable TaskGroup, which makes it very easy to follow per-wiki progress over time.
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1232: Statically define the dumps DAG instead of relying on dynamic mapping