
Experiment with absolute priority weights
Closed, Resolved (Public)

Description

According to https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html (as well as a conversation I had on the airflow slack):

absolute: The effective weight is the exact priority_weight specified without additional weighting. You may want to do this when you know exactly what priority weight each task should have. Additionally, when set to absolute, there is bonus effect of significantly speeding up the task creation process as for very large DAGs.

We might benefit from setting an absolute priority_weight of 1 on each dumps task, as it might make the DAG DB serialization much faster.
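To illustrate what the quoted rule means, here is a minimal pure-Python sketch (not Airflow code) that compares the effective weight of each task under the default "downstream" rule and under "absolute". The toy task graph and the helper names are hypothetical. DEFAULT_ARGS shows how the two settings could be expressed with the `priority_weight` and `weight_rule` operator arguments; whether the dumps DAG passes them through `default_args` is an assumption.

```python
from typing import Dict, List, Set

# Hypothetical toy DAG: task id -> direct downstream task ids (illustrative names only).
DOWNSTREAM: Dict[str, List[str]] = {
    "start": ["dump_a", "dump_b"],
    "dump_a": ["publish"],
    "dump_b": ["publish"],
    "publish": [],
}

# Every task gets the same priority_weight of 1, as proposed in the description.
PRIORITY_WEIGHT: Dict[str, int] = {task: 1 for task in DOWNSTREAM}

# Assumption: one way to apply the setting to all tasks of a DAG is via default_args.
DEFAULT_ARGS = {"priority_weight": 1, "weight_rule": "absolute"}


def all_downstream(task: str) -> Set[str]:
    """Return every task reachable downstream of `task`, excluding `task` itself."""
    seen: Set[str] = set()
    stack = list(DOWNSTREAM[task])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DOWNSTREAM[current])
    return seen


def effective_weight(task: str, weight_rule: str) -> int:
    """Compute the effective weight of `task` under the given weight rule."""
    if weight_rule == "absolute":
        # No graph traversal: the weight is exactly the one specified.
        return PRIORITY_WEIGHT[task]
    if weight_rule == "downstream":
        # Own weight plus the weights of all downstream tasks.
        return PRIORITY_WEIGHT[task] + sum(
            PRIORITY_WEIGHT[t] for t in all_downstream(task)
        )
    raise ValueError(f"unsupported weight rule: {weight_rule}")


if __name__ == "__main__":
    for task_id in DOWNSTREAM:
        print(
            task_id,
            effective_weight(task_id, "downstream"),
            effective_weight(task_id, "absolute"),
        )
```

With "absolute", computing a task's weight never requires walking the graph, which is the intuition behind the expected speed-up for large DAGs.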

Details

Related Changes in GitLab:
Title: test_k8s/dumpsv1: set an absolute weight on each dumps task to speed up scheduling
Reference: repos/data-engineering/airflow-dags!1228
Author: brouberol
Source Branch: T391480
Dest Branch: main

Event Timeline

brouberol triaged this task as Medium priority.

The following experiment shows two consecutive DAG runs for aawiki only. The first one is without the absolute weight rule, and the second one is with it.

Screenshot 2025-04-10 at 10.15.08.png (642×1 px, 177 KB)

I'm not 100% sure whether enabling absolute weight rules has made the DAG faster or not. I'm going to delete all the data on Ceph, restart the DAG without absolute weight rules, and then start all over again with the weight rules.

What we do see, though, is that the DAG dependency checks are clearly responsible for the increase in scheduler loop time, which means that the DAG is "complex". That probably ties into T391483 more than this ticket though.

I think that the effect of the fixed priority should show when inserting a large DAG, i.e. at the very start. A DAG of 30 tasks might not show the effect as clearly as a DAG of 2000 would. That being said, we do see some effect from absolute weights: the initial peak in scheduling loop time seen in the 1st and 3rd experiments (both without absolute weights) is reduced when absolute weights are used.
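A rough way to see why DAG size matters is to count how many downstream tasks have to be visited when every task sums the weights of its descendants. The sketch below is a toy model only: it assumes a purely linear chain of tasks (the real dumps DAG is not shaped like this), and the function name is hypothetical.

```python
def downstream_lookups(num_tasks: int) -> int:
    """Count descendant visits when each task in a linear chain sums its downstream weights.

    Task i (0-based) in a chain of num_tasks has num_tasks - 1 - i descendants.
    With an absolute weight rule, no descendant needs to be visited at all.
    """
    return sum(num_tasks - 1 - i for i in range(num_tasks))


if __name__ == "__main__":
    for size in (30, 2000):
        print(size, downstream_lookups(size))
```

In this model the work grows quadratically with the number of tasks (435 visits for 30 tasks versus 1,999,000 for 2000), which is consistent with the expectation that small DAGs show little difference.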

Screenshot 2025-04-10 at 11.42.40.png (628×1 px, 203 KB)

I'll now experiment with more wikis.

What is becoming clear is that using an absolute weight rule allows the scheduler to perform its initial work (basically creating all the tasks with the appropriate metadata in the DB) faster. We've been seeing a ~40% reduction in scheduler loop time when starting the DAG for 16 wikis with absolute weights. It's not a game changer for such a long DAG, but we'll take it anyway.

Screenshot 2025-04-10 at 20.58.19.png (2×2 px, 533 KB)

Left is without absolute weights, and right is with them.

NOTE: that 40% might increase as we increase the number of wikis to dump in the DAG.
brouberol changed the task status from Open to In Progress. Apr 10 2025, 7:03 PM

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1228

test_k8s/dumpsv1: set an absolute weight on each dumps task to speed up scheduling