Allow certain DAGs to be ignored when creating an airflow development environment
Closed, Resolved, Public

Description

When working on writing/modifying an Airflow DAG, it has become customary to rely on development environments (also called airflow devenvs) to perform end-to-end testing of the patch being worked on.

These instances can be created and deleted at will, and are configured to match the original instance as closely as possible. For example, when creating an airflow devenv to work on a DAG under the test_k8s root folder, the configuration for the airflow-test-k8s instance is used, so that the devenv instance behaves as much like the "real" one as possible.

One recurring issue we've seen is the time it takes to serialize all DAGs defined under that root folder, especially for instances with many DAGs.

To keep the feedback loop as fast as possible, we'd need to add a flag to the airflow-devenv create command that specifies the path of the DAG currently being worked on. That path would then be injected as a helm value and used by the airflow chart to restrict the set of files pulled by gitsync when performing a sparse checkout.
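A minimal sketch of how such a flag could flow from the CLI into helm values. The `--dag-path` option and the `gitsync.sparse_checkout_paths` values key are hypothetical names chosen for illustration; neither is taken from airflow-devenv or the airflow chart.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy parser mirroring `airflow-devenv create`; not the real CLI."""
    parser = argparse.ArgumentParser(prog="airflow-devenv")
    sub = parser.add_subparsers(dest="command", required=True)
    create = sub.add_parser("create")
    create.add_argument("--branch")
    create.add_argument("--dags-folder")
    # Hypothetical flag: may be repeated to select several DAG files.
    create.add_argument("--dag-path", action="append", default=[])
    return parser


def helm_values(args: argparse.Namespace) -> dict:
    """Build the (hypothetical) helm values restricting the sparse checkout."""
    # Assumption: with no --dag-path, fall back to the whole DAGs folder.
    paths = args.dag_path or [args.dags_folder]
    return {"gitsync": {"sparse_checkout_paths": sorted(set(paths))}}
```

For example, parsing `create --dags-folder test_k8s --dag-path test_k8s/dags/foo_dag.py` would yield a values dict listing only that one file.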

Ideally, we could also infer this value at runtime by listing the files modified in the branch passed to airflow-devenv create --branch and automatically do The Right Thing™. If we go this way, we might even infer the value of the airflow-devenv create --dags-folder argument the same way.
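One way the inference could look, as a rough sketch. It assumes the branch is compared against `origin/main` (the base ref is a guess), that every modified `.py` file is a DAG of interest, and that the DAGs folder is the single top-level directory shared by the changed files. The function names are illustrative only.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List, Optional


def changed_files(branch: str, base: str = "origin/main", repo: str = ".") -> List[str]:
    """List files changed on `branch` since it diverged from `base`."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{base}...{branch}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def infer_dags_folder(paths: List[str]) -> Optional[str]:
    """Return the single top-level folder touched, or None if ambiguous."""
    roots = set()
    for path in paths:
        parts = PurePosixPath(path).parts
        if len(parts) > 1:  # skip files at the repository root
            roots.add(parts[0])
    return roots.pop() if len(roots) == 1 else None


def infer_dag_files(paths: List[str]) -> List[str]:
    """Assumption: any modified Python file is a DAG worth serializing."""
    return sorted(p for p in paths if p.endswith(".py"))
```

When the changed files span several root folders, `infer_dags_folder` returns `None`, at which point the CLI would presumably still need an explicit --dags-folder.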

Definition of done:

  • A new airflow-devenv version is deployed where a user is given the option to specify the DAGs they'd like to be serialized
  • A demo is communicated broadly (either pre-recorded or live) to the DPE population
  • Instructions are added to the airflow-devenv README and wiki as to how to test changes in the CLI before releasing them as a new deb version

Event Timeline

Thank you for tagging this task with good first task for Wikimedia newcomers!

Newcomers often may not be aware of things that may seem obvious to seasoned contributors, so please take a moment to reflect on how this task might look to somebody who has never contributed to Wikimedia projects.

A good first task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new contributor, for example it should clearly point to the codebase URL and provide clear steps to help a contributor get set up for success. We've included some guidelines at https://phabricator.wikimedia.org/tag/good_first_task/ !

Thank you for helping us drive new contributions to our projects <3

atsuko changed the task status from Open to In Progress. (Edited Apr 7 2026, 2:52 PM)
  1. Located the definition of the sparse checkout file.
  2. Test plan: ~/airflow-devenv/airflow_devenv/cli.py create --charts-dir ~/deplo...
  3. Checking out a selected set of files requires accounting for all of their dependencies, which is generally impossible (though I'll assume no one does dynamic includes) and also non-deterministic (this is better done with strict, hermetic build systems). I am researching whether
    • there is a better way to limit DAG visibility in airflow,
    • there is a way to quick-check dependencies.

Change #1268951 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] airflow: dag filter helper function

https://gerrit.wikimedia.org/r/1268951

I'm using the config.core.might_contain_dag_callable hook to filter all the files the airflow scheduler considers for serialisation. This approach doesn't limit which files materialise from git, so users can safely depend on any file inside the repo.
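To illustrate the idea (this is not the code from the patch above), here is a minimal sketch of a callable with the signature Airflow documents for `[core] might_contain_dag_callable`: `(file_path, zip_file=None) -> bool`. The `DEVENV_DAG_ALLOWLIST` environment variable and the function name are hypothetical.

```python
import fnmatch
import os
import zipfile
from typing import Optional

# Hypothetical variable: comma-separated path suffixes or glob patterns.
ALLOWLIST_ENV = "DEVENV_DAG_ALLOWLIST"


def only_selected_dags(file_path: str,
                       zip_file: Optional[zipfile.ZipFile] = None) -> bool:
    """Return True if the scheduler should look for DAGs in `file_path`."""
    raw = os.environ.get(ALLOWLIST_ENV, "")
    patterns = [p.strip() for p in raw.split(",") if p.strip()]
    if not patterns:
        # No filter configured: accept every file. A real callable would
        # more likely defer to Airflow's default heuristic here.
        return True
    normalized = file_path.replace(os.sep, "/")
    # Prefix "*" so patterns match relative to any checkout location.
    return any(fnmatch.fnmatchcase(normalized, "*" + p) for p in patterns)
```

Such a function would be referenced by its dotted import path in the airflow configuration. Because only DAG discovery is filtered, helper modules imported by the selected DAG remain available on disk.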

An airflow-devenv counterpart is almost up for review. I'm finishing the actual filtering code in the airflow deployment, and it will be ready today or tomorrow.

Change #1268951 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: dag filter helper function

https://gerrit.wikimedia.org/r/1268951

Todo:

  1. Update docs in README and on wikitech
  2. Release the deb
  3. Roll out the deb
  4. Announce and demo

atsuko@apt1002:~$ sudo -i reprepro ls airflow-devenv
airflow-devenv | 0.0.20 | bullseye-wikimedia | amd64, i386
airflow-devenv | 0.0.20 | bookworm-wikimedia | amd64, i386

Recorded a demo; I'm going to present it at DPE meetings this week and next.

atsuko triaged this task as Low priority.
atsuko closed this task as Resolved.