When writing or modifying an Airflow DAG, it has become customary to rely on development environments (also called airflow devenvs) to test the patch end to end.
These instances can be created and deleted at will, and are configured to be as close as possible to the original instance. For example, when creating an airflow devenv to work on a DAG under the test_k8s root folder, the configuration of the airflow-test-k8s instance is used, so that the devenv instance behaves as closely as possible to the "real" one.
One recurring issue we've seen is the time it takes to serialize all DAGs defined under that root folder, especially for instances that define a lot of DAGs.
To ensure the fastest feedback loop possible, we'd need to introduce a flag to the airflow-devenv create command that specifies the path of the DAG currently being worked on. That path would then be injected as a helm value and used by the airflow chart to restrict the set of files pulled by gitsync when performing a sparse checkout.
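As a rough illustration of the flag-to-helm-value step, here is a minimal sketch in Python. The helm value key `gitsync.sparse_checkout_paths` and both function names are hypothetical, not existing airflow-devenv or chart names; the sketch only assumes that DAG paths are given relative to the repository root and must live under the chosen root folder.

```python
from pathlib import PurePosixPath


def sparse_checkout_paths(dags_folder: str, dag_paths: list[str]) -> list[str]:
    """Return the repo-relative paths to keep in the sparse checkout.

    Rejects any path that is absolute, contains "..", or falls outside
    dags_folder, so the devenv never pulls files from another instance.
    """
    root = PurePosixPath(dags_folder)
    kept = set()
    for raw in dag_paths:
        path = PurePosixPath(raw)
        if path.is_absolute() or ".." in path.parts:
            raise ValueError(f"{raw!r} must be a relative path without '..'")
        if path != root and root not in path.parents:
            raise ValueError(f"{raw!r} is not under {dags_folder!r}")
        kept.add(str(path))
    return sorted(kept)


def helm_set_args(paths: list[str], key: str = "gitsync.sparse_checkout_paths") -> list[str]:
    """Build `helm --set` arguments for a list value.

    The key is a hypothetical chart value; helm addresses list elements
    with the key[index]=value syntax.
    """
    args = []
    for index, path in enumerate(paths):
        args += ["--set", f"{key}[{index}]={path}"]
    return args
```

For example, `helm_set_args(sparse_checkout_paths("test_k8s", ["test_k8s/dags/foo.py"]))` yields the arguments to append to the helm command that creates the devenv. Validating the path before templating keeps a typo from silently producing an empty checkout.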
Ideally, we could also infer this value at runtime by listing the files modified in the branch passed to airflow-devenv create --branch, and automatically do The Right Thing™. If we go this way, we may even think about inferring the value of the airflow-devenv create --dags-folder argument the same way.
Definition of done:
- A new airflow-devenv version is deployed where a user is given the option to specify the DAGs they'd like to be serialized
- A demo is communicated broadly (either pre-recorded or live) to the DPE population
- Instructions are added to the airflow-devenv README and wiki as to how to test changes in the CLI before releasing them as a new deb version
Additional reading:
