As part of our work to T362788: Migrate Airflow to the dse-k8s cluster and T364387: Adapt Airflow auth and DAG deployment method, we have to decide (collectively) what method we would like to employ to make DAGs available to the Airflow instances on Kubernetes.
There is some useful background reading here: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html
The requirement to make this decision was also set out briefly in the Airflow - High Availability Strategy document.
We can no longer use the existing deployment method, scap, since it requires SSH access to each Airflow instance in order to sync the DAG files, which is not feasible under Kubernetes.
There are three main options that are set out in the Airflow docs:
- Bake the DAGs into the container image
- Use a git-sync sidecar to populate the DAGs directory
- Mount the DAGs from an externally populated persistent volume
Option 2 is further broken down into:
2.1 Mounting DAGs using Git-Sync sidecar with Persistence enabled
2.2 Mounting DAGs using Git-Sync sidecar without Persistence
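For reference, the Airflow Helm chart exposes these choices through its values file. A minimal sketch of option 2.2 follows; the repository URL is illustrative only, and the exact keys should be checked against the chart version we deploy:

```yaml
dags:
  persistence:
    # Option 2.1 would set this to true, with a storageClassName backed
    # by a POSIX-compliant volume.
    enabled: false
  gitSync:
    enabled: true
    # Illustrative repository URL, not a real project path.
    repo: https://gitlab.wikimedia.org/example/airflow-dags.git
    branch: main
    subPath: ""
```

With persistence disabled, every scheduler, webserver, and worker pod gets its own git-sync sidecar and an emptyDir volume for the DAGs, which is what distinguishes 2.2 from 2.1.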
With option 2.1, the scheduler pod is responsible for updating the persistent volume, which is then mounted by all of the webserver and worker pods.
With option 2.2, the scheduler pod, each webserver pod, and each worker pod will be responsible for running its own git-sync container and updating its own local copy of the DAGs.
The way that git-sync works, according to the docs, is by using atomic symlink swapping operations:
Part of the process of synchronization of commits from git-sync involves checking out new version of files in a freshly created folder and swapping symbolic links to the new folder, after the checkout is complete. This is done to ensure that the whole DAGs folder is consistent at all times. The way git-sync works with symbolic-link swaps, makes sure that Parsing the DAGs always work on a consistent (single-commit-based) set of files in the whole DAG folder.
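The symlink-swap step can be illustrated with a short sketch. This is not git-sync's actual code; the function name and paths are ours, but the technique (build the new symlink under a temporary name, then rename it over the live one) is the same:

```python
import os

def atomic_swap(link_path: str, new_target: str) -> None:
    """Repoint link_path at new_target atomically, emulating git-sync's
    symlink swap: readers always see either the old tree or the new
    tree, never a partially checked-out one."""
    # Create the replacement symlink under a temporary name in the same
    # directory, then rename it over the live link. rename(2) is atomic
    # on POSIX filesystems, so the DAG parser never observes a missing
    # or half-populated DAGs directory.
    tmp = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)
```

This only holds on filesystems where rename is atomic, which is exactly why the Airflow docs insist on POSIX-compliant persistence when combining git-sync with a shared volume.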
There are some comprehensive notes on git-sync and persistence options.
The general guidance is:
use git-sync with local volumes only, and if you want to also use persistence, you need to make sure that the persistence solution you use is POSIX-compliant and you monitor the side-effects it might have.
However, they also say:
Depending on the technology behind the persistent volumes might handle git-sync approach differently and with non-obvious consequences. There are a lot of persistence solutions available for various K8S installations and each of them has different characteristics, so you need to carefully test and monitor your filesystem to make sure those undesired side effects do not affect you. Those effects might change over time or depend on parameters like how often the files are being scanned by the Dag File Processor, the number and complexity of your DAGs, how remote and how distributed your persistent volumes are, how many IOPS you allocate for some of the filesystem (usually highly paid feature of such filesystems is how many IOPS you can get) and many other factors.
In our case:
- The most readily available persistence solution is our Ceph cluster, which is very much adjacent to the dse-k8s cluster.
- We could potentially look at host-based local volumes.
- The number and complexity of our DAGs is relatively low, given that each instance's DAGs are less than 1MB of plain text.
- We do not want to put GitLab in the critical path of each pipeline run, if it can be avoided.
The code and documentation for git-sync is here: https://github.com/kubernetes/git-sync
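As a rough idea of what each sidecar would run, a git-sync (v4) invocation might look like the following; the repository URL and paths are illustrative, and the flags should be verified against the git-sync version we pin:

```shell
git-sync \
  --repo=https://gitlab.wikimedia.org/example/airflow-dags.git \
  --ref=main \
  --root=/git \
  --link=dags \
  --period=30s
# /git/dags is the symlink that git-sync atomically repoints after each
# successful sync; Airflow's DAGs folder would be pointed at it.
```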
git-sync can also be run with a webhook listener, so we could have a trusted GitLab runner notify the instance(s) of a merge to the main branch in the same way that we are planning to do for the GitLab HDFS Synchronizer project.
