
Design a suitable DAG deployment method
Closed, Resolved (Public)

Description

As part of our work to T362788: Migrate Airflow to the dse-k8s cluster and T364387: Adapt Airflow auth and DAG deployment method, we have to decide (collectively) what method we would like to employ to make DAGs available to the Airflow instances on Kubernetes.

There is some useful background reading here: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html
The requirement to make this decision was also briefly set out in the Airflow - High Availability Strategy document.

We can no longer use the existing deployment method, scap, since it requires SSH access to each Airflow instance to sync the DAG files. This is not feasible under Kubernetes.

There are three main options that are set out in the Airflow docs:

  1. Bake the DAGs into the container image
  2. Use a git-sync sidecar to populate the DAGs directory
  3. Mount the DAGs from an externally populated persistent volume

Option 2 is further broken down into:
2.1 Mounting DAGs using Git-Sync sidecar with Persistence enabled
2.2 Mounting DAGs using Git-Sync sidecar without Persistence

With option 2.1, the scheduler pod is responsible for updating the persistent volume, and this volume is mounted by all of the webserver and worker pods.

With option 2.2, the scheduler pod, each webserver pod, and each worker pod will be responsible for running its own git-sync container and updating its own local copy of the DAGs.

The way that git-sync works, according to the docs, is by using atomic symlink swapping operations:

Part of the process of synchronization of commits from git-sync involves checking out new version of files in a freshly created folder and swapping symbolic links to the new folder, after the checkout is complete. This is done to ensure that the whole DAGs folder is consistent at all times. The way git-sync works with symbolic-link swaps, makes sure that Parsing the DAGs always work on a consistent (single-commit-based) set of files in the whole DAG folder.
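To make the quoted mechanism concrete, here is a minimal Python sketch of the symlink-swap technique (not git-sync's actual code, just an illustration of why readers always see a consistent, single-commit view of the DAGs folder):

```python
import os
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())

def check_out(rev: str) -> None:
    """Simulate checking a commit out into its own freshly created folder."""
    (root / rev).mkdir()
    (root / rev / "example_dag.py").write_text(f"# DAGs as of {rev}\n")

def swap_current(rev: str) -> None:
    """Atomically repoint root/current at root/<rev>."""
    tmp = root / "current.tmp"
    os.symlink(rev, tmp)
    # rename() over an existing path is atomic on POSIX: a reader sees
    # either the old target or the new one, never a half-updated folder.
    os.rename(tmp, root / "current")

check_out("rev-a")
swap_current("rev-a")
check_out("rev-b")
swap_current("rev-b")
```

Airflow's DAG file processor would point at the `current` symlink, so a parse never straddles two commits.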

There are some comprehensive notes on git-sync and persistence options.

The general guidance is:

use git-sync with local volumes only, and if you want to also use persistence, you need to make sure that the persistence solution you use is POSIX-compliant and you monitor the side-effects it might have.

However, they also say:

Depending on the technology behind the persistent volumes might handle git-sync approach differently and with non-obvious consequences. There are a lot of persistence solutions available for various K8S installations and each of them has different characteristics, so you need to carefully test and monitor your filesystem to make sure those undesired side effects do not affect you. Those effects might change over time or depend on parameters like how often the files are being scanned by the Dag File Processor, the number and complexity of your DAGs, how remote and how distributed your persistent volumes are, how many IOPS you allocate for some of the filesystem (usually highly paid feature of such filesystems is how many IOPS you can get) and many other factors.

In our case:

  • The most readily available persistence solution is our Ceph cluster, which is very much adjacent to the dse-k8s cluster.
  • We could potentially look at host-based local volumes.
  • The number and complexity of our DAGs is relatively low, given that each instance's DAGs are less than 1MB of plain text.
  • We do not want to put GitLab in the critical path of each pipeline run, if it can be avoided.

The code and documentation for git-sync is here: https://github.com/kubernetes/git-sync

git-sync can also be run with a webhook listener, so we could have a trusted GitLab runner notify the instance(s) of a merge to the main branch in the same way that we are planning to do for the GitLab HDFS Synchronizer project.

Event Timeline


I think that my recommendation would be to go with option 2.1 Mounting DAGs using Git-Sync sidecar with Persistence enabled.

My feeling is that, at our scale and given its proximity to the dse-k8s cluster, Ceph will be just fine for these DAG volumes.
If we run into any issues with using network-based persistent volumes, then we could either:

  • Switch to local volumes on each dse-k8s worker

or

  • Switch off persistent volumes and use ephemeral volumes for holding DAGs, but be careful of how we respond to Gitlab downtime
BTullis triaged this task as Medium priority. Jun 20 2024, 11:57 AM
BTullis added subscribers: Ahoelzl, Ottomata, JAllemandou and 17 others.

Just curious, how often do the DAGs change? I think I would go with option 2.2, at least until we've run Ceph long enough to break it a few times and stabilize it ;) .

Just curious, how often do the DAGs change? I think I would go with option 2.2, at least until we've run Ceph long enough to break it a few times and stabilize it ;) .

Good question. The airflow-dags repository is shared across all teams, so looking at the airflow-dags project statistics, the current average is 1.2 commits to main per day. The greatest number of commits in a day (so far) was 29.

image.png (355×1 px, 38 KB)

However, that doesn't equate to the number of deployments, since each team can deploy their own DAGs to their own instance at will.
Sometimes, these commits would be batched and deployed weekly, during the Data Engineering deployment train.

In some ways, the fundamental question is: do we want to move to a continuous deployment model, or do we want to retain manual deployment?
Do all teams feel the same way, or do we need to cater for flexibility?

My feeling is that, at our scale and given its proximity to the dse-k8s cluster, Ceph will be just fine for these DAG volumes.

Do we have any experience using Ceph and git-sync together? The Airflow doc talks about the persistent volume being POSIX-compliant, and the Ceph docs claim it's POSIX-compatible, so that's a good sign. As for the other drawbacks of using git-sync with a persistent volume (network bursts, performance issues as the number of DAGs grows, etc.), I think our use case most likely won't experience those.

In some ways, the fundamental question is: do we want to move to a continuous deployment model, or do we want to retain manual deployment?

If the continuous deployment worked reliably, I would imagine everyone would prefer it over manual deployment.

OK, do we have enough agreement to proceed along the lines of option 2? Use a git-sync sidecar to populate the DAGs directory.

We can decide later on whether to use an emptyDir or a persistent volume to hold the local copies of the DAGs, or we can re-evaluate as we get further down the road.

I agree that option 2 is probably a good default. The repo is small, we have a dedicated Ceph cluster. I'm not foreseeing any issue, and we can always re-evaluate.

Per IRC conversation with @hashar last week, I think it would be prudent to invite Release-Engineering-Team into this conversation, as they will likely be most aware of the risks involved with the continuous deployment model generally, and GitLab specifically.

RelEng, the security model offered by Airflow is here. Please let us know if you are able to offer any advice on this topic. We are happy to meet with y'all if that is easier.

The discussion was on Friday July 12th and can be found in https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-releng/20240712.txt

The question was whether anyone had used https://github.com/kubernetes/git-sync/tree/master . To me, it looks similar to having a service that blindly pulls from a remote (similar to git::clone { ensure => latest }, T218900). The idea is to refresh workflows defined at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags and Airflow natively supports that git-sync system.

The scary part is that any merge to the repo ends up straight in production. It felt to me like a great help for a development environment, but super scary for a production environment.

@bking said:

now that I think of it, an authenticated human still has to run the DAG manually. So the exposure shouldn't be worse than what we have already

And eventually I have concluded I am not familiar with either continuous deployment or Airflow and proposed it is raised to the team ;)

an authenticated human still has to run the DAG manually

I don't think this is true. Maybe a new DAG needs to be manually launched (I don't think so?), but changes to an existing DAG will be applied on the next run.

scary part is any merge to the repo end up straight in production

This is true, but I think that is the intention. The goal is to enable continuous deployment.

A similar project is T365659: Implement automated deployment of refinery HQL files to HDFS (via blunderbuss), except instead of git-sync, it will use a GitLab CI job on a merge to trigger a deployment. For this, the scariness can be mitigated by using versioned artifacts in actual jobs (instead of always the latest version).

For this airflow-dags deployment stuff, that's not an option. It's all or nothing.

which sounds super scary

I agree it is a little scary, but I think in practice it won't be much different than what happens now: after merge, a person has to manually deploy. I think they usually do so pretty quickly. This will just remove a manual step.

Thanks @hashar and @Ottomata for following up. Based on this exchange, as well as the comments in the GitLab HDFS synchronizer design doc, I think DPE SRE and RelEng are on the same page as far as the potential security issues posed by this CD implementation.

As such, I'm going to consider this matter closed, but please respond if you're reading this and you think DPE SRE and Releng need to re-open this discussion.

Hm, thought:

Currently, when airflow-dags is deployed, a scap hook is fired that synchronizes declared artifacts to HDFS.

E.g.

See: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide#Artifacts

If airflow-dags is moved to a git-sync based deployment, something else will need to trigger artifact -> HDFS sync.

The mechanism being proposed in T365659: Implement automated deployment of refinery HQL files to HDFS (via blunderbuss) will use GitLab CI to trigger a sync of a directory in a git repository to HDFS. This will not trigger the existing artifact syncing mechanism that is used to e.g. deploy conda or jar artifacts from package registries.

We might be able to piggyback on the 'GitLab CI triggers something in prod' functionality being implemented in T365659 to run the artifact sync?

Or, maybe simpler: after git-sync runs, if repo is updated, could it launch a script to do the sync?
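For what it's worth, recent git-sync v4 releases appear to support exactly this pattern via an exechook that runs after each successful sync (worth verifying against the version we deploy). A sketch of the invocation, where the hook script path is purely hypothetical:

```
git-sync \
  --repo=https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git \
  --root=/dags \
  --link=airflow_dags \
  --exechook-command=/usr/local/bin/sync-artifacts-to-hdfs   # hypothetical script
```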

The scary part is that any merge to the repo ends up straight in production. It felt to me like a great help for a development environment, but super scary for a production environment.

I have to say, upon reflection, that I share your concerns here @hashar.

Whilst I like the idea of implementing a full GitOps-based workflow for engineers, I would be reticent to offload total responsibility for authentication to the GitLab security model.

Airflow is quite clear that we are going to be putting executable files into production, along with a schedule or trigger condition to automate their execution.
https://airflow.apache.org/docs/apache-airflow/stable/security/security_model.html#capabilities-of-dag-authors

DAG authors are able to submit code - via Python files placed in the DAGS_FOLDER - that will be executed in a number of circumstances. The code to execute is neither verified, checked nor sand-boxed by Airflow (that would be very difficult if not impossible to do), so effectively DAG authors can execute arbitrary code on the workers (part of Celery Workers for Celery Executor, local processes run by scheduler in case of Local Executor, Task Kubernetes POD in case of Kubernetes Executor), in the DAG File Processor (which can be either executed as standalone process or can be part of the Scheduler) and in the Triggerer.

The code is only supposed to be able to run in the context of the Executor processes, which have limited rights, but a malicious DAG author could still do a lot of damage if they were able to delete or corrupt files on the Data Lake, for example. There are also potential vulnerabilities to consider, such as this most recent one, which we've just fixed. That vulnerability would potentially allow a DAG author to run code in the context of the scheduler, which should not be allowed. In our current instances that wouldn't represent an elevation of privileges, but under Kubernetes it could well be.

So I think that I would be most comfortable if we came up with a system like the one we use to deploy the Puppet (puppet-merge) or DNS (authdns-update) systems.

Specifically, we would have a standard procedure whereby:

  1. An engineer merges their feature branch to the main branch in the airflow-dags repo in GitLab
  2. Straight away, the same engineer runs a command (e.g. airflow-update) on a deployment server, which runs git-sync against the affected airflow instance(s).

I believe that this would still support a deploy-at-will model, but it just places an additional manual step in the way, which authenticates the deployer via their production SSH connection.

Or, maybe simpler: after git-sync runs, if repo is updated, could it launch a script to do the sync?

I like this thought the best, I think. If we hook up the git-sync mechanism to the HDFS sync mechanism, then we should be able to guarantee a consistent deployment environment with the minimum overhead.

Change #1070024 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: define an external-services entry for gitlab.wikimedia.org

https://gerrit.wikimedia.org/r/1070024

Change #1070026 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: perform an initial git-sync of the dags in an init container

https://gerrit.wikimedia.org/r/1070026

While we haven't settled on a proper delivery mechanism, I've added an initial git-sync step in the init container of the airflow webserver. This will allow us to pull the DAGs once and only once per pod runtime, and we will work on figuring out exactly how we'd like to sync the DAGs without having to redeploy the airflow chart.

Change #1070024 merged by Brouberol:

[operations/puppet@production] global_config: define an external-services entry for gitlab.wikimedia.org

https://gerrit.wikimedia.org/r/1070024

Change #1070026 merged by Brouberol:

[operations/deployment-charts@master] airflow: perform an initial git-sync of the dags in an init container

https://gerrit.wikimedia.org/r/1070026

We have configured git-sync to use a fairly recent git feature: sparse checkouts, which allows cloning a subset of a repository. Although initially envisioned for checking out a small subset of a large/huge monorepo (to keep git reasonably fast), we use it to selectively clone parts of the airflow-dags repository.

  • we always clone wmf_airflow_common
  • we clone the folder associated with the airflow instance (analytics, analytics_test, etc). This way, we ensure that a given airflow instance does not get the DAGs intended for other instances, and thus can't execute them.
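The logs below reference `--sparse-checkout-file=/etc/gitsync/sparse-checkout.conf`; its actual contents aren't shown here, but based on the two bullets above, for the analytics_test instance it might plausibly look something like (the exact pattern syntax depends on whether cone mode is in use):

```
/wmf_airflow_common/
/analytics_test/
```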

Here are the git-sync logs:

brouberol@deploy1003:~$ k logs airflow-webserver-67c4994d69-74tzc -c airflow-production-git-sync
INFO: detected pid 1, running init handler
Flag --branch has been deprecated, use --ref instead
{"logger":"","ts":"2024-09-02 19:15:00.128004","caller":{"file":"main.go","line":397},"level":0,"msg":"setting --ref from deprecated --branch"}
{"logger":"","ts":"2024-09-02 19:15:00.128138","caller":{"file":"main.go","line":568},"level":0,"msg":"starting up","version":"4.2.4","pid":11,"uid":900,"gid":900,"home":"/home/runuser","flags":["--depth=1","--link=airflow_dags","--one-time=true","--repo=https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git","--root=/dags","--sparse-checkout-file=/etc/gitsync/sparse-checkout.conf"]}
{"logger":"","ts":"2024-09-02 19:15:00.129811","caller":{"file":"main.go","line":673},"level":0,"msg":"git version","version":"git version 2.39.2"}
{"logger":"","ts":"2024-09-02 19:15:00.233495","caller":{"file":"main.go","line":1168},"level":0,"msg":"repo directory was empty or failed checks","path":"/dags"}
{"logger":"","ts":"2024-09-02 19:15:00.233543","caller":{"file":"main.go","line":1178},"level":0,"msg":"initializing repo directory","path":"/dags"}
{"logger":"","ts":"2024-09-02 19:15:02.029427","caller":{"file":"main.go","line":1682},"level":0,"msg":"update required","ref":"main","local":"","remote":"5315c8d97dc4b6931fd190ec0743b425b2a18760","syncCount":0}
{"logger":"","ts":"2024-09-02 19:15:02.730781","caller":{"file":"main.go","line":1728},"level":0,"msg":"updated successfully","ref":"main","remote":"5315c8d97dc4b6931fd190ec0743b425b2a18760","syncCount":1}
{"logger":"","ts":"2024-09-02 19:15:02.730911","caller":{"file":"main.go","line":906},"level":0,"msg":"exiting after one sync","status":0}

We get the following DAGs file tree:

runuser@airflow-webserver-67c4994d69-74tzc:/opt/airflow$ find dags/airflow_dags/ -maxdepth 2
dags/airflow_dags/
dags/airflow_dags/wmf_airflow_common
dags/airflow_dags/wmf_airflow_common/templates
dags/airflow_dags/wmf_airflow_common/operators
dags/airflow_dags/wmf_airflow_common/util.py
dags/airflow_dags/wmf_airflow_common/easy_dag.py
dags/airflow_dags/wmf_airflow_common/plugins
dags/airflow_dags/wmf_airflow_common/sensors
dags/airflow_dags/wmf_airflow_common/artifact.py
dags/airflow_dags/wmf_airflow_common/pool.py
dags/airflow_dags/wmf_airflow_common/config
dags/airflow_dags/wmf_airflow_common/partitions_builder.py
dags/airflow_dags/wmf_airflow_common/__init__.py
dags/airflow_dags/wmf_airflow_common/dataset.py
dags/airflow_dags/wmf_airflow_common/metrics
dags/airflow_dags/wmf_airflow_common/hooks
dags/airflow_dags/analytics_test
dags/airflow_dags/analytics_test/dags
dags/airflow_dags/analytics_test/config
dags/airflow_dags/.git

Now that we clone wmf_airflow_common on disk, this unblocks statsd metrics collection, as we rely on a custom statsd client, defined under wmf_airflow_common/metrics/custom_statsd_client.py.

In some ways, the fundamental question is: do we want to move to a continuous deployment model, or do we want to retain manual deployment?

If the continuous deployment worked reliably, I would imagine everyone would prefer it over manual deployment.

I've been thinking about this, as well as talking with @Gehel. At the heart of our DAG deployment model lies git-sync. I'm not even going to mention whether we use persistence or not, for the sake of the delivery argument. What git-sync does is regularly pull from a git repository ref (most likely a branch). You can also run it as a one-off script, or have it pull on a schedule but trigger an immediate pull using a UNIX signal.

Some users have suggested adding a REST API to trigger a pull, but this was declined: the author didn't want the scope creep of securing such an API.

So, let's say we wanted to replicate something à la puppet-merge, where we first merge a patch, and then trigger a manual command to approve/deny the patch deployment. We'd need to build a command, runnable by Data Engineers, that would:

  • figure out what ref was last pulled in a given airflow scheduler pod
  • figure out what ref was last merged in gitlab
  • perform a diff
  • ssh onto the worker node that currently runs the scheduler pod
  • send a SIGHUP signal to the actual git-sync container in the scheduler pod

That departs from our current security model, as we'd need to relax SSH access to the worker nodes. It also feels quite convoluted.

The alternative approach that I've been thinking about, that I feel is much more in line with how git-sync works is:

  • prevent any force-push on the airflow-dags main branch (we can always have a break-glass procedure if we absolutely need one)
  • either require 2 approvals for each airflow-dags MR, or at least require the approval of someone other than the committer
  • once we get the necessary number of approvals, the MR is merged, and the new commits are pulled by git-sync in a timely fashion.

I think this is no less auditable than the solution I described earlier, as the whole state is in git. If we wanted to restrict who is able to approve DAG changes in certain areas of the repo, we could also rely on https://docs.gitlab.com/ee/user/project/codeowners/

I'm obviously open to a counter-argument, but I'll say this: at its core, git-sync is a continuous delivery tool. If we plan not to have such a delivery process, then maybe we should design something else.


I'm going to keep this ^ argument ^ for the sake of transparency, but I've realized that the Gitlab features we'd need (multiple required approvals, codeowners, etc) are Premium features: https://docs.gitlab.com/ee/user/project/merge_requests/approvals/settings.html

Another idea would be to fork git-sync and introduce a REST API through which we could trigger a sync. Having looked at the code, it shouldn't be too difficult to do. However, to be able to display the diff between what's already deployed and what would be deployed, we'd also need to introduce another endpoint that would expose the sha1 of the last deployed commit.

The deployment could look something like this:

  • merge gitlab PR
  • connect to, say, the deployment host
  • run e.g. airflow-dags-merge --instance airflow-analytics, which would:
    • query the git-sync REST API for the currently deployed commit
    • query the gitlab API for the would be deployed commit
    • display a diff
    • if the operator enters yes, then send a POST to the git-sync REST API, thus triggering a git pull

Note that this is nontrivial, as we can't run git-sync as a sidecar: the way ingress and routing are configured, we can only have a single routable domain per chart. While an implementation detail, this might prove a pretty sizeable blocker as well.

@brouberol I know this is not exactly the same thing, but there must be some synergy between this and T365659: Implement automated deployment of refinery HQL files to HDFS (via blunderbuss).

Most of the requirements there seem the same, except for the final destination of HDFS. cc @amastilovic

Final idea of the evening: if we require a 2-step deploy, then we could simply redeploy the scheduler with the sha1 ref of airflow-dags we'd like to pull; we'd pull it once with an init container and forgo any continuous, regular pulls.

Additional thought: once we migrate to KubernetesExecutor instead of LocalExecutor, the dags repo would get cloned by git-sync at the start of each task (i.e. pod), giving us a hard reliance on GitLab being up. I think we should have an asynchronous process that clones the repo via git-sync (using a sha, and not a branch name) and exports the data to, say, S3. Then "pulling the DAGs repo" would basically become an s3 cp operation, moving the dependency from GitLab to Ceph/RadosGW.

Additional thought: once we migrate to KubernetesExecutor instead of LocalExecutor, the dags repo would get cloned by git-sync at the start of each task (i.e. pod), giving us a hard reliance on GitLab being up. I think we should have an asynchronous process that clones the repo via git-sync (using a sha, and not a branch name) and exports the data to, say, S3. Then "pulling the DAGs repo" would basically become an s3 cp operation, moving the dependency from GitLab to Ceph/RadosGW.

My HDFS synchronizer project would be perfect for this.

@amastilovic I'm interested! Do you have something I could read to know more?

What @Stevemunene @BTullis and myself talked about today was the following.

  • The airflow-dags CI triggers the gitlab synchronizer as the last CI step when running on the main branch
  • The synchronizer synchronizes the state of the main branch of airflow-dags to Ceph, by writing to an S3 bucket (or to Ceph directly via https://docs.ceph.com/en/latest/rados/api/python/)

This is where we start having different ideas.

Idea 1: the synchronizer writes to Ceph directly
The synchronizer writes to Ceph, in a location exposed by a PersistentVolume, mounted in the scheduler pod as well as any DAG task pod. This results in the DAG being immediately made available to the scheduler.
However, we lose the DAG isolation per instance, as each pod would be able to read (and execute) all DAGs.

Idea 2: the synchronizer writes to Ceph and another component performs the equivalent of a sparse checkout
The synchronizer writes the full content of airflow-dags to Ceph (directly or to an S3 bucket backed by Ceph). Another process running regularly, as part of the airflow instance k8s deployment, would eventually pick up on the changes and would then write the content of both wmf_airflow_common and the DAG folder associated with the Airflow instance (e.g. test_k8s/ for airflow-test-k8s) to a location on Ceph exposed by a PersistentVolume mounted in the scheduler pod as well as any DAG task pod.

This has the advantage of isolating the DAGs to their associated instance, but it adds a bit of delay to DAG propagation, as we need both the GitLab synchronizer and a custom process to run successfully to get the changes into the scheduler.

From @BTullis's Slack message, it turns out we can't have one pod write and many nodes read from the same PV:

Following on from our conversation in the sync about airflow-DAGs deployment, I was wrong about being able to have one pod write and many nodes read.
https://docs.ceph.com/en/reef/rbd/rbd-kubernetes/#create-a-persistentvolumeclaim
Using ceph-csi, specifying Filesystem for volumeMode can support both ReadWriteOnce and ReadOnlyMany accessMode claims, and specifying Block for volumeMode can support ReadWriteOnce, ReadWriteMany, and ReadOnlyMany accessMode claims.
All of our current persistent volumes use ReadWriteOnce, which means that all pods on a single node can read/write to the volume concurrently.
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
So we could use this for the DAG distribution, but we would have to have a PVC per node.
Or we could use something else, like syncing from an S3 bucket to an emptyDir as an init-container.

Assuming we implement idea 2 from https://phabricator.wikimedia.org/T368033#10187122, I'd have the 2 DAG folders (wmf_airflow_common and the airflow instance DAG folder) written to Ceph so that they are automatically picked up by the scheduler. When a DAG task pod starts, an initContainer would copy the data from Ceph to a local emptyDir volume, so that it can be used while the pod runs.
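A minimal sketch of what that initContainer copy step could do. All paths and folder names here are illustrative stand-ins, not the actual chart's mounts:

```python
import pathlib
import shutil
import tempfile

def stage_dags(shared: pathlib.Path, local: pathlib.Path, instance_folder: str) -> None:
    """Copy the common library plus this instance's DAG folder from the
    shared (Ceph-backed) volume into the pod-local emptyDir."""
    for folder in ("wmf_airflow_common", instance_folder):
        shutil.copytree(shared / folder, local / folder, dirs_exist_ok=True)

# Demo with throwaway directories standing in for the two volumes.
shared = pathlib.Path(tempfile.mkdtemp())
local = pathlib.Path(tempfile.mkdtemp())
for folder in ("wmf_airflow_common", "test_k8s"):
    (shared / folder).mkdir()
    (shared / folder / "example.py").write_text("# dag code\n")
stage_dags(shared, local, "test_k8s")
```

Because the copy lands in an emptyDir, the running pod no longer depends on the shared volume (or GitLab) staying up for the duration of the task.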

However, we lose the DAG isolation per instance, as each pod would be able to read (and execute) all DAGs.

Technically this is true, and this is also how things work now. Airflow's DAG directory is explicitly configured per instance though, so I don't think there is really a worry about having all of the different instance DAG files deployed to every instance. In the ideal multi-tenant future, there'd only be one instance anyway, and all DAGs would be on it.

^ is not a comment on the merit of the two ideas though, just wanted to state that I don't think DAG file isolation is a requirement.

Change #1081922 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: define a cephfs PVC storing the DAGs

https://gerrit.wikimedia.org/r/1081922

Change #1081922 merged by Brouberol:

[operations/deployment-charts@master] airflow: define a cephfs PVC storing the DAGs

https://gerrit.wikimedia.org/r/1081922

brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/883

test_k8s: tweak Datahub_Cleanup to make sure DAGs are pull automatically

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/883

test_k8s: tweak Datahub_Cleanup to make sure DAGs are pull automatically

We have deployed changes to the airflow k8s chart ensuring that DAG changes are automatically pulled and serialized within 5 minutes. This seemed to be a highly requested feature from the Data Engineering teams. That puts a lot of responsibility on the merger. While we can't leverage the "multiple reviewers required" feature, which is in GitLab Enterprise, I think we should still have a conversation about potential restrictions on who gets to merge, or any other potential risk mitigation practices.

@brouberol automatic sync from Ceph every 5 minutes might cause some issues with the HDFS synchronizer. We need to think about the scenario where your 5 minute sync starts in the middle of HDFS synchronization to Ceph - in that case, you will only get part of the airflow-dags repository and Airflow will most likely choke. Sure, it would fix itself on the next run within 5 minutes, but it would be much better if we could mitigate this scenario somehow.

automatically pulled and serialized within 5 minutes

Not sure if I missed this somewhere, but how will artifact file syncing be triggered?
https://phabricator.wikimedia.org/T368033#10032420

Currently, this happens via a scap hook.

@amastilovic sorry, I wasn't clear. We currently sync the DAGs every 5 minutes with git-sync as a stopgap, while the HDFS synchronizer isn't running. When it is, we'll retire that git-sync setup, and will fully rely on it.

@brouberol got it. You'll mount Ceph as a file system local to the Airflow instance, and HDFS sync will write to Ceph - effectively, to Airflow, this will look like a local fs directory being updated.

That's exactly right! And until that materializes, we're running a git-sync process that syncs airflow-dags to the Ceph volume every 5 minutes, as a stopgap.

We had a very interesting talk with @Gehel @hashar @gmodena @dcausse and @JAllemandou this morning, and thanks to them, I think I found a solution that would satisfy everyone. I'm obviously asking for feedback, whether positive or negative, but here it is.

Context

First off, let me sum up the different requirements:

  • Data Engineers would like DAG deployments to be automated
  • SRE would like DAG deployments to be "secure" (aka: no single person can merge their own MR and have their change automatically deployed to production)

@amastilovic is currently building https://gitlab.wikimedia.org/repos/data-engineering/hdfs-synchronizer/ (codename/newname "blunderbuss"), whose job will be to synchronize code and artifacts when its API is called. The general design was to have it be called by a GitLab hook once an MR is merged to the main branch. In the case of Airflow, blunderbuss would synchronize the DAGs to a cephfs volume, also mounted in the airflow schedulers, causing the changes to be automatically picked up by Airflow.

What I initially suggested in https://phabricator.wikimedia.org/T368033#10131748 was to configure the GitLab repo so that an MR would require multiple reviewers before being mergeable, but that is a Premium feature, whereas we run the free edition of GitLab.

I broached the subject of using gerrit (in which such a requirement is easy to implement), but got some pushback from the Data Engineers, who would like to keep using Gitlab, as it suits their needs.

With all of that in mind, here's what I suggest.

Suggestion

We could introduce a simple CI job that would run a script leveraging the GitLab API to ensure that multiple users have approved the MR. Here's a proof of concept of such a script:

#!/usr/bin/env python3

import os
import sys

import gitlab

MIN_APPROVALS = 2

approvals = set()

gl = gitlab.Gitlab('https://gitlab.wikimedia.org', private_token=os.environ['GITLAB_TOKEN'])
project = gl.projects.get(os.environ['CI_MERGE_REQUEST_PROJECT_PATH'])
# python-gitlab looks up merge requests by their project-scoped IID
mr = project.mergerequests.get(int(os.environ['CI_MERGE_REQUEST_IID']))
for discussion in mr.discussions.list(all=True):
    for note in discussion.asdict()['notes']:
        # Approvals show up as system notes with this exact body
        if note['system'] is True and note['body'] == 'approved this merge request':
            approvals.add(note['author']['username'])

if len(approvals) < MIN_APPROVALS:
    print(f"At least {MIN_APPROVALS} distinct approvals are required to merge this MR")
    sys.exit(1)

We'd execute this script as part of the CI jobs that run against feature branches. This way, everybody would get what they want (I think).
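As a sketch, the CI wiring could look something like the fragment below (the job name, image, and script path are hypothetical, and `GITLAB_TOKEN` would need to be provided as a masked CI/CD variable):

```yaml
# Hypothetical .gitlab-ci.yml fragment: run the approval check on MR pipelines only.
check-approvals:
  stage: test
  image: python:3.11
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - pip install python-gitlab
    - python ci/check_approvals.py  # illustrative path to the PoC script
```

Because the job only runs on merge request pipelines, regular branch pipelines are unaffected.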

  • Data engineers would keep on using Gitlab, and get a continuous delivery system for their DAGs
  • Data Platform SREs would be satisfied by the security model around DAG deployment

What do y'all think? cc-ing @Ottomata

Edit: I pushed the PoC to Gitlab https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/933
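For completeness, the approval-counting logic at the heart of the PoC can be factored into a pure function and exercised offline against sample discussion data. The function name and the note dictionaries below are illustrative; the shapes mirror what the Gitlab discussions API returns:

```python
MIN_APPROVALS = 2

def count_distinct_approvers(discussions):
    """Count distinct users who left a system 'approved' note on an MR."""
    approvers = set()
    for discussion in discussions:
        for note in discussion.get('notes', []):
            # Approvals are system notes with this exact body
            if note.get('system') and note.get('body') == 'approved this merge request':
                approvers.add(note['author']['username'])
    return len(approvers)

# Sample data: one human comment, then approvals from alice and bob
# (bob's approval note appearing twice only counts once).
sample = [
    {'notes': [{'system': False, 'body': 'LGTM', 'author': {'username': 'alice'}}]},
    {'notes': [{'system': True, 'body': 'approved this merge request', 'author': {'username': 'alice'}}]},
    {'notes': [{'system': True, 'body': 'approved this merge request', 'author': {'username': 'bob'}}]},
    {'notes': [{'system': True, 'body': 'approved this merge request', 'author': {'username': 'bob'}}]},
]
print(count_distinct_approvers(sample))  # → 2
```

Deduplicating by username means repeated approve/unapprove cycles by the same person can't satisfy the threshold on their own.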

Data Engineers would like DAG deployments to be automated

SRE would like DAG deployments to be "secure" (aka: no single person can merge their own MR and have their change automatically deployed to production)

What do y'all think? cc-ing @Ottomata

Weak opinion: I don't see a lot of security gained by always requiring 2 approvers. I'd personally rather this be best judgement and culture than hard requirements. I'd rather have folks be able to quickly deploy fixes if no one is around, than add bureaucracy requiring them to wait.

But! That is a weak opinion and I will leave it to SREs etc. to decide this.

Stronger opinion: I just ask that the same deployment mechanism and criteria be used for job repos as well as airflow-dags.

I don't think requiring 2 approvers for airflow-dags will add much security if job repos and artifacts will be automatically deployed (and used) by running jobs.

BTW, this is why I was asking this

https://phabricator.wikimedia.org/T368033#10251511

Did you all get to talk about that? How will job artifacts be deployed in the k8s and non-scap world?

I'd personally rather this be best judgement and culture rather than hard requirements

The issue we were trying to address was: how do we balance out security and convenience? In the initial situation, anyone could self-merge, but someone would then have to ssh to a deployment host and perform a scap deploy. Data Engineers have pushed to drop that step in favor of DAG continuous delivery, meaning that we could have been in a situation where anyone with a maintainer's account could self-merge anything and have it be automatically deployed to production.

Best judgment and culture are one thing, but if one maintainer's Gitlab account gets compromised, we at least have one extra step and safety check in place.

How will job artifacts be deployed in the k8s and non-scap world?

My understanding is that a similar kind of setup will be used for both airflow-dags and artifacts, i.e. blunderbuss, which would synchronize assets to Ceph and/or HDFS. Invoking @amastilovic to confirm.

we could have been in a situation where anyone with a maintainers account could self-merge anything and have it be automatically deployed to production.

Aye true, and I suppose a difference is that previously, an airflow instance was only deployed to manually if a user intended to make changes to their instance. In the k8s world, any change to airflow-dags will result in a deploy to all instances automatically, right?

My understanding is that a similar kind of setup will be used for both airflow-dags and artifacts

IIUC, the intention for the blunderbuss project is to enable CD for jobs (HQL, artifacts, etc.). The jobs will be automatically deployed on merge in their repos (outside of airflow-dags).

This means that it will be possible for a single user to merge a change to e.g. analytics/refinery and have the HQL used to run the job automatically deployed.

So, my reason for asking (here) is about the difference in security needs between job repos and airflow-dags. Should we set up 2-approvers CI for job repos too?

First off, one thing I wasn't sufficiently clear about is that you're still free to self-approve, so the check really makes sure someone else has reviewed and approved the merge request.

In the k8s world, any change to airflow-dags will result in a deploy to all instances automatically, right?

That is correct.

IIUC, the intention for the blunderbuss project is to enable CD for jobs.

In https://phabricator.wikimedia.org/T368033#10131836, we started a conversation about how to use blunderbuss for both jobs and DAGs. The tool does exactly what we want it to do, and fits our deployment needs perfectly. The issue we're trying to address, as I mentioned in https://phabricator.wikimedia.org/T368033#10344884, is balancing convenience and security. If a maintainer's Gitlab account were compromised, that would be enough to ship malicious code to production without any oversight or peer review.

My opinion is that we'd need the same kind of oversight wherever we use continuous delivery. So if we're planning to use blunderbuss for jobs, then I think we'd need to ensure any code shipped to production is reviewed, yes.

After multiple slack discussions (https://wikimedia.slack.com/archives/C02291Z9YQY/p1732524489017109 and https://wikimedia.slack.com/archives/CSV483812/p1732609234467399), we have slightly altered the check:

  • we added some documentation
  • we added an escape hatch explaining how to deploy without any code review
  • we reworded "you need 2 approvals" to "you need one peer review", as it was always possible to self-approve

Thanks to everyone who challenged the initial implementation!