
Setup for allowing Airflow deployment via Git Repository
Closed, ResolvedPublic5 Estimated Story Points

Description

User Story
As a Data Engineer, I need to configure Airflow to allow data pipeline producers to deploy via Git repository artifacts
Why?

So that data pipeline producers can use their own repositories to manage code, instead of refinery

Success Criteria

  • A simple job runs that is configured to point to an HQL file in a repository that isn't refinery
  • Documentation (especially versioning)

Event Timeline

+1 to such a mechanism. I was discussing a similar mechanism with @Htriedman recently. There is no need for other folks to be blocked by refinery's weekly release cadence.

JArguello-WMF set the point value for this task to 5.
xcollazo changed the task status from Open to In Progress.Apr 10 2023, 3:09 PM
xcollazo moved this task from Next Up to In Progress on the Data Pipelines (Sprint 11) board.

Some possible solutions:

(A) Load artifacts from a project's conda-pack tarball.

We did this for a couple of artifacts for the Structured Data team. Here is an example. Basically, we declare in the pyspark project that we have artifacts that need to be loaded from the tarball.

Pros: Simple, already available mechanism that will be versioned automatically.

Cons: This mechanism only works from inside the pyspark job. You cannot, say, use it to make an HQL file available to an Airflow SparkSqlOperator, because SparkSqlOperator depends on files being available on HDFS. (Details: SparkSQLCLIDriver.scala#L210, which in turn calls CliDriver.java#L436.)
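For illustration, here is a rough sketch of the pattern (not the linked Structured Data example; the package and file names are hypothetical), showing that the bundled file is only reachable from inside the job itself:

# Sketch of option (A): a PySpark job reading an HQL file that ships inside its
# own conda-pack tarball. Package and file names below are hypothetical.
from importlib.resources import files

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bundled-hql-example").getOrCreate()

# The query is bundled as package data, so it is only reachable from within the
# job itself -- which is exactly the limitation described above.
query = files("my_project.sql").joinpath("my_query.hql").read_text()
spark.sql(query).show()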

(B) Load artifacts using existing artifacts.yaml and artifact-cache warm.

We already have an artifact mechanism that can pull any artifact by URI. We typically use it for tarballs, but it could pull in anything else. We would simply add to the Airflow instance's artifacts.yaml file something like this:

...

  my-project-sql-statement.hql:
    id: https://gitlab.wikimedia.org/repos/my-org/my-project/my-project-sql-statement.hql
    source: url

...

Then in the Airflow DAG we do:

airflow_op = SparkSqlOperator(
    task_id=f"airflow_op_example",
    sql=var_props.get('hql_path', artifact('my-project-sql-statement.hql')),
    query_parameters={
        'source_table': destination_table,
        'destination_directory': temporary_directory,
        'year': year,
        ...
    },
)

Pros: Already available, no need to code anything. Folks using Airflow are familiar with this mechanism. Automatically deploys to HDFS and leverages the artifact() helper method.

Cons: If we add many HQL files to the artifacts.yaml of a particular Airflow instance, it may bloat the file and make it hard to read. To avoid this, we could split the existing artifacts.yaml into executable_artifacts.yaml and static_artifacts.yaml, or some similar arrangement.

(C) Come up with an alternative to refinery that follows a similar pattern

Refinery artifacts are deployed via a scap script that, from the point of view of HDFS, does the following:

  1. Create a versioned HDFS file path, i.e., /wmf/refinery/2023-03-30T09.38.08+00.00--scap_sync_2023-03-30_0001
  2. Make /wmf/refinery/current point to (1)
  3. Deploy all artifacts into (1)

Pros: Well known pattern from our group. Has been working fine for Data Eng for a long while.

Cons: This will require new code, and it will also require that different users know about each other, since we will need a new repo that hosts these artifacts. It can also require coordination between users, which is one of the things we want to avoid.

I think we should go with option (B), but would like to understand if other folks have other ideas and/or there are other requirements I am unaware of.

CC @lbowmaker @mforns @mpopov

I think I'd also prefer option B! Seems straightforward enough from a usability perspective.

I like idea A because the conda env encapsulates all needed libs.

Another idea: we could add the HQL file beside the DAG. We would begin with this structure:

airflow-dags/analytics/dags/aqs/aqs_hourly_dag.py
airflow-dags/analytics/dags/aqs/aqs_hourly.hql

We would run a periodic parser, alongside the Airflow scheduler and webserver, to send the HQL files to HDFS, adding an MD5 (or the beginning of the commit SHA-1) as a suffix.

airflow-dags/analytics/dags/aqs/aqs_hourly.hql
=> hdfs:///wmf/airflow-dags/analytics/dags/aqs/aqs_hourly_bed78f6e.hql

Then our code would need a small adaptation to the artifact function, adding an optional boolean parameter, beside_the_dag, which would compute the HDFS path with the suffix (see the sketch below).
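Here is a minimal standalone sketch of the path computation such a beside_the_dag option might perform (the helper name and HDFS root are assumptions for illustration, not the real artifact() code):

import hashlib
from pathlib import Path

HDFS_DAGS_ROOT = "hdfs:///wmf/airflow-dags/analytics/dags"  # assumed sync target

def hdfs_path_beside_the_dag(dag_file: str, hql_name: str) -> str:
    """Map an HQL file committed next to the DAG to its HDFS copy,
    suffixed with a short content hash (the suffix mentioned above)."""
    local = Path(dag_file).parent / hql_name
    digest = hashlib.md5(local.read_bytes()).hexdigest()[:8]
    return f"{HDFS_DAGS_ROOT}/{local.parent.name}/{local.stem}_{digest}{local.suffix}"

# e.g. hdfs_path_beside_the_dag(__file__, "aqs_hourly.hql")
#      -> hdfs:///wmf/airflow-dags/analytics/dags/aqs/aqs_hourly_bed78f6e.hql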

I agree option B is good (although yes, the artifacts.yaml file gets potentially long).

When Andrew built the artifact module, we thought about this repository-as-artifact feature. Our blurry idea was to allow something like this:

# artifacts.yaml

my_repository_artifact:
    id: "https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git"
    version: 0.2.17  # <--- not sure how this would work
    source: wmf_gitlab

This would tell Airflow where the repository lives and how to retrieve it.
The Airflow deployment would somehow read this repo and sync (clone) it to the HDFS cache.
And then, from the DAG file, we'd use the artifact function to get the repo's base path in HDFS, like:

hql_file = artifact("my_repository_artifact") + "/hql/my_query.hql"

This approach would be flexible: users would only need to commit files to a git repo; they wouldn't need to package them.
It would also keep the artifacts.yaml file from getting too long, since you'd need only one artifact (per version of the repo).
And it would follow the same logic as the other artifacts, which would reduce the cognitive load.

The issue that we saw was that the artifact module operates as if artifacts were files that can be opened, read and written.
And that did not match the repository cloning logic...

I like idea A because the conda env encapsulates all needed libs.

When Andrew built the artifact module, we thought about this repository-as-artifact feature. Our blurry idea was to allow something like this:
...

The issue that we saw was that the artifact module operates as if artifacts were files that can be opened, read and written.
And that did not match the repository cloning logic...

A mix of these ideas would be best!

We could reuse the conda tarball when available, by documenting how to generate it so that it includes files like HQLs; and for projects that do not use conda, GitLab allows us to get a tarball of a project via its API. For example, for image_suggestions tag v0.11.0, we would just hit this URI: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/archive/v0.11.0/image-suggestions-v0.11.0.tar.gz.

Also, we have the SparkSubmitOperator.for_virtualenv() helper method that, upon further inspection, uses spark-submit's --archives argument to send in a tarball that can be addressed via an alias.

I'll investigate if we could reuse this not just for PySpark but for any operator.

I'm thinking of something like this:

image_suggestions_tarball:
  id: https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/archive/v0.11.0/image-suggestions-v0.11.0.tar.gz   <<< We could document that this should be a tag instead of main?
  source: url

So one entry on artifacts.yaml.

And then modify SparkSqlOperator to accept archives with aliases, just like SparkSubmitOperator.for_virtualenv():

airflow_op = SparkSqlOperator(
    task_id="airflow_op_example",
    archive=artifact('image_suggestions_tarball') + '#alias',
    sql=var_props.get('hql_path', 'alias/hql/my_query.hql'),
    query_parameters={
        'source_table': destination_table,
        'destination_directory': temporary_directory,
        'year': year,
        ...
    },
)

I'm not sure of this at all, but let me investigate whether the archives flag works from skein.

This looks great to me!
Would it be possible with this approach to get always the latest version of a repo?
Not sure this would be needed, but maybe?

This looks great to me!
Would it be possible with this approach to get always the latest version of a repo?
Not sure this would be needed, but maybe?

Yes, you can request to get HEAD from any branch. If you want HEAD from main then:

curl "https://gitlab.wikimedia.org/api/v4/projects/1261/repository/archive.tgz?sha=main" --output archive.tgz

Now, I was hoping that we would not do that. Instead I would prefer that we set this to an immutable URL such as a git tag, or a specific commit SHA. That way deployments are repeatable. A user can update the URI anytime to pick up changes. And thus versioning is pretty easy. WDYT @mforns?

When Andrew built the artifact module, we thought about this repository-as-artifact feature.
[...]
This would tell Airflow where the repository lives and how to retrieve it.
The Airflow deployment would somehow read this repo and sync (clone) it to the HDFS cache.

I think the intention was for job code, HQL or otherwise, to live in repos separate from the scheduler code (airflow-dags), and for the job repo CI to publish artifacts that could automatically be deployed to HDFS and used by airflow-dags.

I don't know if we had really considered cloning and syncing whole repository trees; I think that might be a little brittle.

Instead I would prefer that we set this to an immutable URL such as a git tag, or a specific commit SHA

Gitlab allows publishing anything to its package registry. Could the job repos have reusable CI to build and publish a source tarball, and then you could use all the usual artifact syncing stuff for that tarball?

Then you are basically treating the tarball the same way as you might a project release. You could even automate the whole thing: push a new tag -> build and publish tarball package.

Hm, reading the above comments more carefully, I think that is kind of what you are trying to do, eh? I think getting the archive from the package registry will be better than storing the archive in git.

Would it be possible with this approach to get always the latest version of a repo?

Yes, you can request to get HEAD from any branch. If you want HEAD from main then:

Now, I was hoping that we would not do that. Instead I would prefer that we set this to an immutable URL such as a git tag, or a specific commit SHA. That way deployments are repeatable. A user can update the URI anytime to pick up changes. And thus versioning is pretty easy.

I really like this modified version of option B.

So modifications to the query & dag would essentially be behind 2 gates when deploying:

  • Merge request into main (which is the production branch; we would then enforce that development happens on non-main branches only)
  • Update the immutable URL, which I think should use a commit SHA rather than git tags, because in my opinion dealing with semantic versioning for tagging would be unnecessary overhead

Instead I would prefer that we set this to an immutable URL such as a git tag, or a specific commit SHA

Gitlab allows publishing anything to its package registry. Could the job repos have reusable CI to build and publish a source tarball, and then you could use all the usual artifact syncing stuff for that tarball?

Then you are basically treating the tarball the same way as you might a project release. You could even automate the whole thing: push a new tag -> build and publish tarball package.

Hm, reading the above comments more carefully, I think that is kind of what you are trying to do, eh?

What I propose above would use Gitlab's 'Repositories API' to generate a tarball of a project. Since the API allows us to set a sha parameter, we get robust versioning for free. So there doesn't seem to be a need for persisting artifacts at all in this case; they're generated at runtime.
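For example, a small hypothetical helper that forms such an immutable tarball URL (the archive endpoint and sha parameter come from GitLab's Repositories API, as in the curl example earlier in the thread; the helper itself is illustrative):

GITLAB_API = "https://gitlab.wikimedia.org/api/v4"

def gitlab_archive_url(project_id: int, ref: str) -> str:
    """Build a Repositories API archive URL pinned to a branch, tag, or commit SHA."""
    return f"{GITLAB_API}/projects/{project_id}/repository/archive.tgz?sha={ref}"

# gitlab_archive_url(1261, "main") reproduces the curl example earlier in the thread;
# passing a tag or commit SHA instead yields an immutable, reproducible URL.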

I really like this modified version of option B.

Good to hear it fits your use case!

So modifications to the query & dag would essentially be behind 2 gates when deploying:

  • Merge request into main (which is the production branch; we would then enforce that development happens on non-main branches only)
  • Update the immutable URL, which I think should use a commit SHA rather than git tags, because in my opinion dealing with semantic versioning for tagging would be unnecessary overhead

The process would be a bit longer: there is the airflow-dags repo, where we keep all the DAGs, and there is your my-project repo, where you keep your business logic (HQL files, pyspark jobs, etc.). To release:

  1. Commit your changes to my-project.
  2. If you're dealing with pyspark/R/Java, do a conda artifact release via our workflow_utils Gitlab CI set of jobs. If you just want your HQL files, you can just use the Gitlab Repositories API that I explained in the above comment to form an immutable URI.
  3. Commit your URI to your Airflow instance artifacts.yaml file available at airflow-dags.
  4. Do a scap deploy of your Airflow instance. This will copy the artifacts from the URI into HDFS, effectively putting your stuff in prod.

Regarding versioning details (i.e., main as prod, SHAs vs. tags vs. semantically versioned tags), we can come up with best practices, but I think teams can decide those details individually as well. From my point of view, all I need is for the URI to be immutable, so that we have a reproducible environment.

What I propose above would use Gitlab's 'Repositories API' to generate a tarball of a project

Oh, cool! Sounds great.

If you're dealing with pyspark/R/Java, do a conda artifact release via our workflow_utils Gitlab CI set of jobs. If you just want your HQL files, you can just use the Gitlab Repositories API that I explained in the above comment to form an immutable URI.

Interesting! I wonder though...if the job code is just a standalone pyspark or R script (no extra dependencies), you might be able to use the repo tarball for this as well.

R script (no extra dependencies), you might be able to use the repo tarball for this as well.

Didn't we exclude R from base conda-analytics environment due to size? So even base R without any extra packages would still require a conda artifact release (note for future self: using these instructions), correct? (Not asking for any changes to that decision, just checking.)

even base R without any extra packages would still require a conda artifact release

Ah true. You are right.

An update:

I now have a working solution for this task. This is how the updated process looks:

There is the airflow-dags repo, where we keep all the DAGs, and there is your my-project repo, where you keep your HQL files.

To release SQL files:

  1. Commit your changes to my-project.
  2. Use GitLab's Raw File API to generate a URL that spits out your HQL file verbatim. These URLs look like so: https://gitlab.wikimedia.org/api/v4/projects/1261/repository/files/test%2Ftest_hql.hql/raw?ref=0e4d2a9. Notice how after /files/, you URL-encode the path to your HQL file, and then with the ref parameter you can point to a particular commit SHA or tag (see the helper sketch after this list).
  3. Update your SparkSqlOperator with your new URL from (2):
SparkSqlOperator(
    task_id="do_hql",
    # To run an HQL file, simply use a GitLab raw URI that points to it.
    # See how to build such a URI here:
    # https://docs.gitlab.com/ee/api/repository_files.html#get-raw-file-from-repository
    # We strongly recommend using an immutable URI (i.e. one that includes a SHA or a tag) for reproducibility.
    sql=var_props.get(
        'hql_gitlab_raw_path',
        'https://gitlab.wikimedia.org/api/v4/projects/1261/repository/files/test%2Ftest_hql.hql/raw?ref=0e4d2a9'
    ),
    query_parameters={
        'destination_directory': f'/tmp/xcollazo_test_generic_artifact_deployment_dag/{{{{ts_nodash}}}}',
        'snapshot': '2023-01-02',
    },
    launcher='skein',
)
  4. Do a scap deploy of your Airflow instance.
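As a reference for step (2), here is a small hypothetical helper showing how such a raw-file URL can be formed (the endpoint and ref parameter come from GitLab's Raw File API; the helper name is illustrative):

from urllib.parse import quote

GITLAB_API = "https://gitlab.wikimedia.org/api/v4"

def gitlab_raw_file_url(project_id: int, file_path: str, ref: str) -> str:
    """Build a Raw File API URL: the file path is URL-encoded after /files/,
    and ref pins a commit SHA or tag."""
    encoded = quote(file_path, safe="")  # 'test/test_hql.hql' -> 'test%2Ftest_hql.hql'
    return f"{GITLAB_API}/projects/{project_id}/repository/files/{encoded}/raw?ref={ref}"

# gitlab_raw_file_url(1261, "test/test_hql.hql", "0e4d2a9") reproduces the URL in step (2).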

Note how there is no need to declare artifacts anymore, as we are pulling the SQL statement directly from GitLab's REST API. This solution thus only works for SQL files (or any other single file), but I believe that was the main intention.

Yesterday, @mforns, @JAllemandou and I got together to discuss https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/362.

We agreed that the URL approach is reasonable for the use case of deploying SQL code for folks that don't want to get very involved with Airflow. So we will move forward with this MR.

We also agreed that a more general mechanism for deploying tarballs and/or full repositories is still valuable. We will pursue that as part of a separate existing ticket: T322690.

Mentioned in SAL (#wikimedia-operations) [2023-04-27T19:03:59Z] <xcollazo@deploy1002> Started deploy [airflow-dags/platform_eng@f162f4d]: Deploying T333001 on platform_eng Airflow instance.

Mentioned in SAL (#wikimedia-operations) [2023-04-27T19:16:00Z] <xcollazo@deploy1002> Finished deploy [airflow-dags/platform_eng@f162f4d]: Deploying T333001 on platform_eng Airflow instance. (duration: 12m 01s)

Mentioned in SAL (#wikimedia-operations) [2023-05-18T18:56:38Z] <xcollazo@deploy1002> Started deploy [airflow-dags/platform_eng@502ddae]: T333001

Mentioned in SAL (#wikimedia-operations) [2023-05-18T18:57:13Z] <xcollazo@deploy1002> Finished deploy [airflow-dags/platform_eng@502ddae]: T333001 (duration: 00m 35s)