
Add support for repository artifacts in Airflow
Open, Needs Triage, Public

Description

Problem

Right now, in Airflow, we (Data Engineering) define our HQL dependencies via direct HDFS paths.
These HQL dependencies are not managed by our Airflow artifact module,
meaning they do not get synced to HDFS automatically at Airflow deployment time for operators to use.
Thus, we need to deploy them manually, using refinery deployment, and make sure the HQL file paths match.
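
For concreteness, here is a minimal sketch of the status quo. The DAG id, operator choice, and HDFS path are all illustrative (our real jobs use dedicated operators), but the key point holds: the path is hard-coded and is only valid because refinery was deployed separately.

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical HDFS path; it must be kept in sync with refinery
# deployments by hand, since the artifact module knows nothing about it.
HQL_PATH = "hdfs:///wmf/refinery/current/hql/example/compute_metrics.hql"

with DAG("example_metrics", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    run_query = BashOperator(
        task_id="run_query",
        # spark-sql can execute an HQL file passed with -f.
        bash_command=f"spark-sql -f {HQL_PATH}",
    )
```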

Why are we doing this?

We could, today, define an artifact for each HQL file; that would work and would remove the need for the extra refinery deployment.
However, this would require defining an artifact for every HQL file used by our jobs, which would probably amount to a couple hundred artifacts.
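
To illustrate the scale problem, this is roughly what per-file definitions would look like. The mapping format, repository name, and URLs are hypothetical, not our actual artifact config syntax; the point is the repetition.

```
# Hypothetical per-file artifact definitions: one entry per HQL file,
# repeated for every query in every job.
hql_artifacts = {
    "compute_metrics.hql": "https://gitlab.wikimedia.org/repos/data-engineering/example/raw/main/hql/compute_metrics.hql",
    "load_events.hql": "https://gitlab.wikimedia.org/repos/data-engineering/example/raw/main/hql/load_events.hql",
    # ... a couple hundred more entries like these.
}
```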

Better solution

Instead, we could add support for treating full repositories as Airflow artifacts. We'd declare the full repository URL as a (e.g. git) artifact,
and Airflow would sync (clone?) the repo to HDFS, like it does with the other artifacts.
Then, from the DAG code, we could use the reference to that artifact as a base path, and append the relative path of the desired query to it.
This way we'd have far fewer artifacts defined, and still let Airflow manage our HQL dependencies (no refinery deployment).
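
A minimal sketch of what DAG code could look like under this proposal. The `artifact()` helper below is a hypothetical stand-in for whatever resolver the artifact module would expose for repository artifacts; the artifact name and all paths are made up.

```
def artifact(name: str) -> str:
    """Hypothetical resolver: returns the HDFS base path that the named
    artifact (here, a cloned repository) was synced to at deployment time."""
    return f"hdfs:///user/analytics/artifacts/{name}"

# The repository is declared once as a git artifact, so DAG code only
# needs the artifact reference plus the query's relative path in the repo.
hql_repo_base = artifact("airflow-dags-hql")
hql_path = f"{hql_repo_base}/hql/example/compute_metrics.hql"
```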

Event Timeline

@lbowmaker I think the solution offered by T333001 is slightly different from what this task proposes.
The linked task allows including SQL query file paths directly in SparkSqlOperators,
while this task aims to import a whole repository as an artifact.
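
To make the contrast concrete, without asserting anything about the real SparkSqlOperator signature, the difference is only in how the path handed to the operator is built (this reuses the hypothetical `artifact()` helper from the sketch above):

```
# T333001, roughly: DAGs reference the query file path directly when
# constructing the operator.
direct_path = "hdfs:///wmf/refinery/current/hql/example/compute_metrics.hql"

# This task, roughly: a repository artifact supplies the base path and the
# DAG appends the query's relative path within the repo.
artifact_path = artifact("airflow-dags-hql") + "/hql/example/compute_metrics.hql"
```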

@mforns - thanks for clarifying. Re-opened