
[Airflow] Create repository for Airflow DAGs
Closed, Resolved · Public

Description

We agreed to create a single repository for all teams using Airflow.
It should contain Airflow DAG code and related functionality.

Naming:
Candidates are airflow-dags and airflow-jobs.

Tool:
Should we use GitLab or Gerrit? [SOLVED]
The team agreed unanimously: let's use GitLab.

Event Timeline

My personal opinion is that it should be called airflow-config so that the file directory structure will look like airflow-config/dags/my_cool_dag etc.

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Airflow on the Analytics board.

Heya @razzi :]

My personal opinion is that it should be called airflow-config so that the file directory structure will look like airflow-config/dags/my_cool_dag etc.

I think the name airflow-config covers one part of what the repository will contain very well: Spark configs, Skein configs, dependency configs, job configs (frequency, data dependencies, etc.),
but at the same time it doesn't cover another important part of what's going to be there, which is job logic and job definitions.
Maybe we should go with airflow-jobs since it's less controversial than airflow-dags and more intuitive?

How about calling the repository wikimedia-airflow? The repository name is going to be imported, so it's worth thinking about it as a python module name as well as a repository / folder name.

In the airflow docs we see:

The default location for your DAGs is ~/airflow/dags.

so we'd be following their pattern. Since the repository is Wikimedia-specific, we'd call it wikimedia-airflow rather than simply airflow.

With wikimedia-airflow, our users will probably put it somewhere like ~/work/wikimedia-airflow, and then reference DAGs with a command like python ~/work/wikimedia-airflow/dags/example.py.

More importantly, when we want to do things like share operators, our code will look like:

from wikimedia_airflow.operators import cool_operator

which reads logically. If we use a name like airflow_jobs, it'll look like:

from airflow_jobs.operators import cool_operator

which to me is confusing, because I wouldn't think that operators would be in the airflow_jobs module.
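
To make the repository-name versus module-name point concrete, here is a minimal sketch. It relies only on the fact that Python identifiers cannot contain hyphens, so a repository called wikimedia-airflow would have to be imported as wikimedia_airflow. The helper name module_name_for is hypothetical and not part of any proposal in this task.

```python
import keyword


def module_name_for(repo_name: str) -> str:
    """Map a repository name to the Python module name it would be imported as.

    Hypothetical helper: it only applies the common hyphen-to-underscore
    convention and checks that the result is a usable identifier.
    """
    candidate = repo_name.replace("-", "_")
    if not candidate.isidentifier() or keyword.iskeyword(candidate):
        raise ValueError(f"{repo_name!r} cannot be used as a module name")
    return candidate


if __name__ == "__main__":
    # Candidate names mentioned in this task.
    for repo in ("wikimedia-airflow", "airflow-jobs", "airflow-dags"):
        print(repo, "->", module_name_for(repo))
```

Whichever name is chosen, the hyphenated form is what appears in the checkout path and the underscored form is what appears in import statements.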

Hey @razzi :] responding here:

How about calling the repository wikimedia-airflow? The repository name is going to be imported, so it's worth thinking about it as a python module name as well as a repository / folder name.

This repository won't be used as a package by other software (at least that's how we've planned it so far). It is intended solely for use with Airflow.
Now, I imagine you are talking about the DAGs importing shared functionality like DAG templates, custom operators, plugins and such. If so, I agree with you: there will be imports.
However, this repository will be structured in different folders: one folder per team (containing all DAGs for that team) plus one folder containing the shared functionality.
The shared functionality folder can be a separate Python package of its own, and we can name it freely to make for intuitive and elegant imports.

from airflow_jobs.operators import cool_operator

Yes, with our current repo, this won't look like that. It should be something like:

from airflow_tools.operators import cool_operator

(Replace airflow_tools with any cooler name that you can think of)
We can name the shared folder module the way we prefer, regardless of repo name.
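
As a rough sketch of that last point, the snippet below builds a throwaway repository with one team folder plus a shared package, then imports from the shared package the way a DAG file would. Everything in it is an assumption made for illustration: the team folder name analytics, the package name airflow_tools, the describe function, and the idea that the repository root is placed on sys.path. It is not the agreed layout, only a demonstration that the import line stays the same under different repository names.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_sketch_repo(root: Path, repo_name: str) -> Path:
    """Create a throwaway repo: one folder per team plus a shared package."""
    repo = root / repo_name
    # Hypothetical team folder holding that team's DAG files.
    (repo / "analytics" / "dags").mkdir(parents=True)
    # Hypothetical shared package; its name is independent of the repo name.
    operators = repo / "airflow_tools" / "operators"
    operators.mkdir(parents=True)
    (repo / "airflow_tools" / "__init__.py").write_text("")
    (operators / "__init__.py").write_text("")
    (operators / "cool_operator.py").write_text(
        "def describe():\n    return 'cool_operator from airflow_tools'\n"
    )
    return repo


def import_shared_operator(repo: Path):
    """Import the shared module the way a DAG file would, with repo on sys.path."""
    # Forget any copy imported from a previously built repo.
    for name in [m for m in sys.modules if m.split(".")[0] == "airflow_tools"]:
        del sys.modules[name]
    sys.path.insert(0, str(repo))
    try:
        importlib.invalidate_caches()
        from airflow_tools.operators import cool_operator
        return cool_operator
    finally:
        sys.path.remove(str(repo))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The import line is identical under either candidate repository name.
        for repo_name in ("airflow-dags", "airflow-jobs"):
            repo = build_sketch_repo(Path(tmp), repo_name)
            module = import_shared_operator(repo)
            print(repo_name, "->", module.describe())
```

How the shared package actually ends up on the import path (installed as a package, or picked up from a folder Airflow already adds to sys.path) is left open here.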