
Agree on a repository structure for Airflow-related code
Closed, Duplicate · Public

Description

To be able to start writing Airflow code, we need to define a repository structure for our related code.
Let's discuss in this task, and come to an agreement :-)

Event Timeline

Option 1: Single repository

Put all code in the same repository, organized in high-level folders including:

  • Shared code (DAG templates, custom operators, common defaults, etc.)
  • Code specific to team A
  • Code specific to team B
  • ...
  • Code specific to team N
  • Others (scripts, config, CI, license, readme, etc.)
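To make "shared code" a bit more concrete, here is a minimal sketch of what a shared-defaults module in the common folder could look like, with each team's DAGs layering their own overrides on top. All names here (`shared_default_args`, `build_default_args`, the owner strings) are illustrative assumptions, not an agreed convention:

```python
# Hypothetical shared module, e.g. shared/defaults.py in the monorepo.
# Holds team-agnostic Airflow default_args that every team's DAGs extend.
from datetime import timedelta

shared_default_args = {
    "owner": "data-engineering",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}


def build_default_args(**team_overrides):
    """Merge a team's overrides onto the shared defaults.

    The shared dict is copied, so teams cannot mutate each other's defaults.
    """
    merged = dict(shared_default_args)
    merged.update(team_overrides)
    return merged


# A team's DAG file would then do something like:
# default_args = build_default_args(owner="team-a", retries=1)
```

This is exactly the kind of code where the versioning concern below bites: a change to `shared_default_args` immediately affects every team importing it.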

PROS

  • Having a single repository is easier and faster to configure and get working.
  • The repository could include a set of scripts that quickly spin up a development environment one could use, e.g. on a stats machine. That would be usable consistently by all teams, and we could all contribute to improving it.
  • Having one single repository for team code helps set a standard for coding (syntax and folder structure) and best practices across teams.
  • Having one single repository for team code makes changes/features/ideas from other teams more visible and easier to reuse; it also gives newcomers plenty of examples to base their new code on.

CONS

  • We might have shared-code versioning problems. If a team modifies a shared feature, it can affect the development/deployment pipelines of other teams.

Option 2: Common repository + Shared library repository

Have 2 repositories: The common repository, organized in high-level folders per team, including:

  • Code specific to team A
  • Code specific to team B
  • ...
  • Code specific to team N
  • Others (scripts, config, CI, license, readme, etc.)

And the shared library repository, including:

  • Shared code (DAG templates, custom operators, common defaults, etc.)
  • Others (scripts, config, CI, license, readme, etc.)

PROS

  • The common teams repository could include a set of scripts that quickly spin up a development environment one could use, e.g. on a stats machine. That would be usable consistently by all teams, and we could all contribute to improving it.
  • Having one single repository for team code helps set a standard for coding (syntax and folder structure) and best practices across teams.
  • Having one single repository for team code makes changes/features/ideas from other teams more visible and easier to reuse; it also gives newcomers plenty of examples to base their new code on.
  • The separate shared library repository allows for shared-code versioning, so that each team can develop and deploy at its own pace.
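One way the last point could play out in practice: each team pins the shared-library version its DAGs were tested against, so a release of the shared repository cannot silently break another team's pipeline. A minimal sketch of such a guard, assuming a simple `major.minor.patch` versioning scheme (all names and the compatibility policy here are hypothetical):

```python
# Hypothetical import-time guard in a team's DAG code: fail fast if the
# installed shared library drifts from what this team has tested against.

# The (major, minor) this team's DAGs were developed against.
REQUIRED_SHARED_VERSION = (1, 4)


def parse_version(version_string):
    """Turn a version string like '1.4.2' into a tuple (1, 4, 2)."""
    return tuple(int(part) for part in version_string.split("."))


def is_compatible(installed_version):
    """Same major version, and at least the minor version we require."""
    installed = parse_version(installed_version)
    return (installed[0] == REQUIRED_SHARED_VERSION[0]
            and installed[1] >= REQUIRED_SHARED_VERSION[1])


# e.g. assert is_compatible(shared_lib.__version__), "upgrade/pin shared lib"
```

In the single-repository option this guard is unnecessary (everything moves together), which is the trade-off between the two options in one line.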

CONS

  • Configuring the 2 repositories is less straightforward than configuring just 1 repository.
odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Airflow on the Analytics board.

Hey @mforns thanks for starting this.

To keep complexity low to begin with, I'd be keen to start with a monorepo like the pattern proposed in Option 1. IMHO it's easier to split things up than to consolidate them.

In practice I need to cut my teeth and understand the practical implications of the trade-offs you highlighted. I don't have a clear model of the boundaries between teams/workflows, or of the way DAGs and dependencies could be vendored and deployed.

I'm toying around with this structure https://github.com/gmodena/wmf-platform-airflow-dags, that is inspired by best practices suggested by Astronomer (and other literature) but I would not take for granted that they can map directly to our use cases. It's a monorepo along the lines you highlighted in Option 1 and aims to follow the approach taken by Search & Discovery for deployment (scap).

Happy to write up some thoughts once I have some more hands on experience.

FWIW I found these references useful:

@gmodena thanks for chiming in!

To keep complexity low to begin with, I'd be keen to start with a monorepo like the pattern proposed in Option 1. IMHO it's easier to split things up than to consolidate them.

Agreed. We in Data Engineering are also leaning towards option 1.

In practice I need to cut my teeth and understand the practical implications of the trade-offs you highlighted. I don't have a clear model of the boundaries between teams/workflows, or of the way DAGs and dependencies could be vendored and deployed.

Yes, as we're all starting with Airflow, it's difficult to figure out the best way from scratch. I think we'll learn things as we move forward.

Thanks for the references. I found the one about dag-factories very interesting :]

mostly echoing @gmodena - I'm in favor of option 1. The ML Team has also been working with a monorepo for Lift-Wing inference services (model servers, clients & config) which has allowed us to iterate quickly with low overhead while we get the project off the ground. We've tried to keep things modular in case we need to split the repo in the future, but so far things have been good for the past 6-8 months.

I also think having a single repo would assist in onboarding newcomers to Airflow and help us reach consensus on best practices, etc.

Thanks @ACraze for your thoughts!
I think those of us who have spoken in this task share a common understanding and a preference for option 1.
People I've spoken to outside the task are also not opposed to the mono-repository.
I'd say we can move forward with it.