
Agree on a repository structure for Airflow-related code
Closed, Duplicate · Public

Description

To be able to start writing Airflow code, we need to define a repository structure for our related code.
Let's discuss in this task, and come to an agreement :-)

Event Timeline

Option 1: Single repository

Put all code in the same repository, organized in high-level folders including:

  • Shared code (DAG templates, custom operators, common defaults, etc.)
  • Code specific to team A
  • Code specific to team B
  • ...
  • Code specific to team N
  • Others (scripts, config, CI, license, readme, etc.)
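To make "shared code" a bit more concrete, here is a minimal sketch of what a shared-defaults module in the common folder could look like, with each team's DAGs layering their own overrides on top. All names here (`shared_default_args`, `build_default_args`, the owner strings) are illustrative assumptions, not an agreed convention:

```python
# Hypothetical shared module, e.g. shared/defaults.py in the monorepo.
# Holds team-agnostic Airflow default_args that every team's DAGs extend.
from datetime import timedelta

shared_default_args = {
    "owner": "data-engineering",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
}


def build_default_args(**team_overrides):
    """Merge a team's overrides onto the shared defaults.

    The shared dict is copied, so teams cannot mutate each other's defaults.
    """
    merged = dict(shared_default_args)
    merged.update(team_overrides)
    return merged


# A team's DAG file would then do something like:
# default_args = build_default_args(owner="team-a", retries=1)
```

This is exactly the kind of code where the versioning concern below bites: a change to `shared_default_args` immediately affects every team importing it.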

PROS

  • Having a single repository is easier and faster to configure and get working.
  • The repository could include a set of scripts that quickly spin up a development environment one could use, e.g. on a stats machine. That would be usable consistently by all teams, and we could all contribute to improving it.
  • Having one single repository for team code helps set a standard for coding (syntax and folder structure) and best practices across teams.
  • Having one single repository for team code makes changes/features/ideas from other teams more visible and easier to reuse; it also gives newcomers plenty of examples to base their new code on.

CONS

  • We might have shared-code versioning problems. If a team modifies a shared feature, it can affect the development/deployment pipelines of other teams.

Option 2: Common repository + Shared library repository

Have 2 repositories: The common repository, organized in high-level folders per team, including:

  • Code specific to team A
  • Code specific to team B
  • ...
  • Code specific to team N
  • Others (scripts, config, CI, license, readme, etc.)

And the shared library repository, including:

  • Shared code (DAG templates, custom operators, common defaults, etc.)
  • Others (scripts, config, CI, license, readme, etc.)

PROS

  • The common teams repository could include a set of scripts that quickly spin up a development environment one could use, e.g. on a stats machine. That would be usable consistently by all teams, and we could all contribute to improving it.
  • Having one single repository for team code helps set a standard for coding (syntax and folder structure) and best practices across teams.
  • Having one single repository for team code makes changes/features/ideas from other teams more visible and easier to reuse; it also gives newcomers plenty of examples to base their new code on.
  • The separate shared library repository allows for shared-code versioning, so that each team can develop and deploy at its own pace.
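One way the last point could play out in practice: each team pins the shared-library version its DAGs were tested against, so a release of the shared repository cannot silently break another team's pipeline. A minimal sketch of such a guard, assuming a simple `major.minor.patch` versioning scheme (all names and the compatibility policy here are hypothetical):

```python
# Hypothetical import-time guard in a team's DAG code: fail fast if the
# installed shared library drifts from what this team has tested against.

# The (major, minor) this team's DAGs were developed against.
REQUIRED_SHARED_VERSION = (1, 4)


def parse_version(version_string):
    """Turn a version string like '1.4.2' into a tuple (1, 4, 2)."""
    return tuple(int(part) for part in version_string.split("."))


def is_compatible(installed_version):
    """Same major version, and at least the minor version we require."""
    installed = parse_version(installed_version)
    return (installed[0] == REQUIRED_SHARED_VERSION[0]
            and installed[1] >= REQUIRED_SHARED_VERSION[1])


# e.g. assert is_compatible(shared_lib.__version__), "upgrade/pin shared lib"
```

In the single-repository option this guard is unnecessary (everything moves together), which is the trade-off between the two options in one line.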

CONS

  • Configuring the 2 repositories is less straightforward than configuring just 1 repository.
odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Airflow on the Analytics board.

Hey @mforns thanks for starting this.

To keep complexity low to begin with, I'd be keen to start with a monorepo like the pattern proposed in Option 1. IMHO it's easier to split things up than to consolidate them.

In practice I need to cut my teeth and understand the practical implications of the trade-offs you highlighted. I don't have a clear model of the boundaries between teams/workflows, or of the way DAGs and dependencies could be vendored and deployed.

I'm toying around with this structure https://github.com/gmodena/wmf-platform-airflow-dags, that is inspired by best practices suggested by Astronomer (and other literature) but I would not take for granted that they can map directly to our use cases. It's a monorepo along the lines you highlighted in Option 1 and aims to follow the approach taken by Search & Discovery for deployment (scap).

Happy to write up some thoughts once I have some more hands on experience.

FWIW I found these references useful:

@gmodena thanks for chiming in!

To keep complexity low to begin with, I'd be keen to start with a monorepo like the pattern proposed in Option 1. IMHO it's easier to split things up than to consolidate them.

Agreed. We in Data Engineering are also leaning towards option 1.

In practice I need to cut my teeth and understand the practical implications of the trade-offs you highlighted. I don't have a clear model of the boundaries between teams/workflows, or of the way DAGs and dependencies could be vendored and deployed.

Yes, as we're all starting with Airflow, it's difficult to figure out the best way from scratch. I think we'll learn things as we move forward.

Thanks for the references. I found the one about dag-factories very interesting :]

mostly echoing @gmodena - I'm in favor of option 1. The ML Team has also been working with a monorepo for Lift-Wing inference services (model servers, clients & config) which has allowed us to iterate quickly with low overhead while we get the project off the ground. We've tried to keep things modular in case we need to split the repo in the future, but so far things have been good for the past 6-8 months.

I also think having a single repo would assist in onboarding newcomers to Airflow and help us reach consensus on best practices, etc.

Thanks @ACraze for your thoughts!
I think those of us who have spoken in this task share a common understanding and a preference for option 1.
People I've spoken to outside the task are also not opposed to the mono-repository.
I'd say we can move forward with it.