Create Code Repo and Structure
Closed, Resolved · Public · 5 Estimated Story Points

Description

User Story
As a platform engineer, I need to create a new repo and structure to store the code for this use case and future use cases.

A broader discussion on how to organise and share Airflow DAGs at WMF is happening at https://phabricator.wikimedia.org/T290664.

The goal of 292743 is to define some structure for *development* DAGs. These are not yet part of the Foundation-wide production deployments; instead, they target the PET-owned an-airflow1003 instance.

Success Criteria
  • New repo with well-structured folders and a clear layout

References

The repo structure is inspired by

WIP

Details

Due Date
Nov 9 2021, 5:00 AM

Event Timeline

lbowmaker set Due Date to Nov 9 2021, 5:00 AM.
lbowmaker set the point value for this task to 5.

Let's say a dataset producer only cares about image recs and has no involvement in similar users.

Would they have to clone/fork the project, make their change to image recs (without touching the similar users code), and push their changes back? Am I understanding that correctly?

That's correct. With the proposed approach, any change to the Image Matching data pipeline (Airflow DAG, PySpark transformations, pinning versions for new algorithm releases) would require opening a PR against this code base. This would not touch similarusers or other projects in the repo.

Adding some comments from the grooming session today:

  • 1 repo with folders for each 'project' (similar users, image recs, etc)
  • platform-airflow-dags repo is for data pipeline code - not application code
  • Application code lives in, and is owned via, the repo of the team that created it. In the case of image recs, the code is packaged by research, lives in their repo, and is published as a package. Airflow DAGs install the package as a dependency
  • If application code owners want to make a change, they do it in their repo, publish a new version and then Platform Eng updates DAG dependencies/config in the data pipeline repo
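The last two points can be sketched as a per-project version pin kept in the pipeline repo. This is only an illustration of the idea, not the actual repo contents: the project keys, package names, and version numbers below are all hypothetical, and the real pins would likely live in whatever dependency file each project folder uses.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProjectPin:
    """Pinned application package for one project folder (names are hypothetical)."""
    package: str
    version: str

    def requirement(self) -> str:
        # pip-style exact pin, e.g. "some-package==1.2.3"
        return f"{self.package}=={self.version}"


# One entry per project folder in the pipeline repo (hypothetical values).
PINS: Dict[str, ProjectPin] = {
    "image-matching": ProjectPin("image-matching-algo", "0.1.0"),
    "similarusers": ProjectPin("similarusers-model", "0.3.2"),
}


def bump(pins: Dict[str, ProjectPin], project: str, new_version: str) -> Dict[str, ProjectPin]:
    """Return a copy of `pins` where only `project` moves to `new_version`."""
    if project not in pins:
        raise KeyError(f"unknown project: {project}")
    updated = dict(pins)
    updated[project] = ProjectPin(pins[project].package, new_version)
    return updated


# Research publishes a new release; Platform Eng bumps only that project's pin.
bumped = bump(PINS, "image-matching", "0.2.0")
print(bumped["image-matching"].requirement())
print(bumped["similarusers"].requirement())
```

A change of this kind touches one project's entry only, which is the property discussed above: an image recs release does not require any edit to similar users.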

Hi @lbowmaker and @gmodena :]

I wanted to comment on this task, to better understand your Airflow needs and make sure that we do not repeat any work regarding Airflow jobs. I read in the task description that:

The goal of 292743 is to define some structure for *development* DAGs. Those are not yet part of the Foundation's wide production deployments, and are rather targeting the PET owned an-airflow1003 instance.

Now, our plan is that an-airflow1003 becomes the production Airflow instance for the Platform Eng team at some point. So, if platform-airflow-dags targets an-airflow1003, then it will be production, no?
If so, then I think it would be cool to find a way for the common airflow repository to cover your use cases, so we all use the same one.

LMK your thoughts! Cheers

Hey @mforns,

Now, our plan is that an-airflow1003 becomes the production Airflow instance for the Platform Eng team at some point. So, if platform-airflow-dags targets an-airflow1003, then it will be production, no?

Could we define production in terms of basic SLOs?
My understanding is that the Ganeti VMs were meant for experimentation, and we treat them as such.

We don't make assumptions about any specific host. If an-airflow1003 goes away, we can target other instances (provided they are configured with the same conventions).
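As a rough sketch of what "no assumptions about a specific host" could look like in deployment tooling, the target can be a parameter with a default rather than a hard-coded value. The environment variable names and the DAGs directory below are hypothetical placeholders, not our actual conventions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DeployTarget:
    """Where DAGs get deployed; both fields are overridable."""
    host: str
    dags_dir: str


def resolve_target(env: Optional[Mapping[str, str]] = None) -> DeployTarget:
    """Build the deploy target from the environment, falling back to defaults.

    PIPELINE_AIRFLOW_HOST and PIPELINE_DAGS_DIR are hypothetical variable names.
    """
    env = os.environ if env is None else env
    return DeployTarget(
        host=env.get("PIPELINE_AIRFLOW_HOST", "an-airflow1003"),
        # Placeholder path: the shared convention every instance would follow.
        dags_dir=env.get("PIPELINE_DAGS_DIR", "/path/to/airflow/dags"),
    )


# Default target, then the same tooling pointed at another instance.
default_target = resolve_target({})
other_target = resolve_target({"PIPELINE_AIRFLOW_HOST": "some-other-instance"})
print(default_target.host, other_target.host)
```

Switching instances then only means changing one setting, as long as the new instance follows the same directory conventions.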

However, we do make assumptions about having a development airflow instance, with access to YARN+HDFS, that we can deploy code to (and run dags on) autonomously. If an-airflow1003 becomes a prod host, we'll still need a dev environment with these capabilities.

I like the proposal you previously made to spin up dev instances "on demand" on stat hosts. We could easily integrate that with our tooling, but we need to align milestones & timelines.

I have an RFC piece of documentation for this task at https://meta.wikimedia.org/wiki/User:GModena_(WMF)/Pipelines_Repo_Structure.

It’s a draft under my username to discuss a setup coupled with our current dev environment.

Action points for closing this task:

  • Give an ack re repo renaming
  • Give an ack re development vs production airflow systems
  • Give an ack re breaking out projects using submodules (at a later stage)

Created this placeholder for our backlog in case there are any comments that require further work:

https://phabricator.wikimedia.org/T295364

Marking this ticket as done after review.

@gmodena: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!