Migrate Oozie jobs to Airflow
Closed, Resolved | Public

Description

User Story
As a data engineer, I want to begin consolidating our ETL jobs into Airflow so that I can deploy, maintain, and optimise our data pipelines faster.
Scope

We currently have 100 workflows that are triggered and managed by Oozie. Each workflow orchestrates several kinds of steps that run in order, including:

  • HQL scripts running in Hive
  • Spark scripts
  • HDFS calls
  • Conditional checks
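
To see how these step kinds are distributed across a workflow, one option is to inventory its Oozie definition. The following is a minimal sketch, assuming the workflows are defined in standard Oozie `workflow.xml` files. The sample XML is hypothetical and simplified (not schema-complete), and `count_step_kinds` is an illustrative helper, not existing tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified workflow used only for illustration.
SAMPLE_WORKFLOW = """
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="check_input"/>
  <decision name="check_input">
    <switch>
      <case to="load_table">${fs:exists(input_path)}</case>
      <default to="end"/>
    </switch>
  </decision>
  <action name="load_table">
    <hive xmlns="uri:oozie:hive-action:0.5"><script>load.hql</script></hive>
    <ok to="aggregate"/><error to="fail"/>
  </action>
  <action name="aggregate">
    <spark xmlns="uri:oozie:spark-action:0.1"><jar>agg.py</jar></spark>
    <ok to="cleanup"/><error to="fail"/>
  </action>
  <action name="cleanup">
    <fs><delete path="/tmp/example"/></fs>
    <ok to="end"/><error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_step_kinds(workflow_xml):
    """Count decision nodes and action types (hive, spark, fs, ...) in a workflow."""
    root = ET.fromstring(workflow_xml)
    kinds = Counter()
    for node in root:
        name = local_name(node.tag)
        if name == "decision":
            kinds["decision"] += 1
        elif name == "action":
            # The action type is the first child that is not a transition.
            for child in node:
                child_name = local_name(child.tag)
                if child_name not in ("ok", "error"):
                    kinds[child_name] += 1
                    break
    return kinds


if __name__ == "__main__":
    print(dict(count_step_kinds(SAMPLE_WORKFLOW)))
```

Running the same count over every workflow would give a rough per-job complexity signal, for example the number of decision nodes or distinct action types.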

Goal:
Transition those workflows so that they are managed by Airflow.

Next Steps:
  • Identify low-risk, low-complexity jobs for new team members to begin migrating.
  • Create a migration plan for the more complex and higher-risk jobs.
  • Identify current Oozie jobs that could be redesigned or refactored when moved to Airflow.
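
One way to turn the first two steps into a concrete ordering is to score each job and sort. This is a rough sketch only: the 1 to 3 scale, the job names, and the `Job`, `migration_order`, and `starter_jobs` names are all hypothetical placeholders, not real assessments or existing tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    risk: int        # hypothetical scale: 1 = low, 3 = high
    complexity: int  # hypothetical scale: 1 = low, 3 = high


def migration_order(jobs):
    """Sort jobs so the lowest combined risk and complexity comes first."""
    return sorted(jobs, key=lambda j: (j.risk + j.complexity, j.risk, j.name))


def starter_jobs(jobs, max_score=1):
    """Jobs suitable for new team members: low risk and low complexity."""
    return [j for j in jobs if j.risk <= max_score and j.complexity <= max_score]


# Placeholder entries; real values would come from the team's assessment.
JOBS = [
    Job("job_a", risk=3, complexity=2),
    Job("job_b", risk=1, complexity=1),
    Job("job_c", risk=1, complexity=2),
    Job("job_d", risk=2, complexity=1),
]

if __name__ == "__main__":
    for job in migration_order(JOBS):
        print(job.name, job.risk, job.complexity)
```

Ties on the combined score are broken by risk first, on the assumption that a lower-risk job is the safer one to migrate early.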
Success Criteria
  • All our Oozie jobs have been moved into our Airflow instance.
  • Oozie is no longer required to schedule data jobs.
Open questions / remarks
  • Do we have all the required operators?
  • Who needs to validate that the pipeline is working as intended?
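
For the operators question, a first pass could compare the Oozie action types in use against a table of candidate Airflow operators. The sketch below is an assumption-laden starting point: the class paths are upstream Airflow 2.x and provider-package locations, and whether our instance has those providers installed, or uses custom operators instead, is not confirmed here. `CANDIDATE_OPERATORS` and `missing_operators` are hypothetical names.

```python
# Candidate mapping from Oozie step kinds to upstream Airflow 2.x operators.
# Assumption: these are suggestions, not the operators our instance ships.
CANDIDATE_OPERATORS = {
    "hive": "airflow.providers.apache.hive.operators.hive.HiveOperator",
    "spark": "airflow.providers.apache.spark.operators.spark_submit.SparkSubmitOperator",
    # HDFS calls could be wrapped in shell commands such as `hdfs dfs ...`.
    "fs": "airflow.operators.bash.BashOperator",
    # Conditional checks map naturally onto branching.
    "decision": "airflow.operators.python.BranchPythonOperator",
}


def missing_operators(step_kinds, available=CANDIDATE_OPERATORS):
    """Return the step kinds that have no candidate operator yet."""
    return sorted(set(step_kinds) - set(available))


if __name__ == "__main__":
    # "shell" is another Oozie action type, used here to show a gap.
    print(missing_operators(["hive", "spark", "shell", "decision"]))
```

Any step kind reported as missing would need either an existing operator identified or a new one written before the jobs using it can be migrated.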

Event Timeline

odimitrijevic subscribed.

Please create a subtask for each specific job that is migrated.

Please add columns for the assessed risk and complexity to the following spreadsheet so that we can prioritize accordingly.
https://docs.google.com/spreadsheets/d/1lfK5Idteh6zPSlCWyH34FJCl_Lcm8401Wm59Jgk-7wM/edit?usp=sharing

Change 756017 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Airflow: Fix links in error emails

https://gerrit.wikimedia.org/r/756017