Spark3 migration - Currently existing airflow jobs
Closed, Resolved (Public)

Details

Title | Reference | Author | Source Branch | Dest Branch
Update spark3 analytics for SparkSQLNoCLIDriver | repos/data-engineering/airflow-dags!62 | joal | spark3_sql | main

Event Timeline

JAllemandou renamed this task from Plan spark3 migration - possibly incrementally to Plan spark3 jobs migration - possibly incrementally. Apr 27 2022, 6:53 AM
JAllemandou renamed this task from Plan spark3 jobs migration - possibly incrementally to Plan spark3 migration - possibly incrementally.

Decisions for Spark3:

  • We're gonna merge and release the refinery-source patch bumping Spark and Scala as is, changing the refinery-source version to 0.2.0 (not all jobs have been tested; the list is documented in the commit message).
  • We're gonna use this new refinery-source release to migrate existing Airflow jobs to Spark3, using the SparkSQLNoCLIDriver in cluster mode instead of the skein-in-client-mode deploy strategy. Some Airflow hacking might be needed here.
  • Merging the refinery-source code doesn't impact already running jobs, as we reference jars by version. However, it means that any new change to Scala code needs to be done in Scala 2.12, and the related jobs need to be migrated to Spark3 (and therefore to Airflow). This should push us to migrate to Airflow faster :)
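
To illustrate the two points above (cluster-mode submission, and jars referenced by version), here is a minimal sketch. It is not the actual airflow-dags code: the helper names, the artifact directory, the `refinery-job` artifact name, the older version number, the main class package, and the application arguments are all hypothetical. Only the `spark-submit` flags `--master`, `--deploy-mode`, and `--class` are standard Spark options.

```python
from typing import List


def versioned_jar(artifact: str, version: str,
                  base_dir: str = "/hypothetical/artifacts") -> str:
    """Return a jar path pinned to an explicit version.

    The directory layout and file naming are assumptions for illustration.
    """
    return f"{base_dir}/{artifact}-{version}.jar"


def spark_submit_args(main_class: str, jar: str,
                      app_args: List[str]) -> List[str]:
    """Assemble a spark-submit command that runs the driver on YARN in cluster mode."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",  # driver runs inside the cluster, not on the launcher host
        "--class", main_class,
        jar,
        *app_args,
    ]


# A job pinned to an older (hypothetical) version keeps pointing at its own jar,
# so releasing 0.2.0 does not change what that job runs.
old_jar = versioned_jar("refinery-job", "0.1.0")
new_jar = versioned_jar("refinery-job", "0.2.0")

# Hypothetical main class and arguments for a migrated Spark3 job.
cmd = spark_submit_args(
    "org.example.SparkSQLNoCLIDriver",
    new_jar,
    ["-f", "query.hql"],
)
```

Because each job names its jar by version, a migrated job and a not-yet-migrated job can coexist: only the jobs whose command is switched to the new jar pick up the Spark3 and Scala 2.12 build.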
mforns moved this task from In Review to Done on the Data Pipelines board.
JAllemandou renamed this task from Plan spark3 migration - possibly incrementally to Spark3 migration - Currently existing airflow jobs. Jun 14 2022, 7:36 AM
JAllemandou moved this task from In Progress to Done on the Data-Engineering-Kanban board.
JAllemandou moved this task from Done to In Review on the Data Pipelines board.
JAllemandou updated the task description.