Spark3 migration - Currently existing airflow jobs
Closed, Resolved (Public)

Assigned To

Authored By

	JAllemandou
	Apr 27 2022, 6:52 AM

Details

	Title	Reference	Author	Source Branch	Dest Branch
	Update spark3 analytics for SparkSQLNoCLIDriver	repos/data-engineering/airflow-dags!62	joal	spark3_sql	main


Related Objects

Status	Assigned	Task
Open	None	T291464 Upgrade analytics-hadoop to Spark 3 + scala 2.12
Resolved	JAllemandou	T306955 Spark3 migration - Currently existing airflow jobs
Open	None	T291386 Upgrade Refinery Jobs to Spark 3
Open	None	T209453 Refine: Use Spark SQL instead of Hive JDBC
Open	None	T307040 Propagate field descriptions from event schemas to Hive event tables
Open	None	T255818 Refine drops $schema field values
Open	None	T259924 HiveExtensions.convertToSchema does not properly convert arrays of structs
Open	None	T366487 Event Platform schemas should not support type changes to structs as array element or map value types

Event Timeline

JAllemandou created this task. Apr 27 2022, 6:52 AM

Restricted Application added a subscriber: Aklapper. Apr 27 2022, 6:52 AM

JAllemandou renamed this task from Plan spark3 migration - possibly incrementally to Plan spark3 jobs migration - possibly incrementally. Apr 27 2022, 6:53 AM

JAllemandou renamed this task from Plan spark3 jobs migration - possibly incrementally to Plan spark3 migration - possibly incrementally.

JAllemandou added a parent task: T291464: Upgrade analytics-hadoop to Spark 3 + scala 2.12. Apr 27 2022, 6:55 AM

JAllemandou added a subtask: T291386: Upgrade Refinery Jobs to Spark 3. Apr 27 2022, 7:03 AM

JAllemandou moved this task from Next Up to In Progress on the Data-Engineering-Kanban board. Apr 27 2022, 7:55 AM

BTullis subscribed. May 4 2022, 4:16 PM

JArguello-WMF added a project: Data Pipelines. May 11 2022, 6:37 PM

JArguello-WMF moved this task from Incoming (new tickets) to Transform on the Data-Engineering board.

mforns moved this task from Backlog to Estimated on the Data Pipelines board. May 23 2022, 3:45 PM

Decisions for Spark3:

We're gonna merge and release the refinery-source patch bumping Spark and Scala as is, changing the refinery-source version to 0.2.0 (not all jobs have been tested; the list is documented in the commit message).
We're gonna use this new refinery-source release to migrate existing Airflow jobs to Spark3, using the SparkSQLNoCLIDriver in cluster mode instead of the Skein client-mode deploy strategy. Some Airflow hacking might be needed here.
Merging the refinery-source code doesn't impact already running jobs, as we reference jars by version. However, it means that any new change to Scala code needs to be done in Scala 2.12, and the related jobs need to be migrated to Spark3 (and therefore to Airflow). This should push us to migrate to Airflow faster :)

Let's do it!
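
For illustration only, here is a rough sketch of what "reference jars by version" and "cluster mode" could look like when assembling a submit command. The launcher name `spark3-submit`, the HDFS paths, the artifact name, the package prefix of the driver class, and the `-f` argument are all assumptions made for this example, not taken from the actual airflow-dags code; only the 0.2.0 version and the standard `--master`, `--deploy-mode` and `--class` options come from the discussion above or from spark-submit itself.

```python
import shlex
from typing import List


def versioned_jar(artifact: str, version: str,
                  base_dir: str = "hdfs:///hypothetical/artifacts") -> str:
    """Build a jar path that pins an explicit version.

    Because the version is part of the path, releasing a newer jar does
    not change what an already deployed job runs. base_dir is a
    hypothetical location used only for this sketch.
    """
    return f"{base_dir}/{artifact}-{version}.jar"


def build_submit_command(jar: str, main_class: str, query_file: str,
                         spark_submit: str = "spark3-submit") -> List[str]:
    """Assemble a spark-submit style command that runs the driver on YARN.

    --deploy-mode cluster makes the driver run inside the cluster rather
    than on the submitting host. The launcher name and the trailing
    "-f <file>" application arguments are assumptions for this example.
    """
    return [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        jar,
        "-f", query_file,
    ]


if __name__ == "__main__":
    # Hypothetical artifact name and class prefix; 0.2.0 is the version
    # mentioned in the decisions above.
    jar = versioned_jar("refinery-job", "0.2.0")
    cmd = build_submit_command(
        jar,
        "example.SparkSQLNoCLIDriver",
        "hdfs:///hypothetical/queries/example.hql",
    )
    print(shlex.join(cmd))
```

Keeping the version in the jar path is what lets older jobs keep running against their pinned release while newly migrated jobs point at the Spark3 build.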

mforns moved this task from Estimated to In Review on the Data Pipelines board. Jun 13 2022, 3:40 PM

mforns moved this task from In Review to Done on the Data Pipelines board.

JAllemandou renamed this task from Plan spark3 migration - possibly incrementally to Spark3 migration - Currently existing airflow jobs. Jun 14 2022, 7:36 AM

JAllemandou moved this task from In Progress to Done on the Data-Engineering-Kanban board.

JAllemandou moved this task from Done to In Review on the Data Pipelines board.

JAllemandou updated the task description.

JAllemandou moved this task from Done to In Code Review on the Data-Engineering-Kanban board. Jun 14 2022, 7:38 AM

JAllemandou moved this task from In Code Review to Ready to Deploy on the Data-Engineering-Kanban board.

JAllemandou moved this task from Ready to Deploy to Done on the Data-Engineering-Kanban board. Jun 30 2022, 8:14 AM

JArguello-WMF closed this task as Resolved. Jul 5 2022, 3:42 PM