Title | Reference | Author | Source Branch | Dest Branch
---|---|---|---|---
Update spark3 analytics for SparkSQLNoCLIDriver | repos/data-engineering/airflow-dags!62 | joal | spark3_sql | main
Status | Assigned | Task
---|---|---
Open | None | T291464 Upgrade analytics-hadoop to Spark 3 + scala 2.12
Resolved | JAllemandou | T306955 Spark3 migration - Currently existing airflow jobs
Open | None | T291386 Upgrade Refinery Jobs to Spark 3
Open | None | T209453 Refine: Use Spark SQL instead of Hive JDBC
Open | None | T307040 Propagate field descriptions from event schemas to Hive event tables
Open | None | T255818 Refine drops $schema field values
Open | None | T259924 HiveExtensions.convertToSchema does not properly convert arrays of structs
Open | None | T366487 Event Platform schemas should not support type changes to structs as array element or map value types
Event Timeline
Decisions for Spark3:
- We're going to merge and release the refinery-source patch bumping Spark and Scala as is, changing the refinery-source version to 0.2.0 (not all jobs have been tested; the list is documented in the commit message).
- We're going to use this new refinery-source release to migrate existing Airflow jobs to Spark3, using the SparkSQLNoCLIDriver in cluster mode instead of the Skein client-mode deploy strategy. Some Airflow hacking might be needed here.
- Merging the refinery-source code doesn't impact already running jobs, as we reference jars by version. However, it means that any new change to Scala code needs to be done in Scala 2.12, and the related jobs need to be migrated to Spark3 (and therefore to Airflow). This should push us to migrate to Airflow faster :)
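
To make the second and third points concrete, here is a minimal sketch of how a job could be submitted in cluster mode while pinning the jar by version. It is only an illustration, not the actual airflow-dags code: the helper name `build_spark_submit_args`, the artifact path, the fully qualified class name and the `-f` query-file argument are all assumptions. Only the `spark-submit` flags (`--master`, `--deploy-mode`, `--class`) are standard Spark options.

```python
from typing import List


def build_spark_submit_args(
    jar_version: str,
    driver_class: str,
    app_args: List[str],
) -> List[str]:
    """Build a spark-submit command line for a cluster-mode job.

    The jar is referenced by explicit version, so releasing a newer
    refinery-source does not change what an already deployed job runs.
    """
    # Hypothetical artifact location; the real path is not given in this task.
    jar = f"hdfs:///path/to/artifacts/refinery-job-{jar_version}-shaded.jar"
    return [
        "spark-submit",
        "--master", "yarn",
        # Cluster mode: the driver runs on the cluster, not on the launcher host.
        "--deploy-mode", "cluster",
        "--class", driver_class,
        jar,
        *app_args,
    ]


# Assumed class name and arguments, for illustration only.
args = build_spark_submit_args(
    jar_version="0.2.0",
    driver_class="org.wikimedia.analytics.refinery.job.SparkSQLNoCLIDriver",
    app_args=["-f", "hdfs:///path/to/query.hql"],
)
print(" ".join(args))
```

A job migrated later would only change `jar_version`; jobs still pointing at an older version keep running against the older jar.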