In our Iceberg Working Session we ran out of time before discussing bumping Spark; however, there was async support for it.
Our current production version of Spark, 3.1, is ‘deprecated’ on Iceberg's support matrix, and there is talk of dropping support. Update: support has been dropped as of Iceberg 1.4.0.
Options:
a) The Spark community released 3.4.0 on April 13 2023. Iceberg just released version 1.3.0 with support for Spark 3.4. This is the bleeding edge, and as with any .0 feature release there is a risk of bugs in both Spark and Iceberg. We would have to bump Iceberg as well. We do get the longest runway. Update: Spark 3.4.1 is now available. Second update: Spark 3.5.0 is also now available.
b) The Spark community released 3.3.2 on Feb 17 2023. Iceberg has supported Spark 3.3 since 0.14.0. We already have Iceberg 1.2.1, which supports Spark 3.3, and 3.3.2 is stable and well tested by now. We get a shorter runway with this.
Whether we bump to the 3.3, 3.4, or 3.5 line, we do gain a bunch of perf improvements that will go well with T332765.
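To make the trade-off between the options easier to check, here is a minimal Python sketch that encodes only the support windows stated in this task (Spark 3.3 since Iceberg 0.14.0, Spark 3.4 since Iceberg 1.3.0, Spark 3.1 dropped as of Iceberg 1.4.0). The names `SUPPORT`, `is_supported`, and `needs_iceberg_bump` are made up for illustration. Spark 3.5 is left out because this task does not say which Iceberg release supports it, and the lower bound for Spark 3.1 is treated as open because it is not stated here either.

```python
# Support windows as stated in this task; NOT a copy of Iceberg's full matrix.
# Each entry: Spark line -> (first Iceberg release with support or None if not
# stated here, first Iceberg release WITHOUT support or None if still supported).
SUPPORT = {
    "3.1": (None, (1, 4, 0)),   # dropped as of Iceberg 1.4.0
    "3.3": ((0, 14, 0), None),  # supported since Iceberg 0.14.0
    "3.4": ((1, 3, 0), None),   # supported since Iceberg 1.3.0
}


def parse(version: str) -> tuple:
    """Turn '1.2.1' into (1, 2, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_supported(spark_line: str, iceberg_version: str) -> bool:
    """True if the Iceberg release falls inside the window for that Spark line."""
    first, dropped = SUPPORT[spark_line]
    v = parse(iceberg_version)
    if first is not None and v < first:
        return False
    if dropped is not None and v >= dropped:
        return False
    return True


def needs_iceberg_bump(spark_line: str, current_iceberg: str = "1.2.1") -> bool:
    """True if our current Iceberg (1.2.1 per this task) cannot run on that line."""
    return not is_supported(spark_line, current_iceberg)
```

With these inputs, `needs_iceberg_bump("3.3")` is false and `needs_iceberg_bump("3.4")` is true, which matches options b) and a) respectively.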
Migration guides:
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-33-to-34
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-34-to-35
Since the migration guides do list breaking changes in areas like ADD JAR syntax and CSV output defaults (I originally thought there were none), we should consider making the new Spark version available alongside the current one for a while, perhaps as spark3_4-submit, etc.
In this task we should:
- Decide whether to bump to the Spark 3.3.X, 3.4.X, or 3.5.X line.
- Decide whether to remove the current Spark 3.1.2 or keep it available alongside the new version for a while.
- Install it on the test cluster and run sanity tests.
- Install it on the main cluster.
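As a starting point for the sanity tests, a minimal Python sketch follows. It assumes an existing PySpark `SparkSession` is passed in (for example the `spark` object in a `pyspark` shell started from the new install). The function names are hypothetical, and the checks are only the cheapest ones that need no cluster-specific tables.

```python
def in_release_line(version: str, line: str) -> bool:
    """True if a full version such as '3.3.2' belongs to a line such as '3.3'."""
    # Compare major and minor components, so '3.10.0' does not match '3.1'.
    return version.split(".")[:2] == line.split(".")[:2]


def run_sanity_checks(spark, expected_line: str) -> None:
    """Run a few cheap checks against an existing SparkSession."""
    # 1. The session really runs the version we meant to install.
    assert in_release_line(spark.version, expected_line), spark.version
    # 2. A trivial distributed job completes.
    assert spark.range(1000).count() == 1000
    # 3. The SQL path works end to end.
    assert spark.sql("SELECT 1 AS one").collect()[0]["one"] == 1
```

Usage would look like `run_sanity_checks(spark, "3.3")`. Reading and writing an Iceberg table, plus the ADD JAR and CSV behaviours flagged in the migration guides, would be worth adding, but the table and jar names are specific to our clusters and are not guessed here.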