We currently use conda-analytics to ship our PySpark distribution.
As part of the Spark upgrade in production, we will need to update this environment to PySpark version 3.5.3.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T333013 [Iceberg Migration] Apache Iceberg Migration |
| Open | | None | T338057 Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 |
| Resolved | | BTullis | T380035 Create Spark docker images for version 3.5.3 |
| Open | | None | T354733 Create Conda Analytics environment including spark version 3.5.3 |
btullis updated https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/56
Upgrade spark and iceberg
We have a debian package of conda-analytics 0.0.37 available here: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/5821/download
This includes Spark 3.5.3 and Iceberg 1.7.0 so it's ready for testing.
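For anyone testing the package, a quick sanity check from inside the activated environment might look like the sketch below. It only reads the installed `pyspark` package metadata and compares it with the 3.5.3 version named in this task. The helper names `installed_version` and `check_pyspark` are made up for illustration, and it does not check the Iceberg version, since how Iceberg is bundled in the environment is not described here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Target version taken from this task; adjust if the environment changes.
EXPECTED_PYSPARK = "3.5.3"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pyspark(expected: str = EXPECTED_PYSPARK) -> bool:
    """Report whether the installed pyspark matches the expected version."""
    found = installed_version("pyspark")
    if found is None:
        print("pyspark is not installed in this environment")
        return False
    ok = found == expected
    status = "OK" if ok else "MISMATCH"
    print(f"pyspark {found} (expected {expected}): {status}")
    return ok


if __name__ == "__main__":
    check_pyspark()
```

This only confirms that the Python package is at the expected version; it says nothing about whether jobs actually run against the cluster, so it complements real workload testing rather than replacing it.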
Upgrading to Spark 3.5 should allow us to remove the version specs and pins for:
It's possible that new specs and pins for later versions of these packages will be needed, but the best way to figure that out is to have an analyst test out a pre-release version of the environment.