
Create Conda Analytics environment including spark version 3.5.3
Open, High, Public

Description

We currently use conda-analytics to ship our pyspark distribution.

As part of the Spark upgrade in production, we will need to update this environment to pyspark version 3.5.3.

Event Timeline

Gehel triaged this task as High priority. Jan 10 2024, 9:50 AM
Gehel removed a project: Data-Engineering.
Gehel moved this task from Incoming to Software Upgrades on the Data-Platform-SRE board.
Gehel lowered the priority of this task from High to Medium. Jan 10 2024, 9:53 AM
BTullis raised the priority of this task from Medium to High. Nov 8 2024, 11:06 AM
BTullis renamed this task from Create Conda Analytics environment for new Spark version to Create Conda Analytics environment including spark version 3.5.3. Nov 15 2024, 11:42 AM
BTullis updated the task description.

We have a Debian package of conda-analytics 0.0.37 available here: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/5821/download

This includes Spark 3.5.3 and Iceberg 1.7.0 so it's ready for testing.

BTullis subscribed.

Unassigning myself, since we are not actively working on this.

Upgrading to Spark 3.5 should allow us to remove the version specs and pins for:

  • Pandas (T370705, T370707)
  • NumPy (T370710)
  • PyArrow (since I believe that its pin is for the current Pandas version)

It's possible that new specs and pins for later versions of these packages will be needed, but the best way to figure that out is to have an analyst test out a pre-release version of the environment.
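
As a starting point for that kind of testing, here is a minimal sketch that reports which versions of the packages mentioned above are installed in the active environment. It uses only the standard library's `importlib.metadata`. The package list and the helper name `installed_versions` are illustrative assumptions, not part of conda-analytics.

```python
from importlib.metadata import PackageNotFoundError, version

# Distributions named in this task; the list is an assumption, adjust as needed.
PACKAGES = ("pyspark", "pandas", "numpy", "pyarrow")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Record the gap instead of failing, so one report covers everything.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this inside the pre-release environment and comparing the output against the existing pins would show which packages actually changed once the specs are removed.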