Taken from a decision in a comment below:
Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.
Create a new Debian-packaged 'conda base env', with the intention of using it to replace anaconda-wmf as described in T302819: Replace anaconda-wmf with smaller, non-stacked Conda environments. We can install it on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.
For now, we will focus on using the new conda base env to upgrade to Spark 3. Later, we will pursue replacing anaconda-wmf with this new conda base env in T302819.
To do this, we need to:
- Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with the latest Spark 3 (see the spark-env.sh sketch after this list).
- Create a new 'debian' repo that uses workflow_utils conda-dist to create a Debian-packaged conda environment with Python 3.9 and PySpark 3 installed. The workflow_utils README has a good example of how to set up the Python env repo for use with conda-dist. This repo and its output conda env Debian package will hopefully one day replace anaconda-wmf, so we should come up with a good name for it!
- Create the repo with a conda-environment.yaml and other dependency spec files so that conda-dist can automate the env packaging.
- Create a debian/ dir (with scripts or instructions) that automates running conda-dist and using the output conda dist env to build a Debian package of it. This is basically what the unpack_conda_prep_tar_into_debian_tree and build_debian_package bash functions do in the airflow env debianization (see the packaging sketch after this list).
- When ready, build the .deb and import to apt.wikimedia.org.
- Write new or adapt existing puppetization of Spark to work with Spark installed in this conda base env. This includes automating the upload of the Spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configuration, making sure that Hadoop and Hive work properly, etc. (see the deployment sketch after this list).
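
For the 'provided' Hadoop and Hive jars, here is a minimal spark-env.sh sketch, assuming we ship a "Hadoop free" Spark 3 build and lean on the jars already installed on cluster nodes. The Hive lib path below is an assumption:

```bash
# spark-env.sh sketch: use the node's Hadoop installation as the 'provided'
# classpath for a "Hadoop free" Spark 3 build.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"

# If the cluster's Hive jars should also be treated as provided, something
# like this could append them (the path is an assumption):
if [ -d /usr/lib/hive/lib ]; then
  export SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:/usr/lib/hive/lib/*"
fi
```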
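For the repo and its debian/ dir, here is a rough packaging sketch. The env name ('conda-base-env' is only a placeholder, since we still need to pick one), the version pins, the tarball path, and the exact conda-dist invocation are all assumptions; the workflow_utils README and the airflow env debianization are the real references:

```bash
#!/bin/bash
# Sketch of the spec file plus the debian/ packaging helper.

# conda-environment.yaml that the repo would carry:
cat > conda-environment.yaml <<'EOF'
name: conda-base-env
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pyspark=3
EOF

# Build the packed conda env (invocation assumed; see the workflow_utils README).
conda-dist

# Unpack the packed env into the debian tree and build the .deb, mirroring
# unpack_conda_prep_tar_into_debian_tree and build_debian_package. Assumes
# debian/control, debian/rules, etc. already exist in the repo.
mkdir -p debian/conda-base-env/opt/conda-base-env
tar -xzf dist/conda-base-env*.tgz -C debian/conda-base-env/opt/conda-base-env
dpkg-buildpackage -us -uc -b
```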
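For the puppetization, here is a deployment sketch of the steps it would need to automate. The install path, conf dir, HDFS location, and version below are placeholders/assumptions; spark.yarn.archive is the standard Spark setting for pointing YARN at the uploaded assembly:

```bash
#!/bin/bash
# Sketch of the deploy-time steps: build the assembly, upload it to HDFS,
# and point spark-defaults.conf at it.

SPARK_HOME=/opt/conda-base-env/lib/python3.9/site-packages/pyspark  # assumed install path
SPARK_CONF_DIR=/etc/spark3/conf                                     # assumed conf dir
SPARK_VERSION=3.1.2                                                 # placeholder version

# Bundle the Spark jars and upload them to HDFS once per Spark version.
zip -q -j "spark-${SPARK_VERSION}-assembly.zip" "${SPARK_HOME}/jars/"*
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put -f "spark-${SPARK_VERSION}-assembly.zip" /user/spark/share/lib/

# Point spark-defaults.conf at the uploaded assembly so YARN executors use it.
echo "spark.yarn.archive hdfs:///user/spark/share/lib/spark-${SPARK_VERSION}-assembly.zip" \
  >> "${SPARK_CONF_DIR}/spark-defaults.conf"
```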