
Install spark3 in analytics clusters
Open, High, Public

Description

Taken from a decision in a comment below:

Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.

Create a new debian packaged 'conda base env' with the intention of using this to replace anaconda-wmf as described in T302819: Replace anaconda-wmf with smaller, non-stacked Conda environments. We can install all of this on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.

For now, we will focus on using the new conda base env to upgrade to Spark 3. Later, we will pursue replacing anaconda-wmf with this new conda base env in T302819.

To do this, we need to:

  • Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with latest Spark 3
  • Create a new 'debian' repo that uses workflow_utils conda-dist to create a debian packaged conda environment with python3.9 and pyspark 3 installed. The workflow_utils README has a good example of how to set up the python env repo for use with conda-dist. This repo and output conda env debian package will hopefully one day replace anaconda-wmf, so we should come up with a good name for it!
    1. Create the repo with conda-environment.yaml and other dependency spec files, so that conda-dist can automate the env packaging.
    2. Create a debian/ dir (with scripts or instructions) that can automate running conda-dist and using the output conda dist env to create a debian package of it. This is basically what the unpack_conda_prep_tar_into_debian_tree and build_debian_package bash functions do in the airflow env debianization (a rough sketch of this flow follows the list below).
    3. When ready, build the .deb and import to apt.wikimedia.org.
  • Write new puppetization of Spark, or adapt the existing one, to work with Spark installed in this conda base env. This includes automating the upload of the Spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configuration, making sure that Hadoop and Hive will work properly, etc.
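For illustration, a rough sketch of the debianization flow described in items 1–3 above, assuming a hypothetical repo/package name of conda-base-env and a conda-environment.yaml at the repo root (the conda-dist invocation, output paths, and file names are assumptions, not final; see the workflow_utils README for actual usage):

  # Build a distributable conda env from conda-environment.yaml using
  # workflow_utils' conda-dist (arguments omitted here; see its README).
  conda-dist

  # Unpack the resulting env tarball (output path and name assumed) into a
  # debian package tree, mirroring unpack_conda_prep_tar_into_debian_tree
  # from the airflow env debianization.
  mkdir -p debian-tree/usr/lib/conda-base-env debian-tree/DEBIAN
  tar -xzf dist/conda-base-env-*.tgz -C debian-tree/usr/lib/conda-base-env

  # Add DEBIAN/control metadata, build the .deb, and import it to
  # apt.wikimedia.org, mirroring build_debian_package.
  cp debian/control debian-tree/DEBIAN/control
  dpkg-deb --build debian-tree conda-base-env_0.0.1_amd64.deb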

Event Timeline

Looked into this a little bit today. Some context:

  • We create our own spark debian package using the Spark released distribution
  • We hack the Spark distribution to remove the included Hadoop dependencies, but leave others, like Hive, in place. (We could use the 'Hadoop free' distribution, but in previous tests it didn't work because it was missing other things, like Hive.)
  • For Python, Spark relies on the system-installed python (unless manually overridden via PYSPARK_PYTHON; see the example just after this list).
    • We work around this by including binary python dependencies for each major version of python we might encounter across the cluster, e.g. in the case of an OS upgrade.
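For context, that override is just an environment variable. For example, to make both the driver and the executors use the anaconda-wmf python instead of the system one (path shown matches the anaconda-wmf install location referenced later in this task):

  # Point PySpark at the anaconda-wmf python rather than the system python.
  export PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python
  export PYSPARK_DRIVER_PYTHON=/usr/lib/anaconda-wmf/bin/python
  pyspark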

This led me to investigate various options for installing Spark. Important info:

Which gave me an idea!

Instead of relying on the system python, we could do what we've begun to do elsewhere: rely on conda environments to standardize the version of python we are using. Doing so could get tricky if the user also wants to use our anaconda-wmf stacked conda environment support... but not if anaconda-wmf itself simply had our version of Spark installed.

I just tried this in a couple of conda envs, installing pyspark with pip and with conda. Both seem to work fine, and both also include the Spark Scala/Java dependencies and CLIs!
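For anyone who wants to reproduce the quick test, a minimal sketch (the env name and pyspark version are just examples):

  # Create a throwaway conda env and install pyspark from PyPI.
  conda create -y -n spark3-test python=3.9
  conda activate spark3-test
  pip install pyspark==3.1.2

  # The pip package ships the JVM-side jars and the usual CLIs as well.
  pyspark --version
  spark-shell --version
  spark-submit --version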

Spark R is not included. There do seem to be some conda packages for Spark R out there. I've always had a hard time maintaining Spark R support, since our team (and most of our users?) don't use it. I'd be willing to drop official support for Spark R, and ask users to install it themselves into their conda environments if they need to use it.

So, I propose seeing if we can install Spark 3 into our anaconda-wmf package. This may take a bit of hackery to make sure the Hadoop and Hive jars and/or classpaths are set properly, so that Spark uses our system-installed Hadoop and Hive jars; details TBD.

If this works, we'll be able to manage Spark and Python dependencies much more easily, simply by upgrading them in anaconda-wmf.
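One known approach for the 'provided' Hadoop jars, taken from the upstream 'Hadoop free' build docs, is to point Spark at the system Hadoop via spark-env.sh; whether this is sufficient for our Hive integration is exactly the TBD part:

  # spark-env.sh (sketch): have Spark pick up the system-installed Hadoop jars.
  export SPARK_DIST_CLASSPATH=$(hadoop classpath)
  # Hive jars and hive-site.xml would still need to be wired in separately (TBD).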

Launching e.g. a Spark 3 pyspark shell would then be done like:

/usr/lib/anaconda-wmf/bin/pyspark

Launching a Spark 3 scala shell:

/usr/lib/anaconda-wmf/bin/spark-shell

etc.

If you had your own activated and stacked conda env, then just: pyspark, spark-submit, etc.

That is interesting. Would this be a stopgap until Data Engineering upgrades to a puppetized Spark 3? I imagine so, since there are non-conda prod use cases of Spark.

Installing it in the base image might be confusing, since there would be multiple spark binaries on the PATH after activating a stacked conda env. It might make sense to just have a wiki page describing how to install/configure Spark 3 in your own conda env on a per-need basis.

odimitrijevic moved this task from Incoming to Transform on the Data-Engineering board.

Would this be a stopgap until Data Engineering upgrades to a puppetized Spark 3?

No, I think we would also use the anaconda-wmf spark3 installation. And use anaconda-wmf as the default python whenever we need to?

Being able to manage python and spark and other dependency versions in a self contained environment will make a lot of maintenance tasks much easier.

Installing it in the base image might be confusing, since there would be multiple spark binaries on the PATH

I think this is ok. This is true for python and pip or any other package as well. The stacked env takes precedence.

Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.

Decisions:

To do this, we need to do the following:

  • Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with latest Spark 3 distribution.
  • Create a new 'debian' repo that uses workflow_utils conda-dist to create a debian packaged conda environment with python3.9 and pyspark 3 installed.
  • Write new puppetization of Spark, or adapt the existing one, to work with Spark installed in this conda base env. This includes automating the upload of the Spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configuration, making sure that Hadoop and Hive will work properly, etc. (a rough sketch of the assembly piece follows this list).
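As a rough illustration of the assembly-upload piece mentioned above (the HDFS path, file name, and version are hypothetical, not the final puppetization):

  # Publish the Spark jars from the conda env to HDFS so YARN executors
  # don't need a local Spark installation.
  hdfs dfs -mkdir -p /user/spark/share/lib
  hdfs dfs -put spark-3.1.2-assembly.zip /user/spark/share/lib/

  # spark-defaults.conf would then reference it, e.g.:
  # spark.yarn.archive  hdfs:///user/spark/share/lib/spark-3.1.2-assembly.zip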

We can install all of this on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.

@Ottomata Thanks for this update! The differential privacy project is currently using a jerry-rigged version of Spark 3 to run our software packages, so please let me know (either in this thread on phab or via slack) when you've been able to install Spark 3 on anaconda-wmf.

PS: I don't know if this will be at all useful, but if you want to take a look at how we've gotten Spark 3 working you can find the repo where we do it here.

Change 791323 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Add profile::hadoop:spark3 class and resources

https://gerrit.wikimedia.org/r/791323

Change 791323 merged by Ottomata:

[operations/puppet@production] Add profile::hadoop:spark3 class and resources

https://gerrit.wikimedia.org/r/791323

Change 791457 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure spark3 conf dir exists

https://gerrit.wikimedia.org/r/791457

Change 791457 merged by Ottomata:

[operations/puppet@production] Ensure spark3 conf dir exists

https://gerrit.wikimedia.org/r/791457

@Antoine_Quhen @JAllemandou /etc/spark3/conf is now on an-launcher1002 and an-test-client1001 :) thank you!

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:08:31Z] <aqu@deploy1002> Started deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86]

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:08:40Z] <aqu@deploy1002> Finished deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86] (duration: 00m 08s)

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:12:33Z] <aqu@deploy1002> Started deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86]

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:12:42Z] <aqu@deploy1002> Finished deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86] (duration: 00m 08s)

Experimental Spark3 is in use for 1 job triggered by Airflow: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/experimental_spark_3_dag_default_args.py

It's using /etc/spark3/conf, and the Spark 3 provided by the pyspark package (a dependency of Airflow).
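In other words, the job uses the Spark 3 bundled with the Airflow venv's pyspark package, but reads the puppetized cluster config. Roughly (the venv path and job name here are hypothetical):

  # Use the pyspark-provided Spark 3 from the Airflow venv,
  # but load the puppetized configuration from /etc/spark3/conf.
  export SPARK_CONF_DIR=/etc/spark3/conf
  /usr/lib/airflow/bin/spark-submit --master yarn my_experimental_job.py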

Change 805855 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Include spark3 config on all hadoop client nodes

https://gerrit.wikimedia.org/r/805855

Change 805855 merged by Ottomata:

[operations/puppet@production] Include spark3 config on all hadoop client nodes

https://gerrit.wikimedia.org/r/805855

Change 813278 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env

https://gerrit.wikimedia.org/r/813278

Change 813278 merged by Btullis:

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env

https://gerrit.wikimedia.org/r/813278

Change 821278 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the spark3 profile

https://gerrit.wikimedia.org/r/821278

Change 821278 merged by Btullis:

[operations/puppet@production] Fix the spark3 profile

https://gerrit.wikimedia.org/r/821278

Change 821695 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env V2

https://gerrit.wikimedia.org/r/821695