
Install spark3 in analytics clusters
Closed, ResolvedPublic9 Estimated Story Points

Description

Taken from a decision in a comment below:

Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.

Create a new debian packaged 'conda base env' with the intention of using this to replace anaconda-wmf as described in T302819: Replace anaconda-wmf with smaller, non-stacked Conda environments. We can install all of this on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.

For now, we will focus on using the new conda base env to upgrade to Spark 3. Later, we will pursue replacing anaconda-wmf with this new conda base env in T302819.

To do this, we need to:

  • Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with latest Spark 3
  • Create a new 'debian' repo that uses workflow_utils conda-dist to create a debian packaged conda environment with python3.9 and pyspark 3 installed. The workflow_utils README has a good example of how to set up the python env repo for use with conda-dist. This repo and output conda env debian package will hopefully one day replace anaconda-wmf, so we should come up with a good name for it!
    1. Create the repo with conda-environment.yaml and other dependency spec files so that conda-dist can automate the env packaging.
    2. Create a debian/ dir (with scripts or instructions) that can automate running conda-dist and turning the output conda dist env into a debian package (a rough sketch follows this list). This is basically what the unpack_conda_prep_tar_into_debian_tree and build_debian_package bash functions do in the airflow env debianization.
    3. When ready, build the .deb and import to apt.wikimedia.org.
  • Write new or adapt puppetization of spark to work with Spark installed in this conda base env. This includes automating uploading of the spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configurations, making sure that Hadoop and Hive will work properly, etc.
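
A rough sketch of the packaging step (item 2 above), assuming hypothetical flags and the package name that was eventually chosen later in this task (conda-analytics); the real commands live in the workflow_utils README and the repo's debian/ scripts:

# Sketch only: flag names and paths are illustrative, not the real tooling's CLI.
# 1. Build a relocatable conda env from conda-environment.yaml with conda-dist.
conda-dist --output-dir ./dist

# 2. Unpack the resulting dist tarball into a debian package tree, mirroring
#    what unpack_conda_prep_tar_into_debian_tree does for the airflow env.
mkdir -p debian/tree/usr/lib/conda-analytics
tar -xzf dist/conda-analytics-*.tgz -C debian/tree/usr/lib/conda-analytics

# 3. Build the .deb (debian/tree is assumed to also contain DEBIAN/control),
#    mirroring build_debian_package, then import it into apt.wikimedia.org.
dpkg-deb --build debian/tree conda-analytics_0.0.1_amd64.deb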

Event Timeline


Looked into this a little bit today. Some context:

  • We create our own spark debian package from the Spark released distribution.
  • We hack the Spark distribution to remove the bundled Hadoop dependencies, but leave others like Hive. (We could use the Hadoop-less distribution, but in previous tests it didn't work because it was missing other things, like Hive.)
  • For Python, it relies on the system-installed Python (unless manually overridden via PYSPARK_PYTHON; a sketch follows this list).
    • We work around this by including binary Python dependencies for each major Python version we might use across the cluster, e.g. in the case of an OS upgrade.
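
For reference, the PYSPARK_PYTHON override mentioned above is just an environment variable; a minimal sketch, assuming the anaconda-wmf path used elsewhere in this task:

# Point Spark at a specific interpreter instead of the system python
# (applies to both driver and executors unless overridden separately).
export PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/lib/anaconda-wmf/bin/python

# Any pyspark/spark-submit launched after this picks up that interpreter.
pyspark --master yarn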

This led me to investigate various options for installing Spark. Important info:

Which gave me an idea!

Instead of relying on the system Python, we could do what we've begun to do elsewhere: rely on conda environments to standardize the version of Python we are using. Doing so could get tricky if the user also wants to use our anaconda-wmf stacked conda environment support...but not if anaconda-wmf itself just had our version of Spark installed.

I just tried this in a couple of conda envs, installing pyspark once with pip and once with conda. Both seem to work fine, and both also include the Spark Scala/Java dependencies and CLIs!
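
Roughly the kind of thing I tried (env names and Python version are illustrative):

# Install pyspark from conda-forge: pulls in the Spark jars and CLIs too.
conda create -n spark3-conda -c conda-forge python=3.9 pyspark
conda run -n spark3-conda spark-submit --version

# Install pyspark with pip inside a plain conda env.
conda create -n spark3-pip python=3.9 pip
conda run -n spark3-pip pip install pyspark
conda run -n spark3-pip spark-submit --version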

Spark R is not included. There do seem to be some conda packages for Spark R out there. I've always had a hard time maintaining Spark R support, since our team (and most of our users?) don't use it. I'd be willing to drop official support for Spark R, and ask users to install it themselves into their conda environments if they need to use it.

So, I propose seeing if we can install Spark 3 into our anaconda-wmf package. This may take a little bit of hackery to make sure the Hadoop and Hive jars and/or classpaths are set properly and use our system-installed Hadoop and Hive jars; TBD.
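
One possible approach for that hackery: a "Hadoop-free" Spark reads SPARK_DIST_CLASSPATH, so spark-env.sh could point the conda-installed Spark at the system jars. A sketch, with the Hive jar path being an assumption:

# spark-env.sh sketch: have a hadoop-less Spark use the system-installed jars.
# 'hadoop classpath' prints the classpath of the installed Hadoop client.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"

# Also append the system Hive jars if Hive support is needed (path illustrative).
export SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:/usr/lib/hive/lib/*"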

If this works, we'll be able to manage Spark and Python dependencies much more easily, just by upgrading them in anaconda-wmf.

Launching e.g. a Spark 3 pyspark shell would then be done like:

/usr/lib/anaconda-wmf/bin/pyspark

Launching a Spark 3 scala shell:

/usr/lib/anaconda-wmf/bin/spark-shell

etc.

If you had your own activated and stacked conda env, then just: pyspark, spark-submit, etc.

That is interesting. Would this be a stopgap until data eng upgrades to a puppetized Spark 3? I imagine so, since there are non-conda prod use cases of Spark.

Installing it in the base image might be confusing, since there will be multiple spark binaries on the PATH after activating a stacked conda env. It might make more sense to just have a wiki page describing how to install/configure Spark 3 in your own conda env on a per-need basis.

this would be a stopgap until data eng is upgrading to a puppetized spark3?

No, I think we would also use the anaconda-wmf spark3 installation. And use anaconda-wmf as the default python whenever we need to?

Being able to manage python and spark and other dependency versions in a self contained environment will make a lot of maintenance tasks much easier.

Installing in the base image might be confusing since there will be multiple spark binaries on the PATH

I think this is ok. This is true for python and pip or any other package as well. The stacked env takes precedence.
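
A generic conda illustration of that precedence (anaconda-wmf ships its own stacked-env tooling, so treat the activation path here as an assumption):

# Activate a user env stacked on top of the base env.
source /usr/lib/anaconda-wmf/etc/profile.d/conda.sh   # path is illustrative
conda activate --stack my-project-env

# The stacked env is first on PATH, so its binaries win; anything it does not
# provide (e.g. pyspark from the base env) still resolves further down the PATH.
which python     # -> .../envs/my-project-env/bin/python
which pyspark    # -> /usr/lib/anaconda-wmf/bin/pyspark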

Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.

Decisions:

  • Create a new debian packaged 'conda base env', with the intention of eventually using it to replace anaconda-wmf as described in T302819. For now, we will focus on using the new conda base env to upgrade to Spark 3; replacing anaconda-wmf will come later.

To do this, we need to do the following:

  • Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with latest Spark 3 distribution.
  • Create a new 'debian' repo that uses workflow_utils conda-dist to create a debian packaged conda environment with python3.9 and pyspark 3 installed.
  • Write new or adapt puppetization of spark to work with Spark installed in this conda base env. This includes automating uploading of the spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configurations, making sure that Hadoop and Hive will work properly, etc.

We can install all of this on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.
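
For the puppetization item above, the rendered configuration might end up looking roughly like this; all values are illustrative, and the real ones will come from puppet templates:

# Illustrative /etc/spark3/conf/spark-defaults.conf content, written by puppet
# in practice rather than by hand.
cat <<'EOF' | sudo tee /etc/spark3/conf/spark-defaults.conf
spark.master                     yarn
spark.sql.catalogImplementation  hive
# Pre-built archive of the Spark jars on HDFS, so jobs don't re-upload them.
spark.yarn.archive               hdfs:///user/spark/share/lib/spark3-assembly.zip
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
EOF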

@Ottomata Thanks for this update! The differential privacy project is currently using a jerry-rigged version of Spark 3 to run our software packages, so please let me know (either in this thread on phab or via slack) when you've been able to install Spark 3 on anaconda-wmf.

PS: I don't know if this will be at all useful, but if you want to take a look at how we've gotten Spark 3 working you can find the repo where we do it here.

Change 791323 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Add profile::hadoop:spark3 class and resources

https://gerrit.wikimedia.org/r/791323

Change 791323 merged by Ottomata:

[operations/puppet@production] Add profile::hadoop:spark3 class and resources

https://gerrit.wikimedia.org/r/791323

Change 791457 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure spark3 conf dir exists

https://gerrit.wikimedia.org/r/791457

Change 791457 merged by Ottomata:

[operations/puppet@production] Ensure spark3 conf dir exists

https://gerrit.wikimedia.org/r/791457

@Antoine_Quhen @JAllemandou /etc/spark3/conf is now on an-launcher1002 and an-test-client1001 :) thank you!

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:08:31Z] <aqu@deploy1002> Started deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86]

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:08:40Z] <aqu@deploy1002> Finished deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86] (duration: 00m 08s)

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:12:33Z] <aqu@deploy1002> Started deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86]

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:12:42Z] <aqu@deploy1002> Finished deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86] (duration: 00m 08s)

Experimental Spark3 is in use for 1 job triggered by Airflow: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/experimental_spark_3_dag_default_args.py

It's using /etc/spark3/conf, and Spark3 provided by pyspark (dependency of Airflow).
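
For reference, one way a venv-provided Spark 3 can pick up that puppet-managed config is via SPARK_CONF_DIR; a sketch, with an illustrative venv path:

# Point the pip-installed Spark 3 at the puppet-managed configuration.
export SPARK_CONF_DIR=/etc/spark3/conf

# Venv path is illustrative; spark-submit here comes from the pyspark package.
/srv/airflow/venv/bin/spark-submit --version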

Change 805855 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Include spark3 config on all hadoop client nodes

https://gerrit.wikimedia.org/r/805855

Change 805855 merged by Ottomata:

[operations/puppet@production] Include spark3 config on all hadoop client nodes

https://gerrit.wikimedia.org/r/805855

Change 813278 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env

https://gerrit.wikimedia.org/r/813278

Change 813278 merged by Btullis:

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env

https://gerrit.wikimedia.org/r/813278

Change 821278 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the spark3 profile

https://gerrit.wikimedia.org/r/821278

Change 821278 merged by Btullis:

[operations/puppet@production] Fix the spark3 profile

https://gerrit.wikimedia.org/r/821278

Change 821695 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env V2

https://gerrit.wikimedia.org/r/821695

EChetty set the point value for this task to 9.Aug 16 2022, 3:03 PM
EChetty moved this task from Discussed (Radar) to Sprint 00 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

pyspark 3 is now installed with conda. The pyspark package on conda-forge marks these as dependencies:

  • numpy >=1.7
  • pandas >=0.23.2
  • pyarrow >=1.0.0

That's 3 more than the pip package pulls in.
As we are using a conda environment, installing through conda is recommended.
The environment has grown from ~250MB to ~500MB.
I think that is ok, as it will be used by analysts.
What do you think?

+1

I think those are good deps to have. We’d probably add them to our ‘analytics base env’ anyway.
I’d pin them at more recent versions though, if we can.

Pinned at:

  • numpy=1.23.1
  • pandas=1.4.3
  • pyarrow=8.0.0
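
For reference, applying those pins in the env looks roughly like this (env name is illustrative; in practice the pins live in the environment spec file):

# Pin the extra runtime dependencies at the agreed versions.
conda install -n conda-analytics -c conda-forge \
    numpy=1.23.1 pandas=1.4.3 pyarrow=8.0.0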

+1

I have removed the previously created conda-base-env package from apt.wikimedia.org.

btullis@apt1001:~$ sudo -i reprepro remove buster-wikimedia conda-base-env
Exporting indices...
Deleting files no longer referenced...

Now adding the conda-analytics environment instead.

btullis@apt1001:~$ sudo -i reprepro includedeb buster-wikimedia `pwd`/conda-analytics-0.0.8_amd64.deb
Exporting indices...
Deleting files no longer referenced...

Verified that the new version of conda-analytics is available for install.

btullis@an-test-client1001:~$ apt-cache policy conda-analytics
conda-analytics:
  Installed: 0.0.7
  Candidate: 0.0.8
  Version table:
     0.0.8 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
 *** 0.0.7 100
        100 /var/lib/dpkg/status
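
Upgrading a client to the candidate is then the usual apt step (shown with plain apt; in practice this may go through the standard deployment tooling):

# Upgrade conda-analytics to the candidate version from apt.wikimedia.org.
sudo apt-get update
sudo apt-get install -y conda-analytics=0.0.8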

After a new round of tests and bugfixes (https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/tree/main/test_scripts), I am quite confident in our install of Spark 3.
So, I think it's safe for an SRE to:
1 - upgrade the conda-analytics package on the test cluster to 0.0.9
https://debmonitor.wikimedia.org/packages/conda-analytics
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/packages/185
2 - merge the puppet code to set up Spark 3 on the test cluster
(I have checked with pcc.)

Change 821695 merged by Btullis:

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env V2

https://gerrit.wikimedia.org/r/821695

I believe that we can close this one as resolved?

xcollazo claimed this task.

Change 901604 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the spark3 shuffle service jars to the yarn resourcemanager

https://gerrit.wikimedia.org/r/901604

Change 901670 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Upload the spark3-assemly file to HDFS on the test cluster

https://gerrit.wikimedia.org/r/901670

Change 901604 merged by Btullis:

[operations/puppet@production] Use the spark3 shuffle jars to yarn on a test host

https://gerrit.wikimedia.org/r/901604

Change 901670 abandoned by Btullis:

[operations/puppet@production] Upload the spark3-assemly file to HDFS on the test cluster

Reason:

Change of approach. We will be generating the assembly from GitLab-CI and uploading manually.

https://gerrit.wikimedia.org/r/901670
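
For reference, the manual generation and upload might look roughly like this; the jar directory, HDFS path, and Python version are assumptions, and the real build lives in GitLab-CI:

# Zip up the Spark 3 jars shipped in conda-analytics (path/version illustrative).
cd /usr/lib/conda-analytics/lib/python3.9/site-packages/pyspark
zip -q -r /tmp/spark3-assembly.zip jars/

# Upload to HDFS so spark.yarn.archive can point at it (path illustrative).
sudo -u hdfs hdfs dfs -mkdir -p /user/spark/share/lib
sudo -u hdfs hdfs dfs -put -f /tmp/spark3-assembly.zip /user/spark/share/lib/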