
Install spark3 in analytics clusters
Closed, ResolvedPublic9 Estimated Story Points

Description

Taken from a decision in a comment below:

Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.

Create a new debian packaged 'conda base env' with the intention of using this to replace anaconda-wmf as described in T302819: Replace anaconda-wmf with smaller, non-stacked Conda environments. We can install all of this on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.

For now, we will focus on using the new conda base env to upgrade to Spark 3. Later, we will pursue replacing anaconda-wmf with this new conda base env in T302819.

To do this, we need to:

  • Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with latest Spark 3
  • Create a new 'debian' repo that uses workflow_utils conda-dist to create a debian packaged conda environment with python3.9 and pyspark 3 installed. The workflow_utils README has a good example of how to set up the python env repo for use with conda-dist. This repo and output conda env debian package will hopefully one day replace anaconda-wmf, so we should come up with a good name for it!
    1. Create the repo with conda-environment.yaml and other dependency spec files so that conda-dist can automate the env packaging.
    2. Create a debian/ dir (with scripts or instructions) that can automate running conda-dist and turning the output conda dist env into a debian package (a rough sketch follows this list). This is basically what the unpack_conda_prep_tar_into_debian_tree and build_debian_package bash functions do in the airflow env debianization.
    3. When ready, build the .deb and import to apt.wikimedia.org.
  • Write new or adapt puppetization of spark to work with Spark installed in this conda base env. This includes automating uploading of the spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configurations, making sure that Hadoop and Hive will work properly, etc.
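
A rough sketch of the packaging step (item 2 above), assuming hypothetical flags and the package name that was eventually chosen later in this task (conda-analytics); the real commands live in the workflow_utils README and the repo's debian/ scripts:

# Sketch only: flag names and paths are illustrative, not the real tooling's CLI.
# 1. Build a relocatable conda env from conda-environment.yaml with conda-dist.
conda-dist --output-dir ./dist

# 2. Unpack the resulting dist tarball into a debian package tree, mirroring
#    what unpack_conda_prep_tar_into_debian_tree does for the airflow env.
mkdir -p debian/tree/usr/lib/conda-analytics
tar -xzf dist/conda-analytics-*.tgz -C debian/tree/usr/lib/conda-analytics

# 3. Build the .deb (debian/tree is assumed to also contain DEBIAN/control),
#    mirroring build_debian_package, then import it into apt.wikimedia.org.
dpkg-deb --build debian/tree conda-analytics_0.0.1_amd64.deb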

Event Timeline


Looked into this a little bit today. Some context:

  • We create our own spark debian package from the Spark released distribution.
  • We hack the Spark distribution to remove the bundled Hadoop dependencies, but leave others like Hive. (We could use the Hadoop-less distribution, but in previous tests it didn't work because it was missing other things, like Hive.)
  • For Python, it relies on the system-installed Python (unless manually overridden via PYSPARK_PYTHON; a sketch follows this list).
    • We work around this by including binary Python dependencies for each major Python version we might use across the cluster, e.g. in the case of an OS upgrade.
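
For reference, the PYSPARK_PYTHON override mentioned above is just an environment variable; a minimal sketch, assuming the anaconda-wmf path used elsewhere in this task:

# Point Spark at a specific interpreter instead of the system python
# (applies to both driver and executors unless overridden separately).
export PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/lib/anaconda-wmf/bin/python

# Any pyspark/spark-submit launched after this picks up that interpreter.
pyspark --master yarn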

This led me to investigate various options for installing Spark. Important info:

Which gave me an idea!

Instead of relying on the system Python, we could do what we've begun to do elsewhere: rely on conda environments to standardize the version of Python we are using. Doing so could get tricky if the user also wants to use our anaconda-wmf stacked conda environment support...but not if anaconda-wmf itself just had our version of Spark installed.

I just tried this in a couple of conda envs, installing pyspark once with pip and once with conda. Both seem to work fine, and both also include the Spark Scala/Java dependencies and CLIs!
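
Roughly the kind of thing I tried (env names and Python version are illustrative):

# Install pyspark from conda-forge: pulls in the Spark jars and CLIs too.
conda create -n spark3-conda -c conda-forge python=3.9 pyspark
conda run -n spark3-conda spark-submit --version

# Install pyspark with pip inside a plain conda env.
conda create -n spark3-pip python=3.9 pip
conda run -n spark3-pip pip install pyspark
conda run -n spark3-pip spark-submit --version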

Spark R is not included. There do seem to be some conda packages for Spark R out there. I've always had a hard time maintaining Spark R support, since our team (and most of our users?) don't use it. I'd be willing to drop official support for Spark R, and ask users to install it themselves into their conda environments if they need to use it.

So, I propose seeing if we can install Spark 3 into our anaconda-wmf package. This may take a little bit of hackery to make sure the Hadoop and Hive jars and/or classpaths are set properly and use our system-installed Hadoop and Hive jars; TBD.
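
One possible approach for that hackery: a "Hadoop-free" Spark reads SPARK_DIST_CLASSPATH, so spark-env.sh could point the conda-installed Spark at the system jars. A sketch, with the Hive jar path being an assumption:

# spark-env.sh sketch: have a hadoop-less Spark use the system-installed jars.
# 'hadoop classpath' prints the classpath of the installed Hadoop client.
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"

# Also append the system Hive jars if Hive support is needed (path illustrative).
export SPARK_DIST_CLASSPATH="${SPARK_DIST_CLASSPATH}:/usr/lib/hive/lib/*"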

If this works, we'll be able to manage Spark and Python dependencies much more easily, just by upgrading them in anaconda-wmf.

Launching e.g. a Spark 3 pyspark shell would then be done like:

/usr/lib/anaconda-wmf/bin/pyspark

Launching a Spark 3 scala shell:

/usr/lib/anaconda-wmf/bin/spark-shell

etc.

If you had your own activated and stacked conda env, then just: pyspark, spark-submit, etc.

That is interesting. Would this be a stopgap until data eng upgrades to a puppetized Spark 3? I imagine so, since there are non-conda prod use cases of Spark.

Installing it in the base image might be confusing, since there will be multiple spark binaries on the PATH after activating a stacked conda env. It might make more sense to just have a wiki page describing how to install/configure Spark 3 in your own conda env on a per-need basis.

this would be a stopgap until data eng is upgrading to a puppetized spark3?

No, I think we would also use the anaconda-wmf spark3 installation. And use anaconda-wmf as the default python whenever we need to?

Being able to manage python and spark and other dependency versions in a self contained environment will make a lot of maintenance tasks much easier.

Installing in the base image might be confusing since there will be multiple spark binaries on the PATH

I think this is ok. This is true for python and pip or any other package as well. The stacked env takes precedence.
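
A generic conda illustration of that precedence (anaconda-wmf ships its own stacked-env tooling, so treat the activation path here as an assumption):

# Activate a user env stacked on top of the base env.
source /usr/lib/anaconda-wmf/etc/profile.d/conda.sh   # path is illustrative
conda activate --stack my-project-env

# The stacked env is first on PATH, so its binaries win; anything it does not
# provide (e.g. pyspark from the base env) still resolves further down the PATH.
which python     # -> .../envs/my-project-env/bin/python
which pyspark    # -> /usr/lib/anaconda-wmf/bin/pyspark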

Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen.

Decisions:

  • Create a new debian packaged 'conda base env', with the intention of eventually using it to replace anaconda-wmf as described in T302819. For now, we will focus on using the new conda base env to upgrade to Spark 3; replacing anaconda-wmf will come later.

To do this, we need to do the following:

  • Figure out how to deal with Hadoop and Hive jars as 'provided' dependencies with latest Spark 3 distribution.
  • Create a new 'debian' repo that uses workflow_utils conda-dist to create a debian packaged conda environment with python3.9 and pyspark 3 installed.
  • Write new or adapt puppetization of spark to work with Spark installed in this conda base env. This includes automating uploading of the spark assembly zip file to HDFS, setting spark-defaults.conf and spark-env.sh configurations, making sure that Hadoop and Hive will work properly, etc.

We can install all of this on the analytics-test-hadoop cluster while developing and testing, and also on the analytics-cluster alongside our current installation of Spark 2 and anaconda-wmf, with the aim of eventually deprecating both of those.
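
For the puppetization item above, the rendered configuration might end up looking roughly like this; all values are illustrative, and the real ones will come from puppet templates:

# Illustrative /etc/spark3/conf/spark-defaults.conf content, written by puppet
# in practice rather than by hand.
cat <<'EOF' | sudo tee /etc/spark3/conf/spark-defaults.conf
spark.master                     yarn
spark.sql.catalogImplementation  hive
# Pre-built archive of the Spark jars on HDFS, so jobs don't re-upload them.
spark.yarn.archive               hdfs:///user/spark/share/lib/spark3-assembly.zip
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
EOF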

@Ottomata Thanks for this update! The differential privacy project is currently using a jerry-rigged version of Spark 3 to run our software packages, so please let me know (either in this thread on phab or via slack) when you've been able to install Spark 3 on anaconda-wmf.

PS: I don't know if this will be at all useful, but if you want to take a look at how we've gotten Spark 3 working you can find the repo where we do it here.

Change 791323 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Add profile::hadoop:spark3 class and resources

https://gerrit.wikimedia.org/r/791323

Change 791323 merged by Ottomata:

[operations/puppet@production] Add profile::hadoop:spark3 class and resources

https://gerrit.wikimedia.org/r/791323

Change 791457 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure spark3 conf dir exists

https://gerrit.wikimedia.org/r/791457

Change 791457 merged by Ottomata:

[operations/puppet@production] Ensure spark3 conf dir exists

https://gerrit.wikimedia.org/r/791457

@Antoine_Quhen @JAllemandou /etc/spark3/conf is now on an-launcher1002 and an-test-client1001 :) thank you!

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:08:31Z] <aqu@deploy1002> Started deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86]

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:08:40Z] <aqu@deploy1002> Finished deploy [airflow-dags/analytics_test@95d0f86]: T295072 Spark 3 from Airflow venv pyspark [airflow-dags/analytics_test@95d0f86] (duration: 00m 08s)

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:12:33Z] <aqu@deploy1002> Started deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86]

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:12:42Z] <aqu@deploy1002> Finished deploy [airflow-dags/analytics@95d0f86]: T295072 spark 3 from airflow venv pyspark [airflow-dags/analytics@95d0f86] (duration: 00m 08s)

Experimental Spark3 is in use for 1 job triggered by Airflow: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/experimental_spark_3_dag_default_args.py

It's using /etc/spark3/conf, and Spark3 provided by pyspark (dependency of Airflow).
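
For reference, one way a venv-provided Spark 3 can pick up that puppet-managed config is via SPARK_CONF_DIR; a sketch, with an illustrative venv path:

# Point the pip-installed Spark 3 at the puppet-managed configuration.
export SPARK_CONF_DIR=/etc/spark3/conf

# Venv path is illustrative; spark-submit here comes from the pyspark package.
/srv/airflow/venv/bin/spark-submit --version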

Change 805855 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Include spark3 config on all hadoop client nodes

https://gerrit.wikimedia.org/r/805855

Change 805855 merged by Ottomata:

[operations/puppet@production] Include spark3 config on all hadoop client nodes

https://gerrit.wikimedia.org/r/805855

Change 813278 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env

https://gerrit.wikimedia.org/r/813278

Change 813278 merged by Btullis:

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env

https://gerrit.wikimedia.org/r/813278

Change 821278 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the spark3 profile

https://gerrit.wikimedia.org/r/821278

Change 821278 merged by Btullis:

[operations/puppet@production] Fix the spark3 profile

https://gerrit.wikimedia.org/r/821278

Change 821695 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env V2

https://gerrit.wikimedia.org/r/821695

EChetty set the point value for this task to 9.Aug 16 2022, 3:03 PM
EChetty moved this task from Discussed (Radar) to Sprint 00 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 00); removed Data Pipelines.

pyspark 3 is now installed with conda. The pyspark package on conda-forge marks these as dependencies:

  • numpy >=1.7
  • pandas >=0.23.2
  • pyarrow >=1.0.0

That's 3 more than the pip package pulls in.
As we are using a conda environment, installing through conda is recommended.
The environment has grown from ~250MB to ~500MB.
I think that is ok, as it will be used by analysts.
What do you think?

+1

I think those are good deps to have. We’d probably add them to our ‘analytics base env’ anyway.
I’d pin them at more recent versions though, if we can.

Pinned at:

  • numpy=1.23.1
  • pandas=1.4.3
  • pyarrow=8.0.0
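
For reference, applying those pins in the env looks roughly like this (env name is illustrative; in practice the pins live in the environment spec file):

# Pin the extra runtime dependencies at the agreed versions.
conda install -n conda-analytics -c conda-forge \
    numpy=1.23.1 pandas=1.4.3 pyarrow=8.0.0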

+1

I have removed the previously created conda-base-env package from apt.wikimedia.org.

btullis@apt1001:~$ sudo -i reprepro remove buster-wikimedia conda-base-env
Exporting indices...
Deleting files no longer referenced...

Now adding the conda-analytics environment instead.

btullis@apt1001:~$ sudo -i reprepro includedeb buster-wikimedia `pwd`/conda-analytics-0.0.8_amd64.deb
Exporting indices...
Deleting files no longer referenced...

Verified that the new version of conda-analytics is available for install.

btullis@an-test-client1001:~$ apt-cache policy conda-analytics
conda-analytics:
  Installed: 0.0.7
  Candidate: 0.0.8
  Version table:
     0.0.8 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
 *** 0.0.7 100
        100 /var/lib/dpkg/status
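
Upgrading a client to the candidate is then the usual apt step (shown with plain apt; in practice this may go through the standard deployment tooling):

# Upgrade conda-analytics to the candidate version from apt.wikimedia.org.
sudo apt-get update
sudo apt-get install -y conda-analytics=0.0.8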

After a new round of tests and bugfixes (https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/tree/main/test_scripts), I am quite confident in our install of Spark 3.
So, I think it's safe for an SRE to:
1 - upgrade the conda-analytics package on the test cluster to 0.0.9
https://debmonitor.wikimedia.org/packages/conda-analytics
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/packages/185
2 - merge the puppet code to set up Spark 3 on the test cluster
(I have checked with pcc.)

Change 821695 merged by Btullis:

[operations/puppet@production] Puppetize spark3 installation and configs using conda-analytics env V2

https://gerrit.wikimedia.org/r/821695

I believe that we can close this one as resolved?

xcollazo claimed this task.

Change 901604 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the spark3 shuffle service jars to the yarn resourcemanager

https://gerrit.wikimedia.org/r/901604

Change 901670 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Upload the spark3-assemly file to HDFS on the test cluster

https://gerrit.wikimedia.org/r/901670

Change 901604 merged by Btullis:

[operations/puppet@production] Use the spark3 shuffle jars to yarn on a test host

https://gerrit.wikimedia.org/r/901604

Change 901670 abandoned by Btullis:

[operations/puppet@production] Upload the spark3-assemly file to HDFS on the test cluster

Reason:

Change of approach. We will be generating the assembly from GitLab-CI and uploading manually.

https://gerrit.wikimedia.org/r/901670
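
For reference, the manual generation and upload might look roughly like this; the jar directory, HDFS path, and Python version are assumptions, and the real build lives in GitLab-CI:

# Zip up the Spark 3 jars shipped in conda-analytics (path/version illustrative).
cd /usr/lib/conda-analytics/lib/python3.9/site-packages/pyspark
zip -q -r /tmp/spark3-assembly.zip jars/

# Upload to HDFS so spark.yarn.archive can point at it (path illustrative).
sudo -u hdfs hdfs dfs -mkdir -p /user/spark/share/lib
sudo -u hdfs hdfs dfs -put -f /tmp/spark3-assembly.zip /user/spark/share/lib/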