
Build and install spark3 assembly
Closed, Resolved · Public · Estimated Story Points: 3

Description

Instead of having to copy the Spark libraries to HDFS for every job, it is best practice to use a single Spark assembly archive stored on HDFS.
The archive can be built this way (c = create, v = verbose, 0 = store uncompressed, f = output file; -C changes into the PySpark jars directory before adding its contents):

jar cv0f spark-3.1.2-assembly.zip -C /usr/lib/airflow/lib/python3.7/site-packages/pyspark/jars/ .

This file should then be copied to HDFS, if it does not already exist, at the path:

/user/spark/share/lib/spark-3.1.2-assembly.zip
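
A minimal sketch of the copy step, assuming the standard hdfs dfs CLI and that the archive was built in the current directory:

hdfs dfs -test -e /user/spark/share/lib/spark-3.1.2-assembly.zip || hdfs dfs -put spark-3.1.2-assembly.zip /user/spark/share/lib/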

Once this is done, we can update the spark3 configuration to reference the assembly:
https://github.com/wikimedia/puppet/blob/13dd484c4012d3c978ff7ccc244767adb5977610/modules/profile/templates/hadoop/spark3-defaults.conf.erb#L51
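
The exact template line lives in the puppet file above; as a sketch, it would presumably render to Spark's standard spark.yarn.archive property (the rendered line below is an assumption, not a copy from the template):

spark.yarn.archive    hdfs:///user/spark/share/lib/spark-3.1.2-assembly.zip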

Finally, remove the now-redundant setting from the Airflow config:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/experimental_spark_3_dag_default_args.py#L27

Event Timeline

Eventually, we may point directly to the local assembly jars installed on the workers.
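
As a sketch of that alternative, Spark's standard spark.yarn.jars property accepts the local: scheme for paths that exist on every worker (the directory below is illustrative, not confirmed):

spark.yarn.jars    local:/usr/lib/spark3/jars/*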

Let's try to test this configuration on the test cluster.

EChetty set the point value for this task to 3. (Jun 30 2022, 5:21 PM)

Change 810951 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/puppet@production] [WIP] Build spark assembly for Spark3

https://gerrit.wikimedia.org/r/810951

Latest resolution on this ticket:

  • forget about complete automation (Puppet or CI)
  • add documentation plus a shell script that any analytics user can run to create the file on HDFS (a sketch follows this list)
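
A sketch of what such a script could look like, combining the build and upload steps from the description (the paths and commands are the ones given in this task; the script name and safety checks are assumptions):

#!/bin/bash
# publish_spark3_assembly.sh (hypothetical name): build the Spark 3 assembly
# from the locally installed PySpark jars and publish it to HDFS.
set -euo pipefail

ASSEMBLY=spark-3.1.2-assembly.zip
JARS_DIR=/usr/lib/airflow/lib/python3.7/site-packages/pyspark/jars
HDFS_DIR=/user/spark/share/lib

# Do nothing if the assembly is already published.
if hdfs dfs -test -e "${HDFS_DIR}/${ASSEMBLY}"; then
    echo "${HDFS_DIR}/${ASSEMBLY} already exists; nothing to do."
    exit 0
fi

# Bundle every jar shipped with PySpark, stored uncompressed (0).
jar cv0f "${ASSEMBLY}" -C "${JARS_DIR}" .

hdfs dfs -mkdir -p "${HDFS_DIR}"
hdfs dfs -put "${ASSEMBLY}" "${HDFS_DIR}/"
echo "Published ${HDFS_DIR}/${ASSEMBLY}"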

TODO: are we sure we want to call this a '.jar' file? A jar is a zip, but I wouldn't expect a .jar file to contain other jars.

Change 810951 abandoned by Aqu:

[operations/puppet@production] [WIP] Build spark yarn archive for Spark 3 from conda-analytics package

Reason:

https://gerrit.wikimedia.org/r/810951