
Figure out a way to automate deployment of the Spark assembly file
Open, Medium, Public

Description

(Related to T335721)

YARN requires the definition of spark.yarn.archive, a zip or jar file containing all the jars from Spark. This is needed so that we don't lose time at the beginning of every Spark job uploading all of its jars.
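For illustration, here is a minimal sketch of how the setting could be passed to spark-submit. `yarn_archive_args` is a hypothetical helper, and the HDFS path and `job.py` are placeholders, not our actual locations.

```python
def yarn_archive_args(archive_uri):
    """Return spark-submit arguments that point YARN at a pre-uploaded archive."""
    # spark.yarn.archive is the Spark property; the URI is supplied by the caller.
    return ["--conf", f"spark.yarn.archive={archive_uri}"]


# Placeholder URI and job file, for illustration only.
args = yarn_archive_args("hdfs:///path/to/spark-assembly.zip")
print(" ".join(["spark-submit", "--master", "yarn", *args, "job.py"]))
```

The same key/value pair could equally live in spark-defaults.conf; the command-line form is shown only because it is easy to try on a single job.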

We have a manual solution for Spark3 at https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/blob/main/generate_spark_assembly.sh.

We also have an automated solution for Spark2 here. However, it does not work with our new way of deploying Spark via pyspark.

In this task we should reconcile these two approaches and automate the process for Spark3 as well.
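As a starting point for the automation, a rough sketch of the packaging step could look like the following. It is not a port of generate_spark_assembly.sh; `build_spark_assembly` is a hypothetical name, and the sketch assumes the archive only needs the Spark jars at its root. Uploading the result to HDFS is deliberately left out.

```python
import zipfile
from pathlib import Path


def build_spark_assembly(jars_dir, output_zip):
    """Zip every .jar in jars_dir into output_zip, with the jars at the archive root."""
    jars = sorted(Path(jars_dir).glob("*.jar"))
    if not jars:
        raise FileNotFoundError(f"no jars found in {jars_dir}")
    # Jars are already compressed, so store them as-is instead of deflating again.
    with zipfile.ZipFile(output_zip, "w", compression=zipfile.ZIP_STORED) as archive:
        for jar in jars:
            # arcname drops the directory part so each jar sits at the root.
            archive.write(jar, arcname=jar.name)
    return len(jars)
```

For a pyspark-based install, jars_dir would presumably be the jars directory inside the installed pyspark package (an assumption to verify against our conda environment), which would let the same step run as part of the environment build.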

Event Timeline

An easy fix for this is to just delete all assembly files and remove spark.yarn.archive from our Spark config.

The downside of this approach is that the assembly will be zipped and shipped to the cluster on every Spark job. I do wonder how long this takes... perhaps just a couple of seconds?
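One way to get a rough number is to time the zipping step locally. The sketch below is only that: `time_zip_jars` is a hypothetical helper, it measures a local zip into a temporary directory, and it does not include the upload to the cluster, so the result is a lower bound at best.

```python
import tempfile
import time
import zipfile
from pathlib import Path


def time_zip_jars(jars_dir):
    """Return (seconds, bytes) for writing all jars in jars_dir to a throwaway zip."""
    jars = sorted(Path(jars_dir).glob("*.jar"))
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "assembly.zip"
        start = time.perf_counter()
        with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_STORED) as archive:
            for jar in jars:
                archive.write(jar, arcname=jar.name)
        elapsed = time.perf_counter() - start
        # Read the size before the temporary directory is cleaned up.
        size = target.stat().st_size
    return elapsed, size
```

Running this against the jars directory of one of our Spark3 environments, and comparing job startup times with and without spark.yarn.archive set, would tell us whether the per-job cost is really only a couple of seconds.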

Gehel triaged this task as Medium priority. Nov 15 2023, 9:38 AM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.