Our Hadoop cluster currently supports Spark 3.1.2. In T340861, a lot of debugging was needed to get a Spark 3.3.2 conda environment to work.
For reference, here is a working call to for_virtualenv() for Spark 3.3.2:
```
merge_into = SparkSubmitOperator.for_virtualenv(
    task_id="spark_merge_into",
    virtualenv_archive=props.conda_env,
    entry_point="bin/events_merge_into.py",
    driver_memory=props.driver_memory,
    driver_cores=props.driver_cores,
    executor_memory=props.executor_memory,
    executor_cores=props.executor_cores,
    num_executors=props.num_executors,
    conf={
        "spark.driver.maxResultSize": props.spark_driver_maxResultSize,
        "spark.shuffle.service.enabled": props.spark_shuffle_service_enabled,
        "spark.dynamicAllocation.enabled": props.spark_dynamicAllocation_enabled,
        "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
        "spark.driver.extraJavaOptions": "-Divy.cache.dir=/tmp/ivy_spark3/cache -Divy.home=/tmp/ivy_spark3/home",  # fix jar pulling  # noqa
        "spark.jars.ivySettings": "/etc/maven/ivysettings.xml",  # fix jar pulling
        "spark.yarn.archive": "hdfs:///user/xcollazo/artifacts/spark-3.3.2-assembly.zip",  # override 3.1's assembly
    },
    launcher="skein",
    application_args=args,
    use_virtualenv_spark=True,
    default_env_vars={
        "SPARK_HOME": "venv/lib/python3.10/site-packages/pyspark",  # point to the packaged Spark
        "SPARK_CONF_DIR": "/etc/spark3/conf",
    },
)
```
Some of the issues:
(1) Incompatibility between Spark YARN assemblies: When running Spark on top of YARN, all of Spark's jars must be available to every executor. If spark.yarn.archive is not defined, Spark automatically builds an archive of its jars and uploads it to the distributed cache at submission time. If spark.yarn.archive is set, YARN distributes that archive as-is, regardless of which Spark version the job is actually running.
- Option 1: Although there is a minor performance cost to not setting spark.yarn.archive (each job then uploads its own jars at submission time), we should consider not setting it in our default configuration so that folks can run whatever Spark version they want.
- Option 2: Make a set of assemblies 'officially' available: right now, that set could be 3.1.2, 3.3.2 and 3.4.1. But this is problematic, as it requires SRE cycles every time someone wants to run a different version. (A conf sketch contrasting the two options follows below.)
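To make the trade-off concrete, here is a minimal sketch of the two conf variants, assuming the same SparkSubmitOperator call as above; the dict names are illustrative only:
```
# Option 1: leave spark.yarn.archive unset. Spark zips the jars under its own
# SPARK_HOME and uploads them to the distributed cache on each submission,
# so the archive always matches the Spark version shipped in the conda env.
conf_option_1 = {
    "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    # no "spark.yarn.archive" key: per-job upload, minor submission-time cost
}

# Option 2: point spark.yarn.archive at a pre-published assembly that exactly
# matches the job's Spark version (here, the 3.3.2 assembly from the call above).
conf_option_2 = {
    "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.1",
    "spark.yarn.archive": "hdfs:///user/xcollazo/artifacts/spark-3.3.2-assembly.zip",
}
```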
(2) Incompatibility between Spark Shuffle Service versions: An external shuffle service enables dynamic allocation of executors and offloads shuffle file serving from the executors themselves. Although shuffle services are typically compatible across nearby versions, the 3.1.2 shuffle service is not forward compatible with 3.3.2. We should consider running multiple shuffle services; this is being taken care of in T344910.
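Until T344910 lands, one possible per-job workaround is to disable the external shuffle service and rely on Spark 3's shuffle tracking for dynamic allocation. This is only a sketch: the conf keys below are standard Spark 3 settings, but whether this fits our cluster policy is an open question.
```
# Sketch: keep dynamic allocation without depending on the cluster's 3.1.2
# external shuffle service, by tracking shuffle data on the executors instead.
shuffle_workaround_conf = {
    "spark.shuffle.service.enabled": "false",
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}
```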
(3) Bugs in for_virtualenv(): The current code clears SPARK_CONF_DIR when use_virtualenv_spark=True. It should not. Similarly, SPARK_HOME is awkward for callers to figure out by hand. Perhaps we could derive both automatically when use_virtualenv_spark=True (see the sketch below).
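A minimal sketch of what that derivation could look like. The helper name and the assumption that pyspark lives inside the unpacked conda env under venv/lib/python<X.Y>/site-packages are illustrative; this is not the actual wmf_airflow_common implementation.
```
def virtualenv_spark_env_vars(python_version: str = "3.10", venv_root: str = "venv") -> dict:
    """Default env vars for use_virtualenv_spark=True.

    SPARK_HOME points at the pyspark package inside the unpacked virtualenv
    archive (path is relative to the YARN container working directory), and
    SPARK_CONF_DIR keeps the cluster's config instead of being cleared.
    """
    return {
        "SPARK_HOME": f"{venv_root}/lib/python{python_version}/site-packages/pyspark",
        "SPARK_CONF_DIR": "/etc/spark3/conf",
    }
```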
(4) Figure out why copy-pasting the Airflow Spark script doesn't seem to work for one-off runs: @Milimetric reports that ad-hoc runs fail. Let's investigate; we should be able to reproduce and debug this.