
Ensure that we can submit spark jobs via `spark3-submit` from airflow
Closed, Resolved (Public)

Description

Right now, we have the spark-submit executable in the airflow image, but it cannot run because Java is not installed.

airflow@airflow-scheduler-66d44f5d5b-9vcsc:/opt/airflow$ spark-submit --help
JAVA_HOME is not set
airflow@airflow-scheduler-66d44f5d5b-9vcsc:/opt/airflow$ which java
airflow@airflow-scheduler-66d44f5d5b-9vcsc:/opt/airflow$

Our DAGs also refer to spark3-submit in the Skein jobs they run (example), and not spark-submit, so we need to make sure both work.

Event Timeline

The airflow hosts all run bullseye, python3.10 and OpenJDK8. The airflow containers run on bookworm and python3.11. Both installations feature pyspark 3.1.2, which bundles the Spark jars, but the airflow containers currently don't have any JDK installed.

One question is: what version of OpenJDK do we need to install?

This is what we see on an airflow host:

brouberol@an-airflow1004:~$ "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit --version
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_412
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.

Looking at https://community.cloudera.com/t5/Community-Articles/Spark-and-Java-versions-Supportability-Matrix/ta-p/383669, we see that Spark 3.1.2 does support OpenJDK 11. However, we're also running an old Hadoop version that caused issues on OpenJDK 11 when the Spark History Server interacted with HDFS directly (https://phabricator.wikimedia.org/P54524#221159).

I don't think spark-submit would interact with HDFS directly, so OpenJDK 11 should be alright. @JAllemandou, would you be able to confirm or refute, please? Thanks!

It seems that neither OpenJDK 8 nor 11 can be installed on Bookworm. @MoritzMuehlenhoff, once we settle on a JDK version to run, would it be possible to set up a backport repo for Bookworm? Would you need a separate ticket for that? Thanks!

I've started to build a forward port of OpenJDK 8 for Bookworm (also to be used for BigTop/Hadoop). It's a little more complex than a straight rebuild, since Java 8 needs Java 8 to build itself, which involves some bootstrap kung fu, but I'll have something ready in the next few days. I'll let you know when it's ready.

I've talked to @JAllemandou to get a better understanding of what we need to support, in terms of Airflow and Spark interactions.

We do need to support shelling out to spark3-submit within the airflow task to submit a Spark job to YARN. While we also have the SparkApplication and the SparkKubernetesOperator, those schedule the entire execution in Kubernetes rather than YARN, and we need to keep supporting Spark jobs that run on YARN.

What that entails is the following:

  • we need to install OpenJDK8, due to our Hadoop version
  • we need to have the spark3-submit binary be a symlink to spark-submit, as we use it extensively in airflow-dags
  • Spark requires the following configuration files, which we can copy verbatim from an-launcher:
root@an-launcher1002:~# tree /etc/hadoop/conf /etc/spark3/conf/
/etc/hadoop/conf
├── container-executor.cfg # no
├── core-site.xml # yes
├── hadoop-env.sh # yes
├── hdfs-site.xml # yes
├── log4j.properties # yes
├── mapred-site.xml # no
├── net-topology.ini # no
├── spark_shuffle_3_1_config
│   └── spark-shuffle-site.xml # no
├── spark_shuffle_3_3_config
│   └── spark-shuffle-site.xml # no
├── spark_shuffle_3_4_config
│   └── spark-shuffle-site.xml # no
├── yarn-env.sh # yes
└── yarn-site.xml # yes
/etc/spark3/conf/
├── hive-site.xml -> /etc/hive/conf.analytics-hadoop/hive-site.xml # yes
├── log4j.properties # yes
├── spark-defaults.conf # yes
└── spark-env.sh # yes
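
The symlink requirement above can be sketched as follows. This uses a throwaway directory and a stub script so the sketch runs anywhere; in the real image, the link would point at pyspark's bin/spark-submit instead:

```shell
# Sketch: expose spark-submit under the spark3-submit name via a symlink,
# as airflow-dags expects. The stub stands in for the real spark-submit.
bindir=$(mktemp -d)
printf '#!/bin/sh\necho "spark-submit ran"\n' > "$bindir/spark-submit"
chmod +x "$bindir/spark-submit"
ln -s "$bindir/spark-submit" "$bindir/spark3-submit"
"$bindir/spark3-submit"
```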

Spark itself needs network access to:

  • yarn.wikimedia.org (we will probably need to create an external_services entry for that). We should figure out whether we can bypass ATS and the public wikimedia.org subdomain.
  • hdfs (an external_services entry already exists)
  • hive metastore (an external_services entry already exists)

we need to have the spark3-submit binary be a symlink to spark-submit, as we use it extensively in airflow-dags

This could be changed.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/config/dag_default_args.py#L49

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/hooks/spark.py#L143

Spark requires the following configuration files, that we can copy verbatim from an-launcher:

Oh ya! These are currently managed by puppet! How will we get these into containers, and how will they be kept in sync?

Change #1084818 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/docker-images/production-images@master] Publish JDK8 images based on Debian Bookworm

https://gerrit.wikimedia.org/r/1084818

How will we get these into containers

We'll render them via configmaps and mount them into the task pods.
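
As a rough sketch, the rendered manifests would have this shape; the ConfigMap name and the spark-defaults.conf contents here are purely illustrative, not the actual deployment-charts templates:

```shell
# Print an illustrative ConfigMap manifest of the kind that would carry one
# of the Spark config files into the task pods. Name and data are assumptions.
manifest=$(cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark3-conf
data:
  spark-defaults.conf: |
    spark.master yarn
EOF
)
echo "$manifest"
```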

how will they be kept in sync?

Manually. Any change we make in puppet will need to be reflected in deployment-charts during the transition period.

Change #1087135 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: expose additional ports on hadoop masters/workers

https://gerrit.wikimedia.org/r/1087135

Change #1087136 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: define external services entries for the hive metastore servers

https://gerrit.wikimedia.org/r/1087136

Change #1084818 merged by Brouberol:

[operations/docker-images/production-images@master] Publish JDK8 images based on Debian Bookworm

https://gerrit.wikimedia.org/r/1084818

We have built production images based on Bookworm and with JDK8 installed:

root@build2001:/srv/images/production-images# docker-pkg -c /etc/production-images/config.yaml build images --select '*openjdk-8*'
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/openjdk-8-jre-bookworm:8u422-b05-1-20241030
* docker-registry.discovery.wmnet/openjdk-8-jdk-bookworm:8u422-b05-1-20241030
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/openjdk-8-jre-bookworm:8u422-b05-1-20241030
* Built image docker-registry.discovery.wmnet/openjdk-8-jdk-bookworm:8u422-b05-1-20241030
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/openjdk-8-jre-bookworm:8u422-b05-1-20241030
Successfully published image docker-registry.discovery.wmnet/openjdk-8-jdk-bookworm:8u422-b05-1-20241030
== Build done! ==
You can see the logs at ./docker-pkg-build.log

We've managed to build a Bullseye-based image for airflow with OpenJDK8, as we're currently struggling to build BigTop 1.5 for Bookworm (cf T378954): https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/22

Change #1087135 merged by Brouberol:

[operations/puppet@production] global_config: expose additional ports on hadoop masters/workers

https://gerrit.wikimedia.org/r/1087135

Change #1087136 merged by Brouberol:

[operations/puppet@production] global_config: define external services entries for the hive metastore servers

https://gerrit.wikimedia.org/r/1087136

The new Bullseye image seems to be in good shape. However, we have a permission error preventing Python from reading the dependency modules:

airflow@d042e353c760:/opt/airflow$ java -version
openjdk version "1.8.0_412"
OpenJDK Runtime Environment (build 1.8.0_412-8u412-ga-1~deb11u1-b08)
OpenJDK 64-Bit Server VM (build 25.412-b08, mixed mode)
airflow@d042e353c760:/opt/airflow$ spark3-submit
/opt/airflow/bin/spark3-submit: line 27: /usr/local/lib/python3.9/site-packages/pyspark/bin/spark-class: Permission denied
/opt/airflow/bin/spark3-submit: line 27: exec: /usr/local/lib/python3.9/site-packages/pyspark/bin/spark-class: cannot execute: Permission denied
airflow@d042e353c760:/opt/airflow$ spark-submit
/usr/local/bin/spark-submit: line 27: /usr/local/lib/python3.9/site-packages/pyspark/bin/spark-class: Permission denied
/usr/local/bin/spark-submit: line 27: exec: /usr/local/lib/python3.9/site-packages/pyspark/bin/spark-class: cannot execute: Permission denied
airflow@d042e353c760:/opt/airflow$ ls $HADOOP_HDFS_HOME
bin                                  hadoop-hdfs-client.jar                      hadoop-hdfs-nfs.jar               lib
hadoop-hdfs-2.10.2-tests.jar         hadoop-hdfs-native-client-2.10.2-tests.jar  hadoop-hdfs-rbf-2.10.2-tests.jar  sbin
hadoop-hdfs-2.10.2.jar               hadoop-hdfs-native-client-2.10.2.jar        hadoop-hdfs-rbf-2.10.2.jar        webapps
hadoop-hdfs-client-2.10.2-tests.jar  hadoop-hdfs-native-client.jar               hadoop-hdfs-rbf.jar
hadoop-hdfs-client-2.10.2.jar        hadoop-hdfs-nfs-2.10.2.jar                  hadoop-hdfs.jar
airflow@d042e353c760:/opt/airflow$ ls $HADOOP_YARN_HOME
bin                                                        hadoop-yarn-server-nodemanager-2.10.2.jar
etc                                                        hadoop-yarn-server-nodemanager.jar
hadoop-yarn-api-2.10.2.jar                                 hadoop-yarn-server-resourcemanager-2.10.2.jar
hadoop-yarn-api.jar                                        hadoop-yarn-server-resourcemanager.jar
hadoop-yarn-applications-distributedshell-2.10.2.jar       hadoop-yarn-server-router-2.10.2.jar
hadoop-yarn-applications-distributedshell.jar              hadoop-yarn-server-router.jar
hadoop-yarn-applications-unmanaged-am-launcher-2.10.2.jar  hadoop-yarn-server-sharedcachemanager-2.10.2.jar
hadoop-yarn-applications-unmanaged-am-launcher.jar         hadoop-yarn-server-sharedcachemanager.jar
hadoop-yarn-client-2.10.2.jar                              hadoop-yarn-server-tests-2.10.2.jar
hadoop-yarn-client.jar                                     hadoop-yarn-server-tests.jar
hadoop-yarn-common-2.10.2.jar                              hadoop-yarn-server-timeline-pluginstorage-2.10.2.jar
hadoop-yarn-common.jar                                     hadoop-yarn-server-timeline-pluginstorage.jar
hadoop-yarn-registry-2.10.2.jar                            hadoop-yarn-server-web-proxy-2.10.2.jar
hadoop-yarn-registry.jar                                   hadoop-yarn-server-web-proxy.jar
hadoop-yarn-server-applicationhistoryservice-2.10.2.jar    lib
hadoop-yarn-server-applicationhistoryservice.jar           sbin
hadoop-yarn-server-common-2.10.2.jar                       timelineservice
hadoop-yarn-server-common.jar
airflow@d042e353c760:/opt/airflow$ ls $JAVA_HOME
ASSEMBLY_EXCEPTION  THIRD_PARTY_README  bin  lib  man
airflow@d042e353c760:/opt/airflow$ ls $SPARK_HOME
ls: cannot access '/usr/local/lib/python3.9/site-packages/pyspark': Permission denied
airflow@d042e353c760:/opt/airflow$ ls -al /usr/local/lib/python3.9
total 32
drwxr-xr-x   1 app  app   4096 Nov  5 10:29 .
drwxr-xr-x   1 app  app   4096 Nov  5 10:29 ..
drwxr-xr-x   2 root root  4096 Nov  5 10:22 dist-packages
drwx------ 458 app  app  20480 Nov  5 10:29 site-packages
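
The listing above shows site-packages is mode 0700 and owned by app, so the airflow user cannot traverse it. One way such a problem could be addressed at image-build time (an assumption on my part, not necessarily the fix that was ultimately applied) is to grant world read/traverse bits, demonstrated here on a throwaway directory:

```shell
# Recreate the problem on a scratch directory: mode 0700, unreadable by others.
pkgdir=$(mktemp -d)/site-packages
mkdir -m 700 "$pkgdir"
# a+rX grants read to everyone, and execute (traverse) only where it makes
# sense -- directories and files that are already executable.
chmod -R a+rX "$pkgdir"
stat -c '%a' "$pkgdir"
```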

Change #1087454 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: export a PYTHONPATH env var reflecting the new bullseye based image

https://gerrit.wikimedia.org/r/1087454

Change #1087454 merged by Brouberol:

[operations/deployment-charts@master] airflow: export a PYTHONPATH env var reflecting the new bullseye based image

https://gerrit.wikimedia.org/r/1087454
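
The gist of that change is an environment variable along these lines; the exact value is an assumption based on the paths in the transcripts above:

```shell
# Sketch: point PYTHONPATH at the Bullseye image's site-packages so the
# Spark/pyspark modules resolve. Path is assumed from the transcripts above.
export PYTHONPATH=/usr/local/lib/python3.9/site-packages
# Entries in PYTHONPATH are prepended to Python's module search path:
python3 -c 'import sys; print("/usr/local/lib/python3.9/site-packages" in sys.path)'
```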

airflow@airflow-scheduler-59874c6987-n52kr:/opt/airflow$ spark3-submit
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
...

spark3-submit now runs as expected. Next, let's make an actual job submission work.
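
Illustratively, a YARN cluster-mode submission of the kind we want to exercise looks like this. A stub spark3-submit keeps the sketch runnable anywhere; the options and application name are assumptions, not the actual test DAG's invocation:

```shell
# Sketch: the shape of a YARN cluster-mode submission. The stub below only
# echoes its arguments; the real binary would contact the resource manager.
bindir=$(mktemp -d)
cat > "$bindir/spark3-submit" <<'EOF'
#!/bin/sh
echo "would submit: $*"
EOF
chmod +x "$bindir/spark3-submit"
PATH="$bindir:$PATH" spark3-submit --master yarn --deploy-mode cluster pi.py
```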

Change #1087903 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: render the spark/hadoop/hdfs/yarn configuration files

https://gerrit.wikimedia.org/r/1087903

brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/912

test_k8s: introduce a DAG submitting a simple Spark job to YARN in cluster mode

Change #1087903 merged by Brouberol:

[operations/deployment-charts@master] airflow: render the spark/hadoop/hdfs/yarn configuration files

https://gerrit.wikimedia.org/r/1087903

BTullis triaged this task as High priority. Nov 8 2024, 10:56 AM

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/912

test_k8s: introduce a DAG submitting a simple Spark job to YARN in cluster mode

Change #1090468 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: upgrade image

https://gerrit.wikimedia.org/r/1090468

Change #1090468 merged by Brouberol:

[operations/deployment-charts@master] airflow: upgrade image

https://gerrit.wikimedia.org/r/1090468