
Run Spark Connect server in Analytics cluster
Closed, Declined (Public)

Description

We would like to use the thin client for Spark in order to write data from an Elixir application. It may be possible to do this by starting a Spark Connect server in YARN.

Demonstrate that Spark Connect can be run on Analytics cluster
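For reference, the success criterion can be sketched as follows: once a Spark Connect server is listening (its default port is 15002), a thin client can attach to it over gRPC without a local Spark installation. The hostname below is a placeholder, not a real Analytics host:

```shell
# Hypothetical end state: a Spark Connect server runs in YARN and
# clients attach over gRPC. PySpark 3.4+ ships a thin-client mode
# behind the --remote flag; the hostname is a placeholder.
pyspark --remote "sc://an-launcher.example.invalid:15002"
```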

References:
https://wikitech.wikimedia.org/wiki/HTTP_proxy#Maven_proxy_configuration_example

Experimental commands:

source /opt/conda-analytics/etc/profile.d/conda.sh
conda create -n spark34 python=3.10.8 pyspark=3.4.1 conda-pack=0.7.0 ipython jupyterlab=3.4.8 jupyterhub-singleuser=1.5.0 urllib3=1.26.11
conda activate spark34
pip install grpcio==1.48.1 protobuf grpcio-status
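conda-pack is in the env spec above, presumably so the environment can be shipped to YARN executors. A hedged sketch of that step (output name and HDFS path are placeholders):

```shell
# Assumption: the spark34 env is packed into a tarball so YARN
# executors can unpack the same Python. Paths are placeholders.
conda pack -n spark34 -o spark34-env.tgz
hdfs dfs -put spark34-env.tgz /user/$USER/spark34-env.tgz
```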

wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1.tgz

Attempt to fetch the spark-connect jar and put it into HDFS. (Update: it was compiled with the wrong version of Java.)

spark3-submit --conf spark.jars.ivySettings=/etc/maven/ivysettings.xml --master yarn --packages=org.apache.spark:spark-connect_2.12:3.4.1 org.apache.spark.sql.connect.service.SparkConnectServer
hdfs dfs -put ~/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.1.jar /user/awight/org.apache.spark_spark-connect_2.12-3.4.1.jar

This runs the service, but it eventually fails because of the version mismatch:

spark3-submit --master yarn --class org.apache.spark.sql.connect.service.SparkConnectServer hdfs:///user/awight/org.apache.spark_spark-connect_2.12-3.4.1.jar

error:

Exception in thread "main" java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.<init>(ZIIIIIIZ)V

Trying to compile Spark locally:

mvn package
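A full `mvn package` builds every Spark module; if only the connect server jar is needed, Maven can restrict the build to that module and its dependencies. The module path assumes the Spark 3.4 source layout:

```shell
# Sketch: build only the Spark Connect server module (plus the
# modules it depends on), skipping tests, from the Spark 3.4 tree.
./build/mvn -DskipTests -pl connector/connect/server -am package
```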

Run Spark Connect for local development

TODO: not working yet

docker run -it --network spark --name spark-master -p 15002:15002 -p 8082:8080 spark:3.4.1-scala2.12-java11-ubuntu bash -c "/opt/spark/sbin/start-master.sh; tail -f /dev/null"

But this fails to override the ivy cache config:

docker exec -it spark-master bash -c "../sbin/start-connect-server.sh -Divy.home=/tmp/ivy2 `pwd` --packages=org.apache.spark:spark-connect_2.12:3.4.1"
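If the local server does come up, a minimal smoke test from the host would be to attach with the PySpark thin client. This assumes pyspark 3.4.1 with the connect extra is installed locally and that port 15002 (published by the docker run above) is reachable:

```shell
# Hypothetical smoke test against the container started above.
pip install 'pyspark[connect]==3.4.1'
python - <<'EOF'
from pyspark.sql import SparkSession

# Attach to the local Spark Connect server over gRPC.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(3).show()
EOF
```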

Write Elixir adapter

We'll also need the Elixir glue code to call Spark Connect; work in progress at https://gitlab.com/wmde/technical-wishes/apache_spark_connect_ex

Event Timeline

Is it possible that the proxy is blocking maven?

Did you run set_proxy first?

After discussion with @xcollazo, I'll take a simpler path and write files to a temporary filesystem. This will be described in a new task...