Our YARN nodes currently have two auxiliary shuffle services available:
- mapreduce_shuffle
- spark_shuffle
The mapreduce_shuffle service is not affected by this upgrade, but we need to address upgrading the spark_shuffle service to version 3.
There is further information about the external shuffle service here: https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
The spark_shuffle service is enabled by the following property in /etc/hadoop/conf/yarn-site.xml:

```
<% if @yarn_use_spark_shuffle -%>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
<% end -%>
```
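For reference, the Spark documentation linked above notes that this class-name property only takes effect when spark_shuffle is also listed in the NodeManager's aux-services list. Given that our nodes run both services, the rendered pairing in yarn-site.xml presumably looks something like this (the exact rendering on our nodes is an assumption):

```
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
```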
However, the jar file that provides this spark_shuffle service is still the one that was shipped with our spark2 distribution.
```
btullis@an-worker1078:~$ find /usr/lib/hadoop-yarn/lib/ -name '*shuffle*' -ls
  1332949      0 lrwxrwxrwx   1 root     root           49 Mar  5  2021 /usr/lib/hadoop-yarn/lib/spark2-yarn-shuffle.jar -> /usr/lib/spark2/yarn/spark-2.4.4-yarn-shuffle.jar
```
We have a script in puppet that was used to make the spark2 shuffler jar file available to YARN nodemanagers.
https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/files/hadoop/spark2/spark2_yarn_shuffle_jar_install.sh
A similar script was prepared in readiness for the spark3 upgrade:
https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/files/hadoop/spark3/spark3_yarn_shuffle_jar_install.sh#L4
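Judging by the symlink found on an-worker1078 above, the core of what these install scripts do is link the versioned shuffle jar from the Spark distribution into the NodeManager's lib directory under a stable, unversioned name. A hypothetical sketch of that logic (paths are stand-ins under a temp directory for illustration, not the script's actual code):

```shell
# Stand-ins for /usr/lib/spark2/yarn and /usr/lib/hadoop-yarn/lib
demo="$(mktemp -d)"
spark_yarn_dir="${demo}/usr/lib/spark2/yarn"
yarn_lib_dir="${demo}/usr/lib/hadoop-yarn/lib"
mkdir -p "${spark_yarn_dir}" "${yarn_lib_dir}"
touch "${spark_yarn_dir}/spark-2.4.4-yarn-shuffle.jar"

# Locate the versioned jar shipped with the Spark distribution...
shuffle_jar="$(find "${spark_yarn_dir}" -name 'spark-*-yarn-shuffle.jar' | head -n 1)"

# ...and expose it to YARN under a version-independent name, so that the
# NodeManager classpath stays stable across Spark point releases.
ln -sf "${shuffle_jar}" "${yarn_lib_dir}/spark2-yarn-shuffle.jar"

readlink "${yarn_lib_dir}/spark2-yarn-shuffle.jar"
```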
However, this script will not work because our distribution of spark3 is obtained from conda-forge:
https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/blob/main/conda-environment.yml#L11-12
Unfortunately, it seems that this version was not built with the YARN profile enabled, so there is no spark-3.1.2-yarn-shuffle.jar file available on the workers for us to symlink.
```
btullis@an-worker1078:~$ find /usr/lib/spark2/yarn
/usr/lib/spark2/yarn
/usr/lib/spark2/yarn/spark-2.4.4-yarn-shuffle.jar
btullis@an-worker1078:~$ find /usr/lib/spark3/yarn
find: '/usr/lib/spark3/yarn': No such file or directory
```
We have various options:
- Download the jar manually and store it in Archiva, deploying it to the workers from there
- Rebuild our spark distribution with the -Pyarn Maven profile enabled
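For the rebuild option, the upstream Spark build documentation enables the yarn profile via Maven, along these lines (the exact flags for our conda-analytics packaging pipeline would differ, so treat this as illustrative):

```
./build/mvn -Pyarn -DskipTests clean package
```

With that profile active, the network-yarn module should produce the spark-*-yarn-shuffle.jar artifact that is currently missing from the conda-forge build.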