
Repackage spark without hadoop, use provided hadoop jars
Closed, Resolved (Public)

Description

In https://phabricator.wikimedia.org/T273711#6817104 we encountered a problem where old Hadoop versions were still being used by Spark jobs, even though our cluster's Hadoop had been upgraded. This led to strange issues like https://www.irccloud.com/pastebin/qfy1lpD8/.

We should stop packaging Hadoop dependencies with our installed Spark distribution, and instead always use the cluster-provided ones.

https://spark.apache.org/docs/latest/hadoop-provided.html
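Per those docs, a 'Hadoop free' build mostly boils down to pointing Spark at the cluster's Hadoop jars via SPARK_DIST_CLASSPATH, e.g. (a minimal sketch; the spark2 conf path is an assumption):

  # /etc/spark2/conf/spark-env.sh
  # make Spark pick up the cluster-provided Hadoop jars at runtime
  export SPARK_DIST_CLASSPATH=$(hadoop classpath)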

To do this, we need to rebuild the Spark Debian package from the hadoop-less Spark tarball, then recreate and re-upload the spark-2.4.4-assembly.jar without Hadoop.
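Recreating the assembly without Hadoop should amount to zipping up Spark's jars/ directory minus the hadoop-* jars, something like this (a sketch; the deployed artifact is a .zip of jars, per the timeline below):

  cd /usr/lib/spark2
  # bundle all Spark jars except the bundled Hadoop ones into the assembly
  zip -r spark-2.4.4-assembly.zip jars/ -x 'jars/hadoop-*'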

Event Timeline

Some findings:

  • spark-2.4.4-bin-without-hadoop.tgz does not include any Hive support. It also seems to be missing some Spark-on-Hadoop pieces that I think we need, e.g. spark-yarn-shuffle and spark-yarn (see the tarball-listing sketch after this list).
  • spark-2.4.4-bin-hadoop2.7.tgz does not work either: I get the same error, with mismatched Hadoop versions in the stack trace. I had hoped this might just work, since the Spark docs say 'Apache Hadoop 2.7.X and later'.
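One quick way to see what each tarball does and does not ship (a sketch; exact jar names may vary):

  # list Hive-, YARN-shuffle-, and Hadoop-related jars bundled in each tarball
  tar tzf spark-2.4.4-bin-without-hadoop.tgz | grep -E 'hive|yarn-shuffle|hadoop-'
  tar tzf spark-2.4.4-bin-hadoop2.7.tgz | grep -E 'hive|yarn-shuffle|hadoop-'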

I think we have two options left.

  1. Use spark-2.4.4-bin-hadoop2.7.tgz but manually remove all of the hadoop-* jars. This is essentially the same workaround I've got for Refine right now, where I set spark.yarn.archive to my custom-built spark-assembly.zip file with the hadoop-* jars removed.
  2. Build Spark 2.4.4 from source with -Phadoop-provided (see the build sketch after this list). I've almost gotten this to work, but I can't seem to build with -Psparkr. Also, with -Phadoop-provided, we end up missing some of the same Spark YARN related jars, like spark-yarn-shuffle and spark-yarn.
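For option 2, the upstream entry point is dev/make-distribution.sh. A sketch of the invocation I mean, with profiles per the Spark build docs (-Psparkr is the part that fails for me):

  # build a Spark 2.4.4 distribution that expects the cluster to provide Hadoop
  ./dev/make-distribution.sh --name hadoop-provided --tgz \
      -Phadoop-provided -Pyarn -Phive -Phive-thriftserver -Psparkr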

Option 2 is probably the 'right' thing to do, and would allow us to create a real Debian source package rather than using the binary tarball. Option 1 is probably the easier and less risky thing to do, as it would reduce the surface area of the changes we make.

I'm inclined to try Option 1 for now, and revisit our Spark packaging when we upgrade to Spark 3. Perhaps we can even use BigTop at that time?

Change 664922 had a related patch set uploaded (by Ottomata; owner: Andrew Otto):
[operations/debs/spark2@debian] Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies.

https://gerrit.wikimedia.org/r/664922

Actually, at the moment I am pursuing option 3.

  3. Use spark-2.4.4-bin-hadoop2.6.tgz but remove the Hadoop 2.6 jars and manually include the Hadoop 2.10.1 jars.

I think this will mean fewer changes for now, keeping the (correct-version) Hadoop deps bundled with Spark as we have done so far.
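The jar swap itself would be mechanically simple, along these lines (a sketch; the source path for the 2.10.1 jars is an assumption about the build host):

  cd spark-2.4.4-bin-hadoop2.6/jars
  # drop the bundled Hadoop 2.6 jars...
  rm -f hadoop-*.jar
  # ...and copy in the cluster's Hadoop 2.10.1 jars instead (hypothetical path)
  cp /usr/lib/hadoop/hadoop-*-2.10.1.jar .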

The tricky part is, I don't know how to test this well beyond installing it!

I have tested Refine using the Spark assembly jar built this way, and it works just fine. Beyond that, if Spark is working now, I'd expect it to continue working when I install this package everywhere in the cluster. Everything in this new .deb is the same except for the hadoop-* jars, which are the same ones already included on the cluster now.

@elukey, if you are ok with this, I'm inclined to try installing this .deb in the whole cluster, and replacing the spark-2.4.4-assembly.jar. I will not have time for this tomorrow, and Friday is bad, so perhaps Monday?

Change 664922 merged by Ottomata:
[operations/debs/spark2@debian] Spark 2.4.4 with manually included Hadoop 2.10.1 dependencies.

https://gerrit.wikimedia.org/r/664922

Ok, Option 3 is not looking good. I installed spark2 with my manually added Hadoop 2.10.1 jars on an-test-client1001, and I can't start Spark locally: I get Exception in thread "main" java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper. However, if I do export SPARK_DIST_CLASSPATH=$(hadoop classpath), it works.

So, it looks as if my manual replacement was too hacky for Spark. I'm going to try Option 1.
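For the record: com.ctc.wstx.* is Woodstox, which I believe newer Hadoop pulls in for XML config parsing; hadoop classpath includes it, which is why the export fixes things. A sketch for hunting down which jar provides a missing class (lib path hypothetical):

  # find the jar that contains the missing class
  for j in /usr/lib/hadoop/lib/*.jar; do
    unzip -l "$j" 2>/dev/null | grep -q 'com/ctc/wstx/io/InputBootstrapper' && echo "$j"
  done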

Mentioned in SAL (#wikimedia-analytics) [2021-02-19T15:43:17Z] <ottomata> installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384

Change 665359 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Update test cluster refine jobs to use event platform schemas

https://gerrit.wikimedia.org/r/665359

Change 665359 merged by Ottomata:
[operations/puppet@production] Update test cluster refine jobs to use event platform schemas

https://gerrit.wikimedia.org/r/665359

Change 665362 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Spark 2.4.4 with Hadoop jars removed

https://gerrit.wikimedia.org/r/665362

Change 665362 merged by Ottomata:
[operations/debs/spark2@debian] Spark 2.4.4 with Hadoop jars removed

https://gerrit.wikimedia.org/r/665362

sudo -u hdfs hdfs dfs -mv /user/spark/share/lib/spark-2.4.4-assembly.zip /user/spark/share/lib/spark-2.4.4-hadoop2.6-assembly.zip

I upgraded spark on an-coord1001, which is the only node that has profile::hadoop::spark2::install_assembly: true. Re-running puppet there got me:

-rw-r--r--   3 hdfs  spark  195258057 2021-02-22 14:14 /user/spark/share/lib/spark-2.4.4-assembly.zip

This is the same one installed with the new package:

14:12:42 [@stat1004:/home/otto] $ ls -l /usr/lib/spark2/spark-2.4.4-assembly.zip
-rw-r--r-- 1 root root 195258057 Feb 19 03:23 /usr/lib/spark2/spark-2.4.4-assembly.zip
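To double check that the HDFS copy is byte-identical to the packaged one, checksums are sturdier than matching sizes (a sketch):

  # compare the HDFS assembly with the locally installed one
  hdfs dfs -cat /user/spark/share/lib/spark-2.4.4-assembly.zip | sha256sum
  sha256sum /usr/lib/spark2/spark-2.4.4-assembly.zip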

Change 666127 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Refine - re-add accidentally removed spark conf from last patch

https://gerrit.wikimedia.org/r/666127

Change 666127 merged by Ottomata:
[operations/puppet@production] Refine - re-add accidentally removed spark conf from last patch

https://gerrit.wikimedia.org/r/666127

Mentioned in SAL (#wikimedia-analytics) [2021-02-22T14:38:46Z] <ottomata> upgrade spark2 on analytics cluster to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) - T274384

sudo -u hdfs hdfs dfs  -mv /user/oozie/share/lib/lib_20210210190411/spark-2.4.4 /tmp/oozie-sharelib-spark-2.4.4-hadoop.2.6

After upgrading spark2 everywhere, running puppet on an-coord1001 got me:

drwxr-xr-x   - oozie hadoop          0 2021-02-22 14:47 /user/oozie/share/lib/lib_20210210190411/spark-2.4.4

with no hadoop jars inside.
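A quick sanity check that the refreshed sharelib really contains no Hadoop jars (a sketch):

  # should print 0 if the hadoop jars are gone
  hdfs dfs -ls /user/oozie/share/lib/lib_20210210190411/spark-2.4.4 | grep -c 'hadoop-'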

Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

Mentioned in SAL (#wikimedia-analytics) [2021-02-22T19:27:37Z] <ottomata> restart oozie on an-coord1001 to pick up new spark share lib without hadoop jars - T274384

Change 666214 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] bin/camus - use hadoop classpath when running checker jar.

https://gerrit.wikimedia.org/r/666214

Change 666214 merged by Milimetric:
[analytics/refinery@master] bin/camus - use hadoop classpath when running checker jar.

https://gerrit.wikimedia.org/r/666214
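For context, the shape of that bin/camus change is just to put the cluster's Hadoop jars on the JVM classpath when invoking the checker jar, roughly like this (a minimal sketch; the jar name and main class are hypothetical placeholders, not the real refinery ones):

  # run the Camus checker with the cluster-provided Hadoop jars on the classpath
  java -cp "camus-checker.jar:$(hadoop classpath)" org.example.CamusChecker "$@"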