
Upgrade spark 2.3.0 -> 2.3.1 on analytics cluster
Closed, ResolvedPublic3 Estimated Story Points

Description

There is a bug in spark 2.3.0, SPARK-23729, which breaks the mjolnir deployment. There is a workaround deployed, but it would be nice to bump the minor version so we don't need workarounds.

I have verified by running 2.3.1 on the cluster that the bug we are experiencing is indeed fixed. The simplest repro:

  • Create a zip file with anything in it named mjolnir_venv.zip
  • Run pyspark2 --master yarn --archives 'mjolnir_venv.zip#venv'
  • In the python shell run:
import subprocess
print sc.parallelize([1]).map(lambda x: subprocess.check_output(['ls', '-l'])).collect()[0]
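
The `#venv` suffix on the `--archives` argument is a fragment alias: YARN is expected to localize the archive and link it into the container working directory under the name after the `#`. A minimal sketch of that fragment syntax (a hypothetical helper for illustration, not Spark's actual code):

```python
def parse_archive_arg(arg):
    """Split an --archives entry like 'mjolnir_venv.zip#venv' into
    (source archive, link name). Without a '#' fragment, the link
    name defaults to the archive's own file name."""
    if '#' in arg:
        src, alias = arg.split('#', 1)
        return src, alias
    return arg, arg

print(parse_archive_arg('mjolnir_venv.zip#venv'))  # ('mjolnir_venv.zip', 'venv')
print(parse_archive_arg('pyspark.zip'))            # ('pyspark.zip', 'pyspark.zip')
```

The bug in 2.3.0 is that the alias half of this pair is ignored when the link is created in the container.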

On 2.3.0 this prints the following. In particular, notice that we have a mjolnir_venv.zip symlink here, rather than the requested rename to venv:

total 44
-rw-r--r-- 1 yarn yarn  102 Jul 30 22:03 container_tokens                                                           
-rwx------ 1 yarn yarn  732 Jul 30 22:03 default_container_executor_session.sh                                                                        
-rwx------ 1 yarn yarn  786 Jul 30 22:03 default_container_executor.sh                                                                                                                                           
-rwx------ 1 yarn yarn 6597 Jul 30 22:03 launch_container.sh                                                                                                                                                     
lrwxrwxrwx 1 yarn yarn   89 Jul 30 22:03 mjolnir_venv.zip -> /var/lib/hadoop/data/j/yarn/local/usercache/ebernhardson/filecache/26129/mjolnir_venv.zip                                                        
lrwxrwxrwx 1 yarn yarn   92 Jul 30 22:03 py4j-0.10.6-src.zip -> /var/lib/hadoop/data/f/yarn/local/usercache/ebernhardson/filecache/26130/py4j-0.10.6-src.zip                                                     
lrwxrwxrwx 1 yarn yarn   84 Jul 30 22:03 pyspark.zip -> /var/lib/hadoop/data/g/yarn/local/usercache/ebernhardson/filecache/26131/pyspark.zip                                                                     
lrwxrwxrwx 1 yarn yarn   91 Jul 30 22:03 __spark_conf__ -> /var/lib/hadoop/data/h/yarn/local/usercache/ebernhardson/filecache/26132/__spark_conf__.zip  
lrwxrwxrwx 1 yarn yarn   67 Jul 30 22:03 __spark_libs__ -> /var/lib/hadoop/data/f/yarn/local/filecache/129/spark2-assembly.zip
drwx--x--- 2 yarn yarn 4096 Jul 30 22:03 tmp

When run on 2.3.1 (or on versions prior to 2.3.0) we get the following. Notice that here the symlink was appropriately renamed to venv:

-rw-r--r-- 1 yarn yarn  102 Jul 30 22:11 container_tokens
-rwx------ 1 yarn yarn  732 Jul 30 22:11 default_container_executor_session.sh
-rwx------ 1 yarn yarn  786 Jul 30 22:11 default_container_executor.sh
-rwx------ 1 yarn yarn 6706 Jul 30 22:11 launch_container.sh
lrwxrwxrwx 1 yarn yarn   92 Jul 30 22:11 py4j-0.10.7-src.zip -> /var/lib/hadoop/data/l/yarn/local/usercache/ebernhardson/filecache/22771/py4j-0.10.7-src.zip
lrwxrwxrwx 1 yarn yarn   84 Jul 30 22:11 pyspark.zip -> /var/lib/hadoop/data/i/yarn/local/usercache/ebernhardson/filecache/22768/pyspark.zip
lrwxrwxrwx 1 yarn yarn   91 Jul 30 22:11 __spark_conf__ -> /var/lib/hadoop/data/j/yarn/local/usercache/ebernhardson/filecache/22769/__spark_conf__.zip
lrwxrwxrwx 1 yarn yarn  110 Jul 30 22:11 __spark_libs__ -> /var/lib/hadoop/data/k/yarn/local/usercache/ebernhardson/filecache/22770/__spark_libs__2999084826558035159.zip
drwx--x--- 2 yarn yarn 4096 Jul 30 22:11 tmp
lrwxrwxrwx 1 yarn yarn   89 Jul 30 22:11 venv -> /var/lib/hadoop/data/h/yarn/local/usercache/ebernhardson/filecache/22767/mjolnir_venv.zip
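
The difference between the two listings reduces to which link name is present in the container working directory. A small hypothetical check over the symlink names (names below are taken from the listings above):

```python
def alias_honored(link_names, archive='mjolnir_venv.zip', alias='venv'):
    """Given symlink names from the container working directory,
    return True iff the archive was linked under its requested
    alias rather than under its original file name."""
    names = set(link_names)
    return alias in names and archive not in names

# 2.3.0 listing: the alias was ignored.
print(alias_honored(['mjolnir_venv.zip', 'pyspark.zip', '__spark_conf__']))  # False
# 2.3.1 listing: the alias was honored.
print(alias_honored(['venv', 'pyspark.zip', '__spark_conf__']))              # True
```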

Event Timeline

Upgrading the version here will likely be pretty easy... we'll triage this task soon.

Milimetric triaged this task as Medium priority. Aug 2 2018, 3:26 PM
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

deb built:

https://apt.wikimedia.org/wikimedia/pool/main/s/spark2/

Tested in labs with Refine job, works fine. @JAllemandou any objections to upgrading everywhere?

I have not tested but I don't see why it would break :)
Let's go !

Change 451068 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/swap/deploy@master] Use versionless symlink for spark kernels that use py4j

https://gerrit.wikimedia.org/r/451068

Change 451069 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/jupyterhub/deploy@master] Use versionless symlink for spark kernels that use py4j

https://gerrit.wikimedia.org/r/451069

Change 451068 abandoned by Ottomata:
Use versionless symlink for spark kernels that use py4j

Reason:
wrong repo

https://gerrit.wikimedia.org/r/451068

Change 451069 merged by Ottomata:
[analytics/jupyterhub/deploy@master] Use versionless symlink for spark kernels that use py4j

https://gerrit.wikimedia.org/r/451069

Nuria set the point value for this task to 3.

I'm not sure what exactly, but it seems something might have been missed. Starting an oozie workflow with

<property>
    <name>oozie.action.sharelib.for.spark</name>
    <value>spark2.3.1</value>
</property>
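
This property selects which subdirectory of the Oozie sharelib the Spark action loads its jars from; `spark2.3.1` should resolve under the sharelib root mentioned below. A hypothetical sketch of the mapping, with the root and timestamped lib directory taken from this cluster as assumptions:

```python
def sharelib_path(value,
                  root='hdfs://analytics-hadoop/user/oozie/share/lib',
                  lib_dir='lib_20170228165236'):
    """Map an oozie.action.sharelib.for.spark value to the HDFS
    directory Oozie is expected to load jars from (paths are
    assumptions based on this cluster's layout)."""
    return '%s/%s/%s' % (root, lib_dir, value)

print(sharelib_path('spark2.3.1'))
# hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/spark2.3.1
```

The ClassNotFoundException below suggests Oozie was not actually picking up that directory even though it existed.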

I receive (from https://hue.wikimedia.org/oozie/list_oozie_workflow/0051220-180705103628398-oozie-oozi-W/):

2018-08-09 22:35:09,236  WARN SparkActionExecutor:523 - SERVER[analytics1003.eqiad.wmnet] USER[ebernhardson] GROUP[-] TOKEN[] APP[discovery-transfer_to_es-discovery.popularity_score-2018,8,6->cirrussearch-wf] JOB[0051220-180705103628398-oozie-oozi-W] ACTION[0051220-180705103628398-oozie-oozi-W@transfer] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
2018-08-09 22:35:09,236  WARN SparkActionExecutor:523 - SERVER[analytics1003.eqiad.wmnet] USER[ebernhardson] GROUP[-] TOKEN[] APP[discovery-transfer_to_es-discovery.popularity_score-2018,8,6->cirrussearch-wf] JOB[0051220-180705103628398-oozie-oozi-W] ACTION[0051220-180705103628398-oozie-oozi-W@transfer] Launcher exception: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found

Changing back to 2.3.0, it loads fine. I can't really guess at the difference, though. hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/spark2.3.0 seems to be about the same as the matching spark2.3.1 directory. Maybe there is something in oozie that has to be poked for this to become active?

Change 451857 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use Oozie REST API to update sharelib for spark2 instead of CLI

https://gerrit.wikimedia.org/r/451857

Change 451857 merged by Ottomata:
[operations/puppet@production] Use Oozie REST API to update sharelib for spark2 instead of CLI

https://gerrit.wikimedia.org/r/451857

@EBernhardson try now. The -sharelibupdate command has always been very flaky; sometimes it just doesn't work, and I don't know why. This should be automated by puppet, but it didn't work this time. I changed the puppet automation to skip the CLI and use the Oozie REST API directly. This seemed to work!
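
The REST equivalent of the flaky CLI is Oozie's admin update_sharelib endpoint. A sketch of building that call (the host and port are assumptions; 11000 is Oozie's conventional default):

```python
from urllib.parse import urljoin

def update_sharelib_url(base='http://localhost:11000/oozie/'):
    """Build the Oozie v2 admin URL that triggers a sharelib
    refresh, the REST counterpart of `oozie admin -sharelibupdate`."""
    return urljoin(base, 'v2/admin/update_sharelib')

url = update_sharelib_url()
print(url)  # http://localhost:11000/oozie/v2/admin/update_sharelib
# The actual refresh would be a GET against this URL, e.g. with
# urllib.request.urlopen(url); not executed here.
```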