
Upgrade spark 2.3.0 -> 2.3.1 on analytics cluster
Closed, ResolvedPublic3 Estimated Story Points

Description

There is a bug in spark 2.3.0, SPARK-23729, which breaks the mjolnir deployment. There is a workaround deployed, but it would be nice to bump the minor version so we don't need workarounds.

I have verified by running 2.3.1 on the cluster that the bug we are experiencing is indeed fixed. The simplest repro:

  • Create a zip file with anything in it named mjolnir_venv.zip
  • Run pyspark2 --master yarn --archives 'mjolnir_venv.zip#venv'
  • In the python shell run:
import subprocess
print sc.parallelize([1]).map(lambda x: subprocess.check_output(['ls', '-l'])).collect()[0]
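
The `#venv` suffix on the `--archives` argument is a fragment alias: YARN is expected to localize the archive and link it into the container working directory under the name after the `#`. A minimal sketch of that fragment syntax (a hypothetical helper for illustration, not Spark's actual code):

```python
def parse_archive_arg(arg):
    """Split an --archives entry like 'mjolnir_venv.zip#venv' into
    (source archive, link name). Without a '#' fragment, the link
    name defaults to the archive's own file name."""
    if '#' in arg:
        src, alias = arg.split('#', 1)
        return src, alias
    return arg, arg

print(parse_archive_arg('mjolnir_venv.zip#venv'))  # ('mjolnir_venv.zip', 'venv')
print(parse_archive_arg('pyspark.zip'))            # ('pyspark.zip', 'pyspark.zip')
```

The bug in 2.3.0 is that the alias half of this pair is ignored when the link is created in the container.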

On 2.3.0 this prints the following. In particular, notice that we have a mjolnir_venv.zip symlink here, rather than the requested rename to venv:

total 44
-rw-r--r-- 1 yarn yarn  102 Jul 30 22:03 container_tokens                                                           
-rwx------ 1 yarn yarn  732 Jul 30 22:03 default_container_executor_session.sh                                                                        
-rwx------ 1 yarn yarn  786 Jul 30 22:03 default_container_executor.sh                                                                                                                                           
-rwx------ 1 yarn yarn 6597 Jul 30 22:03 launch_container.sh                                                                                                                                                     
lrwxrwxrwx 1 yarn yarn   89 Jul 30 22:03 mjolnir_venv.zip -> /var/lib/hadoop/data/j/yarn/local/usercache/ebernhardson/filecache/26129/mjolnir_venv.zip                                                        
lrwxrwxrwx 1 yarn yarn   92 Jul 30 22:03 py4j-0.10.6-src.zip -> /var/lib/hadoop/data/f/yarn/local/usercache/ebernhardson/filecache/26130/py4j-0.10.6-src.zip                                                     
lrwxrwxrwx 1 yarn yarn   84 Jul 30 22:03 pyspark.zip -> /var/lib/hadoop/data/g/yarn/local/usercache/ebernhardson/filecache/26131/pyspark.zip                                                                     
lrwxrwxrwx 1 yarn yarn   91 Jul 30 22:03 __spark_conf__ -> /var/lib/hadoop/data/h/yarn/local/usercache/ebernhardson/filecache/26132/__spark_conf__.zip  
lrwxrwxrwx 1 yarn yarn   67 Jul 30 22:03 __spark_libs__ -> /var/lib/hadoop/data/f/yarn/local/filecache/129/spark2-assembly.zip
drwx--x--- 2 yarn yarn 4096 Jul 30 22:03 tmp

When run on 2.3.1 (or on versions prior to 2.3.0) we get the following. Notice that here the symlink was appropriately renamed to venv:

-rw-r--r-- 1 yarn yarn  102 Jul 30 22:11 container_tokens
-rwx------ 1 yarn yarn  732 Jul 30 22:11 default_container_executor_session.sh
-rwx------ 1 yarn yarn  786 Jul 30 22:11 default_container_executor.sh
-rwx------ 1 yarn yarn 6706 Jul 30 22:11 launch_container.sh
lrwxrwxrwx 1 yarn yarn   92 Jul 30 22:11 py4j-0.10.7-src.zip -> /var/lib/hadoop/data/l/yarn/local/usercache/ebernhardson/filecache/22771/py4j-0.10.7-src.zip
lrwxrwxrwx 1 yarn yarn   84 Jul 30 22:11 pyspark.zip -> /var/lib/hadoop/data/i/yarn/local/usercache/ebernhardson/filecache/22768/pyspark.zip
lrwxrwxrwx 1 yarn yarn   91 Jul 30 22:11 __spark_conf__ -> /var/lib/hadoop/data/j/yarn/local/usercache/ebernhardson/filecache/22769/__spark_conf__.zip
lrwxrwxrwx 1 yarn yarn  110 Jul 30 22:11 __spark_libs__ -> /var/lib/hadoop/data/k/yarn/local/usercache/ebernhardson/filecache/22770/__spark_libs__2999084826558035159.zip
drwx--x--- 2 yarn yarn 4096 Jul 30 22:11 tmp
lrwxrwxrwx 1 yarn yarn   89 Jul 30 22:11 venv -> /var/lib/hadoop/data/h/yarn/local/usercache/ebernhardson/filecache/22767/mjolnir_venv.zip
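
The difference between the two listings reduces to which link name is present in the container working directory. A small hypothetical check over the symlink names (names below are taken from the listings above):

```python
def alias_honored(link_names, archive='mjolnir_venv.zip', alias='venv'):
    """Given symlink names from the container working directory,
    return True iff the archive was linked under its requested
    alias rather than under its original file name."""
    names = set(link_names)
    return alias in names and archive not in names

# 2.3.0 listing: the alias was ignored.
print(alias_honored(['mjolnir_venv.zip', 'pyspark.zip', '__spark_conf__']))  # False
# 2.3.1 listing: the alias was honored.
print(alias_honored(['venv', 'pyspark.zip', '__spark_conf__']))              # True
```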

Event Timeline

Upgrading the version here will likely be pretty easy... we'll triage this task soon.

Milimetric triaged this task as Medium priority. Aug 2 2018, 3:26 PM
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

deb built:

https://apt.wikimedia.org/wikimedia/pool/main/s/spark2/

Tested in labs with Refine job, works fine. @JAllemandou any objections to upgrading everywhere?

I have not tested but I don't see why it would break :)
Let's go !

Change 451068 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/swap/deploy@master] Use versionless symlink for spark kernels that use py4j

https://gerrit.wikimedia.org/r/451068

Change 451069 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/jupyterhub/deploy@master] Use versionless symlink for spark kernels that use py4j

https://gerrit.wikimedia.org/r/451069

Change 451068 abandoned by Ottomata:
Use versionless symlink for spark kernels that use py4j

Reason:
wrong repo

https://gerrit.wikimedia.org/r/451068

Change 451069 merged by Ottomata:
[analytics/jupyterhub/deploy@master] Use versionless symlink for spark kernels that use py4j

https://gerrit.wikimedia.org/r/451069

Nuria set the point value for this task to 3.

I'm not sure what exactly, but it seems something might have been missed. Starting an oozie workflow with

<property>
    <name>oozie.action.sharelib.for.spark</name>
    <value>spark2.3.1</value>
</property>
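
This property selects which subdirectory of the Oozie sharelib the Spark action loads its jars from; `spark2.3.1` should resolve under the sharelib root mentioned below. A hypothetical sketch of the mapping, with the root and timestamped lib directory taken from this cluster as assumptions:

```python
def sharelib_path(value,
                  root='hdfs://analytics-hadoop/user/oozie/share/lib',
                  lib_dir='lib_20170228165236'):
    """Map an oozie.action.sharelib.for.spark value to the HDFS
    directory Oozie is expected to load jars from (paths are
    assumptions based on this cluster's layout)."""
    return '%s/%s/%s' % (root, lib_dir, value)

print(sharelib_path('spark2.3.1'))
# hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/spark2.3.1
```

The ClassNotFoundException below suggests Oozie was not actually picking up that directory even though it existed.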

I receive (from https://hue.wikimedia.org/oozie/list_oozie_workflow/0051220-180705103628398-oozie-oozi-W/):

2018-08-09 22:35:09,236  WARN SparkActionExecutor:523 - SERVER[analytics1003.eqiad.wmnet] USER[ebernhardson] GROUP[-] TOKEN[] APP[discovery-transfer_to_es-discovery.popularity_score-2018,8,6->cirrussearch-wf] JOB[0051220-180705103628398-oozie-oozi-W] ACTION[0051220-180705103628398-oozie-oozi-W@transfer] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
2018-08-09 22:35:09,236  WARN SparkActionExecutor:523 - SERVER[analytics1003.eqiad.wmnet] USER[ebernhardson] GROUP[-] TOKEN[] APP[discovery-transfer_to_es-discovery.popularity_score-2018,8,6->cirrussearch-wf] JOB[0051220-180705103628398-oozie-oozi-W] ACTION[0051220-180705103628398-oozie-oozi-W@transfer] Launcher exception: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found

Changing back to 2.3.0, it loads fine. I can't really guess at the difference, though. hdfs://analytics-hadoop/user/oozie/share/lib/lib_20170228165236/spark2.3.0 seems to be about the same as the matching spark2.3.1 directory. Maybe there is something in oozie that has to be poked for this to become active?

Change 451857 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use Oozie REST API to update sharelib for spark2 instead of CLI

https://gerrit.wikimedia.org/r/451857

Change 451857 merged by Ottomata:
[operations/puppet@production] Use Oozie REST API to update sharelib for spark2 instead of CLI

https://gerrit.wikimedia.org/r/451857

@EBernhardson try now. The -sharelibupdate command has always been very flaky; sometimes it just doesn't work, and I don't know why. This should be automated by puppet, but it didn't work this time. I changed the puppet automation to skip the CLI and use the Oozie REST API directly. This seemed to work!
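
The REST equivalent of the flaky CLI is Oozie's admin update_sharelib endpoint. A sketch of building that call (the host and port are assumptions; 11000 is Oozie's conventional default):

```python
from urllib.parse import urljoin

def update_sharelib_url(base='http://localhost:11000/oozie/'):
    """Build the Oozie v2 admin URL that triggers a sharelib
    refresh, the REST counterpart of `oozie admin -sharelibupdate`."""
    return urljoin(base, 'v2/admin/update_sharelib')

url = update_sharelib_url()
print(url)  # http://localhost:11000/oozie/v2/admin/update_sharelib
# The actual refresh would be a GET against this URL, e.g. with
# urllib.request.urlopen(url); not executed here.
```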