
Make Spark 2.1 easily available on new CDH5.10 cluster
Closed, Resolved · Public · 8 Estimated Story Points

Description

Spark 2+ is a real improvement over 1.6; it'd be great if we could have it available and gently move our jobs to the new APIs.

Loose end TODOs:

  • remove spark2-beeline
  • spark-sql logging is too verbose with the provided log4j.properties
  • Make spark2 use hadoop native libs
  • Make a spark2 assembly jar and put it in HDFS
  • Wikitech documentation
  • email announcement

Event Timeline

+1, I betcha we could just load the jars into HDFS and have a special wrapper script to use them. MAYBE. :)
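That "jars in HDFS plus a wrapper script" idea could be sketched roughly as below. This is a minimal sketch only: the SPARK2_HOME layout, the HDFS archive path, and the wrapper name are all assumptions for illustration, not the actual deployed setup.

```shell
#!/bin/sh
# Hypothetical spark2-submit wrapper. It points YARN at a Spark 2 assembly
# archive staged in HDFS, so worker nodes need no local Spark 2 install.
# SPARK2_HOME and the HDFS path are illustrative defaults, not real config.
spark2_submit() {
    spark2_home="${SPARK2_HOME:-/usr/lib/spark2}"
    spark2_assembly="${SPARK2_ASSEMBLY:-hdfs:///user/spark/share/lib/spark2-assembly.zip}"
    "${spark2_home}/bin/spark-submit" \
        --conf spark.yarn.archive="${spark2_assembly}" \
        "$@"
}
```

A real deployment would also want matching wrappers for the shell, pyspark, and R entry points, which is roughly what the Debian package below ends up providing.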

Nuria triaged this task as Medium priority. Mar 20 2017, 4:07 PM
Nuria added a subscriber: Nuria.

This will help us solve the oozie-hive issues with HiveContext (currently we are working around those).

We have work ongoing in T162912 that was initially built against 2.1 with the eventual intent of productization. It uses MLlib, which had several breaking changes from 1.6 to 2.0. I am currently unsure of the exact impact of having to back-port the already-written code.

All versions < 2.2 are affected by a security issue; fixing that will also be part of the value of upgrading.

Ideally we will get this upgrade with the new Cloudera distribution.

Discussed in standup 2017-08-31: Let's use scap to deploy spark-2.1.1 release folder (with small changes in config for logging and hadoop-conf setting) on stat100[345] and analytics1003 (for prod jobs).

fdans set the point value for this task to 8. Oct 5 2017, 4:30 PM
fdans moved this task from Operational Excellence Future to Backlog (Later) on the Analytics board.

Change 387663 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Initial debian release (2.1.2-bin-hadoop2.6-1)

https://gerrit.wikimedia.org/r/387663

Change 387680 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install Spark 2 in Hadoop Cluster

https://gerrit.wikimedia.org/r/387680

Change 387663 merged by Ottomata:
[operations/debs/spark2@debian] Initial debian release (2.1.2-bin-hadoop2.6-1)

https://gerrit.wikimedia.org/r/387663

Change 387680 merged by Ottomata:
[operations/puppet@production] Install Spark 2 for Hadoop clients

https://gerrit.wikimedia.org/r/387680

OOooOOO boy!

[@stat1005:/home/otto] $ ls /usr/bin/*spark2* | cat
/usr/bin/pyspark2
/usr/bin/spark2-beeline
/usr/bin/spark2R
/usr/bin/spark2-shell
/usr/bin/spark2-sql
/usr/bin/spark2-submit

I betcha there will be other things that pop up. But, I think I did it!

Remaining TODOs:

  • remove spark2-beeline(?)
  • spark-sql logging is too verbose with the provided log4j.properties
  • Make spark2 use hadoop native libs(?)
  • Make a spark2 assembly jar and put it in HDFS
  • Wikitech documentation
  • email announcement
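For the spark-sql verbosity item, the usual fix is to raise the console threshold in the shipped log4j.properties. A sketch of what that change might look like, following Spark's own log4j.properties.template conventions — the exact logger categories chosen here are assumptions:

```properties
# Quiet the console: only WARN and above from Spark itself.
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Keep the spark-sql/spark-shell REPL's own output visible.
log4j.logger.org.apache.spark.repl.Main=WARN
```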

Change 390435 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] 2.1.2-2 release for Hadoop 2.6

https://gerrit.wikimedia.org/r/390435

Change 390435 merged by Ottomata:
[operations/debs/spark2@debian] 2.1.2-2 release for Hadoop 2.6

https://gerrit.wikimedia.org/r/390435

Ottomata moved this task from Done to In Code Review on the Analytics-Kanban board.
Ottomata updated the task description.

Change 391028 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install spark2 on Hadoop workers for use with Oozie

https://gerrit.wikimedia.org/r/391028

Change 391028 merged by Ottomata:
[operations/puppet@production] Install spark2 on Hadoop workers for use with Oozie

https://gerrit.wikimedia.org/r/391028