
Upgrade Spark to 2.4.x
Closed, ResolvedPublic8 Estimated Story Points

Description

I wrote a pandas UDF to solve a problem. In Spark 2.4 there are three types of pandas_udfs: SCALAR, GROUPED_MAP, and GROUPED_AGG.

For the problem I was working on today, using a GROUPED_AGG would have allowed a simpler and more efficient solution, but I had to go with a GROUPED_MAP since we only have Spark 2.3.1. So it would be nice to have the latest version if feasible.
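For illustration, the difference can be sketched with plain pandas (the data and column names here are hypothetical; in Spark these would be pandas_udfs of type GROUPED_AGG and GROUPED_MAP respectively):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})

# GROUPED_AGG style: each group reduces straight to one scalar.
agg = df.groupby("group")["value"].mean()

# GROUPED_MAP style workaround (all Spark 2.3 offers): each group must
# return a whole DataFrame, even when only a scalar is wanted, and the
# per-group results then have to be stitched back together.
rows = []
for key, g in df.groupby("group"):
    rows.append(pd.DataFrame({"group": [key], "value_mean": [g["value"].mean()]}))
mapped = pd.concat(rows, ignore_index=True)
```

Both produce the same per-group means; the GROUPED_MAP route just forces a full DataFrame in and out of every group.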

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ottomata triaged this task as Medium priority.
Ottomata raised the priority of this task from Medium to Needs Triage.
Ottomata moved this task from Incoming to Operational Excellence on the Analytics board.
Ottomata triaged this task as Medium priority.May 2 2019, 5:10 PM

Rats.

Spark 2.4.3 uses Arrow 0.10. pyarrow 0.10 has an issue where it builds against an older (1.10?) numpy version. We have 1.12 from Debian Stretch installed.

We might need to wait for Buster to do this, or backport the Buster numpy and pandas libs for Stretch. I'd imagine that isn't as easy as it sounds, as numpy and pandas have a lot of dependencies.
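The constraint boils down to a simple version check. A minimal sketch (the exact numpy minimum is illustrative, not the real pyarrow requirement):

```python
def version_tuple(v):
    """Turn a dotted version string like "1.12.1" into a comparable tuple."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def numpy_satisfies(installed, minimum):
    """True if the installed numpy version meets the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Stretch ships numpy 1.12; a pyarrow build wanting a newer numpy than
# that (threshold here is an assumption) would fail this check.
```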

@elukey, @MoritzMuehlenhoff it seems like Buster is needed more and more these days. What's the status? :)

This is only a problem with pyarrow; we could upgrade Spark to 2.4.3, but pyarrow wouldn't work.

Ottomata renamed this task from Upgrade Spark to 2.4.2 to Upgrade Spark to 2.4.x.May 17 2019, 7:40 PM

@elukey, @MoritzMuehlenhoff it seems like Buster is needed more and more these days. What's the status? :)

Shouldn't take much longer, maybe 4-6 weeks.

Change 526527 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Release 2.4.3 for Debian Buster

https://gerrit.wikimedia.org/r/526527

Change 526527 abandoned by Ottomata:
Release 2.4.3 for Debian Buster

https://gerrit.wikimedia.org/r/526527

Change 532455 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Release Spark 2.4.3

https://gerrit.wikimedia.org/r/532455

Based on the work in T229347, I built a Spark 2.4.3 .deb and tried it on stat1005. Since spark.yarn.archive is set in spark-defaults.conf to the 2.3.1 version, I need to set this manually:

PYSPARK_PYTHON=python3.7 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 pyspark2 --master yarn --conf spark.yarn.archive=hdfs:///user/spark/share/lib/spark-2.4.3-assembly.zip

For some reason, the pyspark2 numpy test didn't work: the shell started without errors, but the test produced no output. I downgraded back to 2.3.1. We can investigate later.

Ah ha, my previous test didn't work because I hadn't distributed the pyspark 2.4.3 deps anywhere, and it was loading the old ones. My process of shipping the python deps to the worker filesystem with the debs is good, but it would also be nice to have a 'python assembly' that is usable from HDFS. That way we could more easily test upgrades by allowing YARN workers to use deps in HDFS if specified.

Anyway, zipping up the Spark 2.4.3 python3.7 deps and running with Java 8 from stat1005 seems to work. Testing Java 11 will be much harder, unless it is available on all the workers.

Spark 2.4.3 also works just fine in YARN from stat1007 with Java 8 and Python 3.5.

I'm going to build a new .deb that includes zipped up python dependencies as a file, that can be used as --archives.
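A minimal sketch of what "zipped up python dependencies" could look like (paths and names are hypothetical; the real packaging happens inside the .deb build):

```python
import os
import zipfile

def zip_python_deps(deps_dir, out_zip):
    """Bundle an installed dependency tree into a zip archive that YARN
    can unpack on worker nodes when it is passed via --archives."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(deps_dir):
            for name in files:
                path = os.path.join(root, name)
                # store paths relative to deps_dir so the archive root is clean
                zf.write(path, os.path.relpath(path, deps_dir))
    return out_zip
```

The resulting archive could then be referenced with something like `--archives deps.zip#deps` on submission (flag behavior assumed from standard Spark-on-YARN usage).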

@JAllemandou let's make this happen! The .deb is ready to go! :)

Change 532455 merged by Ottomata:
[operations/debs/spark2@debian] Release Spark 2.4.3

https://gerrit.wikimedia.org/r/532455

Confirmed Spark 2.4.4 works with Refine in local mode with existing refinery-job jar (compiled with 2.3.1) and a new refinery-job compiled with 2.4.4.

Running in YARN seems to fail, I think due to mismatched Spark shuffle service versions.

Change 542225 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Spark 2.4.4 release

https://gerrit.wikimedia.org/r/542225

Change 542226 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Bump spark.version to Spark 2.4.4 in pom.xml

https://gerrit.wikimedia.org/r/542226

@JAllemandou what do we need to do to test other Refinery jobs, mostly just test mw history somehow?

There are a couple of jobs I'd like to check (mediawiki-history, checker, mobile-app-session jobs and wikidata jobs). If we can't run in yarn, I'll do smaller versions in local.

@elukey mind if we upgrade to Spark 2.4.4 in the analytics test cluster and do some tests there?

Please go ahead :)

Change 543474 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Upload versioned Spark assembly file to HDFS

https://gerrit.wikimedia.org/r/543474

Change 543474 merged by Ottomata:
[operations/puppet@production] Upload versioned Spark assembly file to HDFS

https://gerrit.wikimedia.org/r/543474

Change 543690 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Strip newline off of spark version in fact

https://gerrit.wikimedia.org/r/543690

Change 543690 merged by Ottomata:
[operations/puppet@production] Strip newline off of spark version in fact

https://gerrit.wikimedia.org/r/543690

Ah right, we have an hdfs keytab on hadoop masters and workers, just not on the coordinator.

I ran

sudo -u hdfs /usr/local/bin/kerberos-run-command hdfs ./spark2_upload_assembly.sh

on analytics1038 and it worked. I can't run this on a hadoop master since Spark isn't installed there. We should either pick a single worker node on which to run this command (by setting profile::hadoop::spark2::install_assembly in hiera), or create a spark keytab for the coordinator and run the command as spark.

We just need an hdfs dfs -put command to be able to write to /user/spark/share/lib.

@JAllemandou FYI Spark is upgraded to 2.4.4 in the test cluster. The shuffle service restart was applied there too.

Tests ran successfully (one change was needed in mediawiki-history, as Spark 2.4 has built-in Avro support instead of going through the external package):

  • Local mode on smaller data from prod cluster
    • mediawiki-history
    • wikidata-articleplaceholder
    • MobileAppsSessions
  • Yarn mode on test cluster
    • sql queries
    • Parquet queries
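The mediawiki-history change mentioned above amounts to switching the Avro data source name. A hypothetical helper showing the two format strings:

```python
def avro_format(spark_version):
    """Spark >= 2.4 includes the spark-avro module in the Spark project
    itself (format name "avro"); earlier versions go through the
    external Databricks package."""
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return "avro" if (major, minor) >= (2, 4) else "com.databricks.spark.avro"
```

So a read like `spark.read.format(avro_format("2.4.4"))` would pick the built-in source under 2.4.x and the external package name under 2.3.x.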

It all looks good except for logging on test cluster being too verbose :)

Change 542225 merged by Ottomata:
[operations/debs/spark2@debian] Spark 2.4.4 release

https://gerrit.wikimedia.org/r/542225

Change 548488 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade

https://gerrit.wikimedia.org/r/548488

Change 548494 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update oozie jobs to use spark 2.4.4

https://gerrit.wikimedia.org/r/548494

Change 542226 merged by jenkins-bot:
[analytics/refinery/source@master] Bump spark.version to Spark 2.4.4

https://gerrit.wikimedia.org/r/542226

Change 548494 merged by Joal:
[analytics/refinery@master] Update oozie jobs to use spark 2.4.4

https://gerrit.wikimedia.org/r/548494

Change 548867 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Bump clickstream jar version for spark 2.4.4

https://gerrit.wikimedia.org/r/548867

Change 548867 merged by Ottomata:
[analytics/refinery@master] Bump clickstream jar version for spark 2.4.4

https://gerrit.wikimedia.org/r/548867

Mentioned in SAL (#wikimedia-analytics) [2019-11-05T20:12:08Z] <ottomata> stopped refine jobs for Spark 2.4 upgrade - T222253

Mentioned in SAL (#wikimedia-analytics) [2019-11-05T20:21:23Z] <ottomata> install spark 2.4.4-bin-hadoop2.6-1 cluster wide using debdeploy - T222253

Change 548488 merged by Ottomata:
[operations/puppet@production] Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade

https://gerrit.wikimedia.org/r/548488

Ottomata set the point value for this task to 8.