
Upgrade Spark to 2.4.x
Closed, ResolvedPublic8 Estimated Story Points

Description

I wrote a pandas UDF to solve a problem. In Spark 2.4 there are three types of pandas_udfs: SCALAR, GROUPED_MAP, and GROUPED_AGG.

For the problem I was working on today, using a GROUPED_AGG would have allowed a simpler and more efficient solution, but I had to go with a GROUPED_MAP since we only have Spark 2.3.1. So it would be nice to have the latest version if feasible.
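For illustration, the difference can be sketched with plain pandas (the data and column names here are hypothetical; in Spark these would be pandas_udfs of type GROUPED_AGG and GROUPED_MAP respectively):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})

# GROUPED_AGG style: each group reduces straight to one scalar.
agg = df.groupby("group")["value"].mean()

# GROUPED_MAP style workaround (all Spark 2.3 offers): each group must
# return a whole DataFrame, even when only a scalar is wanted, and the
# per-group results then have to be stitched back together.
rows = []
for key, g in df.groupby("group"):
    rows.append(pd.DataFrame({"group": [key], "value_mean": [g["value"].mean()]}))
mapped = pd.concat(rows, ignore_index=True)
```

Both produce the same per-group means; the GROUPED_MAP route just forces a full DataFrame in and out of every group.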

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ottomata triaged this task as Medium priority.
Ottomata raised the priority of this task from Medium to Needs Triage.
Ottomata moved this task from Incoming to Operational Excellence on the Analytics board.
Ottomata triaged this task as Medium priority.May 2 2019, 5:10 PM

Rats.

Spark 2.4.3 uses Arrow 0.10. pyarrow 0.10 has an issue where it builds against an older (1.10?) numpy version. We have 1.12 from Debian Stretch installed.

We might need to wait for Buster to do this, or backport the Buster numpy and pandas libs for Stretch. I'd imagine that isn't as easy as it sounds, as numpy and pandas have a lot of dependencies.
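The constraint boils down to a simple version check. A minimal sketch (the exact numpy minimum is illustrative, not the real pyarrow requirement):

```python
def version_tuple(v):
    """Turn a dotted version string like "1.12.1" into a comparable tuple."""
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def numpy_satisfies(installed, minimum):
    """True if the installed numpy version meets the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# Stretch ships numpy 1.12; a pyarrow build wanting a newer numpy than
# that (threshold here is an assumption) would fail this check.
```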

@elukey, @MoritzMuehlenhoff it seems like Buster is needed more and more these days. What's the status? :)

This is only a problem with pyarrow; we could upgrade Spark to 2.4.3, but pyarrow wouldn't work.

Ottomata renamed this task from Upgrade Spark to 2.4.2 to Upgrade Spark to 2.4.x.May 17 2019, 7:40 PM

@elukey, @MoritzMuehlenhoff it seems like Buster is needed more and more these days. What's the status? :)

Shouldn't take much longer, maybe 4-6 weeks.

Change 526527 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Release 2.4.3 for Debian Buster

https://gerrit.wikimedia.org/r/526527

Change 526527 abandoned by Ottomata:
Release 2.4.3 for Debian Buster

https://gerrit.wikimedia.org/r/526527

Change 532455 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Release Spark 2.4.3

https://gerrit.wikimedia.org/r/532455

Based on the work in T229347, I built a Spark 2.4.3 .deb and tried it on stat1005. Since spark.yarn.archive is set in spark-defaults.conf to the 2.3.1 version, I need to set this manually:

PYSPARK_PYTHON=python3.7 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 pyspark2 --master yarn --conf spark.yarn.archive=hdfs:///user/spark/share/lib/spark-2.4.3-assembly.zip

For some reason, the pyspark2 numpy test didn't work: the shell started without errors, but the test produced no output. I downgraded back to 2.3.1. We can investigate later.

Ah ha, my previous test didn't work because I hadn't distributed the pyspark 2.4.3 deps anywhere, and it was loading the old ones. My process of shipping the python deps to the worker filesystem with the debs is good, but it would also be nice to have a 'python assembly' that is usable from HDFS. That way we could more easily test upgrades by allowing YARN workers to use deps in HDFS if specified.

Anyway, zipping up the Spark 2.4.3 python3.7 deps and running with Java 8 from stat1005 seems to work. Testing Java 11 will be much harder, unless it is available on all the workers.

Spark 2.4.3 also works just fine in YARN from stat1007 with Java 8 and Python 3.5.

I'm going to build a new .deb that includes zipped up python dependencies as a file, that can be used as --archives.
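A minimal sketch of what "zipped up python dependencies" could look like (paths and names are hypothetical; the real packaging happens inside the .deb build):

```python
import os
import zipfile

def zip_python_deps(deps_dir, out_zip):
    """Bundle an installed dependency tree into a zip archive that YARN
    can unpack on worker nodes when it is passed via --archives."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(deps_dir):
            for name in files:
                path = os.path.join(root, name)
                # store paths relative to deps_dir so the archive root is clean
                zf.write(path, os.path.relpath(path, deps_dir))
    return out_zip
```

The resulting archive could then be referenced with something like `--archives deps.zip#deps` on submission (flag behavior assumed from standard Spark-on-YARN usage).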

@JAllemandou let's make this happen! The .deb is ready to go! :)

Change 532455 merged by Ottomata:
[operations/debs/spark2@debian] Release Spark 2.4.3

https://gerrit.wikimedia.org/r/532455

Confirmed Spark 2.4.4 works with Refine in local mode with existing refinery-job jar (compiled with 2.3.1) and a new refinery-job compiled with 2.4.4.

Running in YARN seems to fail, I think due to mismatched Spark shuffle service versions.

Change 542225 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Spark 2.4.4 release

https://gerrit.wikimedia.org/r/542225

Change 542226 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Bump spark.version to Spark 2.4.4 in pom.xml

https://gerrit.wikimedia.org/r/542226

@JAllemandou what do we need to do to test other Refinery jobs, mostly just test mw history somehow?

There are a couple of jobs I'd like to check (mediawiki-history, checker, mobile-app-session jobs and wikidata jobs). If we can't run in yarn, I'll do smaller versions in local.

@elukey mind if we upgrade to Spark 2.4.4 in the analytics test cluster and do some tests there?

Please go ahead :)

Change 543474 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Upload versioned Spark assembly file to HDFS

https://gerrit.wikimedia.org/r/543474

Change 543474 merged by Ottomata:
[operations/puppet@production] Upload versioned Spark assembly file to HDFS

https://gerrit.wikimedia.org/r/543474

Change 543690 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Strip newline off of spark version in fact

https://gerrit.wikimedia.org/r/543690

Change 543690 merged by Ottomata:
[operations/puppet@production] Strip newline off of spark version in fact

https://gerrit.wikimedia.org/r/543690

Ah right, we have an hdfs keytab on hadoop masters and workers, just not on the coordinator.

I ran

sudo -u hdfs /usr/local/bin/kerberos-run-command hdfs ./spark2_upload_assembly.sh

on analytics1038 and it worked. I can't run this on a hadoop master since Spark isn't installed there. We should either pick a single worker node on which to run this command (by setting profile::hadoop::spark2::install_assembly in hiera), or create a spark keytab for the coordinator and run the command as spark.

We just need an hdfs dfs -put command to be able to write to /user/spark/share/lib.

@JAllemandou FYI Spark is upgraded to 2.4.4 in the test cluster. The shuffle service restart was applied there too.

Tests ran successfully (one change was needed in mediawiki-history, as Spark 2.4 has built-in Avro support instead of going through the external package):

  • Local mode on smaller data from prod cluster
    • mediawiki-history
    • wikidata-articleplaceholder
    • MobileAppsSessions
  • Yarn mode on test cluster
    • sql queries
    • Parquet queries
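The mediawiki-history change mentioned above amounts to switching the Avro data source name. A hypothetical helper showing the two format strings:

```python
def avro_format(spark_version):
    """Spark >= 2.4 includes the spark-avro module in the Spark project
    itself (format name "avro"); earlier versions go through the
    external Databricks package."""
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return "avro" if (major, minor) >= (2, 4) else "com.databricks.spark.avro"
```

So a read like `spark.read.format(avro_format("2.4.4"))` would pick the built-in source under 2.4.x and the external package name under 2.3.x.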

It all looks good except for logging on test cluster being too verbose :)

Change 542225 merged by Ottomata:
[operations/debs/spark2@debian] Spark 2.4.4 release

https://gerrit.wikimedia.org/r/542225

Change 548488 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade

https://gerrit.wikimedia.org/r/548488

Change 548494 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update oozie jobs to use spark 2.4.4

https://gerrit.wikimedia.org/r/548494

Change 542226 merged by jenkins-bot:
[analytics/refinery/source@master] Bump spark.version to Spark 2.4.4

https://gerrit.wikimedia.org/r/542226

Change 548494 merged by Joal:
[analytics/refinery@master] Update oozie jobs to use spark 2.4.4

https://gerrit.wikimedia.org/r/548494

Change 548867 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Bump clickstream jar version for spark 2.4.4

https://gerrit.wikimedia.org/r/548867

Change 548867 merged by Ottomata:
[analytics/refinery@master] Bump clickstream jar version for spark 2.4.4

https://gerrit.wikimedia.org/r/548867

Mentioned in SAL (#wikimedia-analytics) [2019-11-05T20:12:08Z] <ottomata> stopped refine jobs for Spark 2.4 upgrade - T222253

Mentioned in SAL (#wikimedia-analytics) [2019-11-05T20:21:23Z] <ottomata> install spark 2.4.4-bin-hadoop2.6-1 cluster wide using debdeploy - T222253

Change 548488 merged by Ottomata:
[operations/puppet@production] Bump refinery-job versions to 0.0.105 for Spark 2.4.4 upgrade

https://gerrit.wikimedia.org/r/548488

Ottomata set the point value for this task to 8.