
Rebuild spark2 for Debian Buster
Closed, Resolved. Public. 8 Estimated Story Points

Description

The spark2 deb package needs to be rebuilt for Debian Buster (the first host that needs it is stat1005).

Event Timeline

elukey triaged this task as Medium priority. Jul 30 2019, 3:24 PM
elukey created this task.

Change 526527 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Release 2.4.3 for Debian Buster

https://gerrit.wikimedia.org/r/526527

Since we've also got T222253: Upgrade Spark to 2.4.x waiting for buster, I went ahead and built a buster spark at 2.4.3.

I've dpkg -i installed this on stat1005. Please test and see. pyarrow stuff looks ok:

pyspark2
...

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Apr  3 2019 05:39:12)
SparkSession available as 'spark'.

In [1]: import pyarrow

In [2]: pyarrow.__version__
Out[2]: '0.11.0'

Luca, do you think we should put this in apt for buster? Even if we start upgrading Hadoop workers to Buster and install this package, spark jobs should continue to use either the version of spark installed where the job is launched, or the version of spark in spark-assembly.zip in HDFS (which we have to update manually anyway).

Even if we start upgrading Hadoop workers to Buster and install this package, spark jobs should continue to use either the version of spark installed where the job is launched, or the version of spark in spark-assembly.zip in HDFS

T222254 makes me realize this is not entirely true! It is true of all .jar dependencies, but not true of the python ones! The python deps are not in the spark-assembly.zip, nor are they auto shipped when doing pyspark2 --master yarn, so whatever is available on the workers is what is used, I think. Grr.

In order to make a cluster buster (oh now that's fun!) upgrade easier, I'll also build a Spark 2.3.0 .deb for buster.

@elukey btw, it looks like the default Java on Buster is Java 11! This doesn't work with Spark. openjdk-8 is still available tho, so if I install that and set JAVA_HOME appropriately, Spark works.

Another gotcha:

Exception: Python in worker has different version 3.5 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I think we can't use Spark in Buster without upgrading the whole cluster at once.

Another gotcha:

Exception: Python in worker has different version 3.5 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I think we can't use Spark in Buster without upgrading the whole cluster at once.

Maybe we could use the pyall stretch component (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480041/) to add python3.7 to the stretch workers and clients, and then see if it works with a buster client node... If so, we could proceed with an incremental buster migration (in theory).

Oh, cool! This makes all 3 of those py3 versions available? Yeah for sure we should do this. As long as default python3 stays the same, this is safe! Awesome!

@Ottomata there is a caveat though: all the python libraries are only available in the versions packaged in Debian (built for the default interpreter), so changing the version of the interpreter might be a problem... We'll need to test!

Hm indeed. For the Spark case, we might need to also include all the binary deps of pyarrow (e.g. numpy) for each python version in the deb package. I.e. include both py35 and py37 wheels, and then somehow select the proper deps when running a particular python version. Pretty gross.

Change 528492 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install pyall package on analytics cluster

https://gerrit.wikimedia.org/r/528492

Change 528492 merged by Ottomata:
[operations/puppet@production] Install python3.7 package on analytics cluster

https://gerrit.wikimedia.org/r/528492

Change 528509 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Declare python3.7 package explicitly to avoid circular dependency

https://gerrit.wikimedia.org/r/528509

Change 528509 merged by Ottomata:
[operations/puppet@production] Declare python3.7 package explicitly to avoid circular dependency

https://gerrit.wikimedia.org/r/528509

Writing this down for future me:

I have 2 goals.

  1. Get Spark scala and pyspark 2.3.0 with pyarrow to work in a heterogeneous Stretch + Buster environment.
  2. Get Spark 2.4.3 to work (in Buster).

To do either of these, I need to be able to launch Spark or PySpark with bundled dependencies from the client environment. We won't be able to rely on the proper dependencies being deployed to all nodes. I've got python 3.7 on all nodes now, so if I can run pyspark with pyarrow and numpy from a Buster node (stat1005), I will have achieved my goal.

For just Spark Scala, this actually is pretty easy. Everything should work as is now with Spark 2.3.0.

For PySpark, I need to set spark.pyspark.python=/usr/bin/python3.7, and then ship all dependencies from the client. The dependencies installed on stretch will be for a different python version than the ones installed for Buster.
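For illustration, roughly the same setup expressed programmatically instead of via pyspark2 flags (just a sketch; the archive path and the 'deps' name are placeholders, not anything that exists in the packaging):

# Sketch only: programmatic equivalent of shipping client-side python deps
# to the YARN executors. Paths and the archive name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('yarn')
    .appName('py37-deps-test')
    # Use python3.7 on the executors (must match the driver's minor version).
    .config('spark.pyspark.python', '/usr/bin/python3.7')
    # Ship a zip of the client-side deps; '#deps' is the directory name the
    # archive is unpacked to in each executor's working dir.
    .config('spark.yarn.dist.archives', 'file:///path/to/deps.zip#deps')
    # Put the unpacked deps at the front of the executors' PYTHONPATH.
    .config('spark.executorEnv.PYTHONPATH', 'deps')
    .getOrCreate()
)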

Change 528588 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install libpython3.7 with python3.7 on analytics nodes

https://gerrit.wikimedia.org/r/528588

Hm, more thoughts. Even if I figure out ^, I'm not sure it allows for an incremental upgrade to Buster.

If we upgrade workers first, the Buster workers will only work with python3.7. Clients that submit pyspark jobs using python3.5 from stretch will fail on the Buster workers. Clients that submit pyspark jobs using python3.7 (from Buster or from pyall component) will fail on Stretch workers.

If we upgrade clients first, mayyyybe I can get something working where the python3.7 dependencies are always shipped along with the spark job to workers, and python3.7 is used on Stretch nodes from pyall component. I'm working on this now, but if it even works, I think I'll only be able to automatically ship the direct dependencies of pyspark (e.g. pyarrow, numpy, etc.). I don't want to ship custom python3.7 versions of the packages we are already deploying via apt in e.g. profile::analytics::cluster::packages::common. If the pyspark yarn jobs that folks run use any of these dependencies (e.g. python3-sklearn, or python3-requests, etc.) they'll have to download and package their wheels for python3.7 themselves and ship them.

Change 528588 merged by Ottomata:
[operations/puppet@production] Install libpython3.7 with python3.7 on analytics nodes

https://gerrit.wikimedia.org/r/528588

mayyyybe I can get something working where the python3.7 dependencies are always shipped along with the spark job to workers

JAW DROP...I got this to work.

  1. Get all dependent wheels using pip. pip wheel --wheel-dir ./wheeldir pyspark[sql]==2.3.1
  2. Remove ./wheeldir/pyspark*.whl (we don't need this, as pyspark is none-any and the 2.3.1 version from the spark2 dist is the same everywhere)
  3. Add .whl files in debian/extra/python
  4. Build buster version of spark2 debian package
  5. Install buster version of spark2 debian package on a buster client node (stat1005)
  6. Zip up all python deps: cd /usr/lib/spark2/python; zip -r ~/pyspark-assembly.zip ./*
  7. Launch pyspark2 with:
pyspark2 --master yarn \
  --conf "spark.pyspark.python=/usr/bin/python3.7" \
  --conf 'spark.executorEnv.PYTHONPATH=pyspark-assembly' \
  --archives pyspark-assembly.zip#pyspark-assembly

Test with

# test numpy:
import numpy as np
rdd = sc.parallelize([np.array([1,2,3]), np.array([1,2,3])], numSlices=2)
rdd.reduce(lambda x,y: np.dot(x,y))

# test pyarrow:
import pyspark.sql.functions as F
df = spark.range(0, 1000).withColumn('id', (F.col('id') / 100).cast('integer')).withColumn('v', F.rand())

@F.pandas_udf(df.schema, F.PandasUDFType.GROUPED_MAP)
def pandas_subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df2 = df.groupby('id').apply(pandas_subtract_mean)
df2.show()

I have to use --archives rather than --files or --py-files because, even though python can import modules from .zip files, it can't do so for modules with C extension .so files. --archives unzips the file, and then we insert it into the beginning of the executor's PYTHONPATH.
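One way to sanity-check that the executors really import the shipped copies (rather than whatever happens to be installed on the worker) is to ask them where the modules came from; a quick sketch, run from the same pyspark2 session started above (so sc already exists):

# Sketch: report python version and module locations from the executors.
# The numpy/pyarrow paths should point into the unpacked 'pyspark-assembly'
# directory in each executor's working dir, not the system site-packages.
def where_are_deps(_):
    import sys
    import numpy
    import pyarrow
    return {
        'python': sys.version.split()[0],
        'numpy': numpy.__file__,
        'pyarrow': pyarrow.__file__,
    }

for info in sc.parallelize(range(2), 2).map(where_are_deps).collect():
    print(info)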

I think I can wrap up the 'pyspark assembly' building steps into the debian packaging. Then we can deploy a temporary pyspark wrapper that will add these options until the cluster is fully upgraded to buster.

I think it's probably good to include all these dependency .whl files (not just pyarrow) with the spark2 packaging anyway. Just having them installed in /usr/lib/spark2/python puts them in Spark's PYTHONPATH by default, overriding the Debian-installed python packages.

Whoa wait a minute. If I can do ^, I can upgrade to Spark 2.4.3 before Buster, since I'm not relying on the Debian-packaged python dependencies anymore... I might even be able to build the spark2 package with the python deps for both python3.5 and python3.7 and just set PYTHONPATH accordingly. E.g. have /usr/lib/spark2/{python3.5,python3.7}, and then symlink /usr/lib/spark2/python to the default python for the dist. Then, if running with spark.pyspark.python=python3.7 on Stretch, we'd just set spark.executorEnv.PYTHONPATH=/usr/lib/spark2/python3.7.

HM! Will try tomorrow.
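For reference, a rough sketch of how a wrapper could pick the matching per-version dependency dir for the executors; the /usr/lib/spark2/python3.X layout is only the idea described above, nothing that exists yet:

# Sketch of the proposed layout above (hypothetical, not implemented):
# choose the executor deps dir that matches the driver's python version.
import os
import sys

py = 'python{}.{}'.format(*sys.version_info[:2])    # e.g. 'python3.7'
deps_dir = os.path.join('/usr/lib/spark2', py)      # e.g. /usr/lib/spark2/python3.7

extra_conf = [
    '--conf', 'spark.pyspark.python=/usr/bin/{}'.format(py),
    '--conf', 'spark.executorEnv.PYTHONPATH={}'.format(deps_dir),
]
print(' '.join(extra_conf))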

(not sure why this comment went here from email... )

Thanks Andrew!
Kate

Kate Zimmerman (she/they)
Head of Product Analytics
Wikimedia Foundation

Change 526527 abandoned by Ottomata:
Release 2.4.3 for Debian Buster

https://gerrit.wikimedia.org/r/526527

Ok! I've uploaded spark2_2.3.1-bin-hadoop2.6-4_all.deb to apt for both buster and stretch, and installed on stat1005 (buster). This seems to be working great there.

It won't work in YARN mode until spark2_2.3.1-bin-hadoop2.6-4_all.deb is also installed on all workers. This should be harmless, as it is the exact same Spark version they currently have, just with updated python dependencies. If it affects anything at all, it'll only affect pyspark users that use numpy, pandas, or pyarrow, since the versions of those packages have been upgraded.

@elukey since I'm leaving tomorrow, I'll leave it up to you if you want to install this package on the cluster. It *should* be harmless. (But you know...famous last words)

Ottomata changed the point value for this task from 0 to 8. Aug 7 2019, 9:18 PM

Interesting issue: python3-tk seems to require python3.5, forcing apt to uninstall python3.7 and libpython3.7, which puppet then tries to add back. From git blame, we added the package because:

Install python(3)-tk so that Jupyter can render charts with matplotlib

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/444735/

Change 530556 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::common: temp remove python3-tk

https://gerrit.wikimedia.org/r/530556

Change 530556 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::common: temp remove python3-tk

https://gerrit.wikimedia.org/r/530556

Change 532403 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use oozie spark sharelib instead of one from spark 1 package

https://gerrit.wikimedia.org/r/532403

Weird! Pretty dumb, it is because python3-tk has Depends: python3 (>= 3.5), python3 (<< 3.6). Even though python3.5 is available, apt is removing the python3.7 from pyall.

I think removing python3-tk is fine for now.

I'll go ahead and proceed with the updated spark 2.3 package install.

Mentioned in SAL (#wikimedia-analytics) [2019-08-26T19:06:56Z] <ottomata> update spark2 package to -4 version with support for python3.7 across cluster. T229347

I installed spark2 2.3.1-bin-hadoop2.6-4 everywhere, and now the numpy and pyarrow/pandas test in yarn works from Stretch with python 3.5 and 3.7, and in Buster. Unfortunately, as discovered in T231067, Buster no longer has a Java 8 package available, and spark2 for now is only compatible with Java 8. So, even though this works from stat1005 (where I installed Java 8 a while ago), it will not work on Buster in general until we resolve any Java 11 + Spark 2 issues. I will make another task for that.

To use pyspark2 in yarn from stat1005:

PYSPARK_PYTHON=python3.7 JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64  pyspark2 --master yarn

PYSPARK_PYTHON=python3.7 must be set so any worker executors know to load the proper python3.7 libs. JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 must be set to ensure that spark2 is launched using Java 8.
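To save typing, those two settings could also live in a tiny wrapper; a sketch only (this is a hypothetical script, not something deployed by the package):

#!/usr/bin/env python3
# Hypothetical wrapper sketch: force Java 8 and python3.7 before handing
# off to pyspark2, so the driver and executors agree on both runtimes.
import os
import sys

env = dict(os.environ)
env['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'   # Spark 2 needs Java 8
env['PYSPARK_PYTHON'] = 'python3.7'                      # match executor python

os.execvpe('pyspark2', ['pyspark2', '--master', 'yarn'] + sys.argv[1:], env)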

Change 532403 merged by Ottomata:
[operations/puppet@production] Check that oozie is installed (not spark 1) for installing sharelib

https://gerrit.wikimedia.org/r/532403

Change 602386 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/spark2@debian] Default PYSPARK_PYTHON to exact versioned python executable used on driver.

https://gerrit.wikimedia.org/r/602386

Change 602386 merged by Ottomata:
[operations/debs/spark2@debian] Default PYSPARK_PYTHON to exact versioned python executable used on driver.

https://gerrit.wikimedia.org/r/602386
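The idea behind that change, roughly (a sketch of the concept, not the actual code shipped in the spark2 packaging):

# Sketch: default PYSPARK_PYTHON to the exact versioned executable of the
# driver's interpreter, so executors pick the same minor version by default.
import os
import sys

default_python = 'python{}.{}'.format(*sys.version_info[:2])   # e.g. 'python3.7'
os.environ.setdefault('PYSPARK_PYTHON', default_python)
print(os.environ['PYSPARK_PYTHON'])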

Hi all, I think there may be a new variant of this issue. an-test-worker1001 is now running bullseye, which uses python3.9 (not 3.7), and is currently failing puppet with the following error:

Change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install python3.7' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
Package python3.7 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'python3.7' has no installation candidate

	/etc/puppet/modules/profile/manifests/python37.pp:16
BTullis subscribed.

Hi @jbond - Many thanks for the heads-up.
We've been working on the upgraded an-test-worker1001 to try to work out what needs to be adapted in order to support Hadoop on bullseye, so apologies for the persistent puppet errors.

In the end we decided to:

I've been reaching out to the people who used these packages, preparing them for the upgrade to python 3.9 when the bullseye upgrade on the test cluster arrives.

We've still got a few issues to work through, but we're getting there. In the meantime, I'll close this ticket again.