Pyspark on SWAP: Py4JJavaError: Import Error: no module named pyarrow
Closed, Resolved · Public · 3 Estimated Story Points · BUG REPORT

Description

I'm getting a weird error using pyspark in SWAP.

I think it may be related to using a UDF in my code. See T222253.
The problem might be that pyarrow isn't installed on the worker nodes, even though it is available in the notebook.

This Stack Overflow thread seems relevant: https://stackoverflow.com/questions/51084514/apply-function-per-group-in-pyspark-pandas-udf-no-module-named-pyarrow

Steps to Reproduce:
Run the following Jupyter notebook on notebook1004 through cell 28.

/user/nathante/notebooks/Bias_analysis_spark.ipynb

Actual Results:
Py4JJavaError: Import Error: no module named pyarrow

Expected Results:
Print the results of my query.

Event Timeline

Restricted Application added a subscriber: Aklapper.

https://phabricator.wikimedia.org/T202812 is related to a previous request, maybe it helps! Going to check tomorrow :)

@elukey, thanks. It seems like I'm experiencing a regression then. I can work around it for now. See you tomorrow!

Via pyspark2 it seems to work:

elukey@stat1004:~$ pyspark2 --master yarn
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 3.5.3 (default, Sep 27 2018 17:25:39)
SparkSession available as 'spark'.

In [1]: import pyarrow

In [2]:

@elukey : in pyspark, an import only runs code on the driver, meaning the local machine in the case of a shell (stat1004 in the example you gave).
Using pyarrow in the executors can still fail if the lib is not present on the workers :)

I didn't know that, thanks for the explanation. Is there a quick test that we can do to see if it works or not?
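
A quick way to check from the driver is to run a tiny job that attempts the import on the executors. This is only a minimal sketch, assuming the pyspark2 shell's spark session and a throwaway helper name:

def check_pyarrow(_):
    # Attempt the import on the executor and report what happened.
    try:
        import pyarrow
        return [pyarrow.__version__]
    except ImportError:
        return ['missing']

# Use several partitions so the check is actually scheduled on worker nodes.
print(spark.sparkContext.parallelize(range(8), 8).mapPartitions(check_pyarrow).collect())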

Below is a code example that works in local-mode on stat1004 (pyspark2 --master local[2]) and fails in yarn mode (pyspark2 --master yarn):

import pyspark.sql.functions as F

df = spark.range(0, 1000).withColumn('id', (F.col('id') / 100).cast('integer')).withColumn('v', F.rand())

@F.pandas_udf(df.schema, F.PandasUDFType.GROUPED_MAP)
def pandas_subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df2 = df.groupby('id').apply(pandas_subtract_mean)
df2.show()

Interestingly, the given example doesn't fail if the group-by has nothing to do (a single key).
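
One reading of "single key" is a grouping column with only one distinct value; a rough sketch of that variant, reusing the df and pandas_subtract_mean defined above:

# Force every row into a single group, so the grouped-map UDF only sees one key.
df_single = df.withColumn('id', F.lit(0))
df_single.groupby('id').apply(pandas_subtract_mean).show()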

Ottomata subscribed.

TODO: check if we can use the debian python-arrow package instead of building a version into our custom spark2 package.

Ottomata triaged this task as Medium priority. May 2 2019, 5:10 PM

So, I'm not sure why pyarrow wouldn't load in YARN; it is available on all the workers. Hmm.

NO WAY !!!! I'm super sorry for having derailed that :(

@JAllemandou I couldn't get your example to fail in either local or YARN.

@Groceryheist, can you paste your python code that fails? I don't use notebooks often and there isn't an easy way to view your .ipynb file.

!!!

Quite a few of the worker nodes DID NOT have pyarrow! They were running my older 2.3.1-bin-hadoop2.6-1~stretch1 version that didn't have pyarrow included. I just ran a cumin command to upgrade those that didn't to 2.3.1-bin-hadoop2.6-3~stretch1.

sudo cumin 'O:analytics_cluster::hadoop::worker' 'apt-get install spark2'

All hosts should now have the proper version with pyarrow.
https://debmonitor.wikimedia.org/packages/spark2

@Groceryheist, can you try now?

Hm actually, I'd hope that the spark-assembly.zip that we ship to HDFS would be used for this, and that it would include the proper python dependencies too. I just checked, and it doesn't! It only contains the java deps.

TODO: see if we can put the python deps in the spark-assembly.zip file when we build the package.
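
For context, a generic way to make Python deps available to YARN executors without relying on what is installed on each worker is to distribute an archive and point the executor PYTHONPATH at it. This is only a rough sketch; the archive path and alias below are hypothetical, and this is not the actual spark-assembly.zip mechanism:

from pyspark.sql import SparkSession

# Hypothetical archive of Python packages (including pyarrow) uploaded to HDFS.
# YARN unpacks it on each executor under the alias given after '#'.
spark = (
    SparkSession.builder
    .master('yarn')
    .config('spark.yarn.dist.archives', 'hdfs:///user/someuser/pydeps.zip#pydeps')
    .config('spark.executorEnv.PYTHONPATH', 'pydeps')
    .getOrCreate()
)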

This was resolved by shipping pyarrow and other python deps with the spark2 package.

Ottomata set the point value for this task to 3.
Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.