
PySpark 2 cannot find numpy
Closed, Invalid · Public

Description

I launch PySpark as follows:

pyspark2 --master yarn --executor-memory 4G --executor-cores 2 --driver-memory 8G --conf spark.driver.maxResultSize=4G

The following command gives an error:

from pyspark.ml.regression import RandomForestRegressor

And the error message is:

ImportError: No module named numpy

Is there anything I can do to install numpy on the worker machines?

Event Timeline

Installing numpy locally seems to have solved the problem for now.

Hi! We have deployed python3-numpy and python-numpy on the analytics worker nodes and on the stat boxes (I just tested import numpy in python2/3 and it seems to work).

Can you tell me what you installed and where? Just so I have an idea of how to reproduce it.

I just tried the example from your description on stat1004 and it works fine for me!

@elukey, thanks for the reply. I have a virtual environment for python2 under /home/bmansurov/venv/2/ on stat1005. The problem was happening before I installed numpy in that environment: I would activate it and run the pyspark2 command from the description, but the import statement would return the error. After installing numpy I no longer get an error, but I suspect I may once I run some code on the worker nodes.
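To tell apart "numpy is missing on the driver" from "numpy is missing on the executors", it helps to probe importability without actually importing. A minimal sketch using the stdlib; the executor-side line is commented out and assumes the `sc` SparkContext that the pyspark2 shell provides:

```python
import importlib.util

def has_module(name):
    # True if `name` can be imported in the current interpreter,
    # without actually importing it.
    return importlib.util.find_spec(name) is not None

# Driver side: does the local interpreter (e.g. the active venv) see numpy?
print(has_module("numpy"))

# Executor side (sketch, run inside the pyspark2 shell where `sc` exists):
# sc.parallelize(range(100)).map(lambda _: has_module("numpy")).distinct().collect()
```

If the driver check passes but the executor check returns `False`, the package is missing (or shadowed) only on the worker nodes.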

I tested pyspark on stat1005 as well and it seems to work fine. I suppose this means that your venv was preventing pyspark from picking up the numpy packages installed on the stat/worker hosts?

That seems like a reasonable explanation. I'll close the task. Thanks for looking into it.

elukey renamed this task from fibaaaaaaa to PySpark 2 cannot find numpy. Jul 2 2018, 6:18 AM
elukey closed this task as Resolved.
elukey changed the task status from Resolved to Invalid.
elukey raised the priority of this task from High to Needs Triage.
elukey updated the task description.