
Cannot run hive queries with wmfdata package and spark
Closed, Resolved, Public

Description

The notebook I use for Growth team reporting is not running the way it did last week. I believe the error has something to do with changes in the wmfdata package or in Spark.

My notebook is located at notebook1004.eqiad.wmnet:/user/mmiller/notebooks/homepage_reporting-2020-01-06.ipynb.

I am trying to run hive.run commands with these parameters:

spark_master = 'yarn',
spark_config = {'spark.driver.memory': '8g',
                'spark.executor.memory' : '2g'})

These are the errors I see:

image.png (400×873 px, 82 KB)

image.png (70×716 px, 14 KB)

Event Timeline

MMiller_WMF renamed this task from Cannot run notebook with wmfdata package and spark to Cannot run hive queries with wmfdata package and spark. Jan 13 2020, 10:46 PM
MMiller_WMF created this task.
MMiller_WMF updated the task description.

I've started working on this. I have successfully reproduced the problem, so I should be able to put out a fix within the next couple of hours.

nshahquinn-wmf claimed this task.

Okay, I've fixed the ImportError with this commit. We made a change to how we declared the package's version that caused it to try to import some of its dependencies before they had actually been installed.
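For anyone who hits the same class of problem in another package, here is a minimal sketch of one common way to avoid it. It is not the actual wmfdata fix. It assumes the version is declared as a `__version__ = "..."` line in the package's `__init__.py`, and the helper name `read_version` is hypothetical. The idea is to read the version as plain text so that `setup.py` never has to import the package, and therefore never imports dependencies that have not been installed yet.

```python
import re
from pathlib import Path

# Hypothetical helper, not taken from wmfdata: parse the version out of the
# source text instead of importing the package, so that no dependency of the
# package has to be importable at install time.
_VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def read_version(init_path):
    """Return the string assigned to __version__ in the file at init_path."""
    source = Path(init_path).read_text(encoding="utf-8")
    match = _VERSION_RE.search(source)
    if match is None:
        raise RuntimeError(f"No __version__ assignment found in {init_path}")
    return match.group(1)
```

In a `setup.py`, the result could then be passed as `version=read_version(...)`, with the path to the package's own `__init__.py` filled in. Because the file is only read and never executed, any `import` statements in it have no effect during installation.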

I'm not sure about the Py4JJavaError; it looks like your connection to Spark was closed before the query finished running. I don't think you could have encountered it in the same session as the ImportError, because the ImportError caused the installation to fail completely.

@MMiller_WMF, if you run into any further problems after upgrading to the latest version, please let me know!