Upon connecting to the cluster with SparkR from stat1005, the following behavior is observed, possibly indicating that R is not installed on all workers:
1. Connect (note: library(SparkR) is already loaded)
sparkR.session(master = "yarn",
               appName = "SparkR",
               sparkHome = "/usr/lib/spark2/",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.driver.cores = "1",
                                  spark.executor.memory = "2g"))
It seems OK:
Spark package found in SPARK_HOME: /usr/lib/spark2/
Launching java with spark-submit command /usr/lib/spark2//bin/spark-submit --driver-memory "4g" sparkr-shell /tmp/RtmpPd5D4Z/backend_portb1d9abf347
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
18/04/17 10:24:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Java ref type org.apache.spark.sql.SparkSession id 1
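(A quick sanity check that the session is live, as a minimal optional sketch; note that this is a driver-side call only, so it succeeds even when R is missing on the workers:)

# Driver-side only: does not touch the executors.
sparkR.version()   # should return the Spark version string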
2. Do something (anything, in fact, results in the same problem):
df <- createDataFrame(iris)
Warning messages:
1: In FUN(X[[i]], ...) : Use Sepal_Length instead of Sepal.Length as column name
2: In FUN(X[[i]], ...) : Use Sepal_Width instead of Sepal.Width as column name
3: In FUN(X[[i]], ...) : Use Petal_Length instead of Petal.Length as column name
4: In FUN(X[[i]], ...) : Use Petal_Width instead of Petal.Width as column name
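(The warnings are harmless: SparkR rewrites "." in column names because dots have special meaning in Spark SQL column references. They can be avoided by renaming the columns up front, as in this minimal sketch; iris2/df2 are just illustrative names:)

iris2 <- iris
names(iris2) <- gsub("\\.", "_", names(iris2))  # Sepal.Length -> Sepal_Length, etc.
df2 <- createDataFrame(iris2)                   # same data, no warnings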
and here we go then:
head(filter(df, df$Sepal_Length > 0))
This fails, with the executors essentially reporting:
18/04/17 10:28:25 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, analytics1045.eqiad.wmnet, executor 2): java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory
(verbose output is suppressed here).
After searching for a possible cause for some time, I have noted that the typical first piece of advice is to ask oneself whether R is installed on all worker nodes.
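(Once R is installed on the workers, one way to confirm that the executors can launch Rscript is a trivial distributed computation, sketched below; spark.lapply runs the function inside executor-side R processes, so it can only succeed if Rscript is available there:)

# Runs on the executors; should return each task's Rscript path
# instead of failing with the java.io.IOException above.
res <- spark.lapply(1:4, function(x) Sys.which("Rscript"))
unlist(res)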
NOTE: The same happens when starting a SparkR session as documented on Analytics/Systems/Cluster/Spark:
spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
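(For reference, the shell invocation above should correspond roughly to the following in-session configuration; this mapping assumes spark2R forwards these flags to spark-submit unchanged:)

# Assumed equivalent of the spark2R flags above.
sparkR.session(master = "yarn",
               sparkConfig = list(spark.executor.memory = "2g",
                                  spark.executor.cores = "1",
                                  spark.driver.memory = "4g"))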
Please advise.
NOTE: Since the problem described above has been resolved, this ticket will now be used to report the results of SparkR tests.