When connecting to the cluster with `SparkR` from **stat1005**, the following behavior is observed; it possibly indicates that R is not installed on all worker nodes:
**1. Connect (note: library(SparkR) is already loaded)**
```
sparkR.session(master = "yarn", appName = "SparkR", sparkHome = "/usr/lib/spark2/",
               sparkConfig = list(spark.driver.memory = "4g",
                                  spark.driver.cores = "1",
                                  spark.executor.memory = "2g"))
```
The session appears to start correctly:
```
Spark package found in SPARK_HOME: /usr/lib/spark2/
Launching java with spark-submit command /usr/lib/spark2//bin/spark-submit --driver-memory "4g" sparkr-shell /tmp/RtmpPd5D4Z/backend_portb1d9abf347
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
18/04/17 10:24:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Java ref type org.apache.spark.sql.SparkSession id 1
```
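Everything up to this point runs on the driver (stat1005), where R is available, which is consistent with the session starting cleanly. A driver-only sanity check, a minimal sketch assuming the `sparkR.version()` and `sparkR.conf()` helpers of the installed Spark 2 release, also succeeds without involving the executors:
```
# Driver-side checks only; no R process is launched on the executors,
# so these succeed even if the workers lack R.
sparkR.version()                      # Spark version string
sparkR.conf("spark.driver.memory")    # driver memory as configured ("4g")
```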
**2. Do something (in fact, anything will result in the same problem):**
```
df <- createDataFrame(iris)
Warning messages:
1: In FUN(X[[i]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
2: In FUN(X[[i]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
3: In FUN(X[[i]], ...) :
Use Petal_Length instead of Petal.Length as column name
4: In FUN(X[[i]], ...) :
Use Petal_Width instead of Petal.Width as column name
```
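As an aside, these warnings can be avoided by replacing the dots in the R column names before the Spark DataFrame is created; a minimal sketch (the copy `iris2` is an illustrative name, not from the original session):
```
# Rename the columns up front so createDataFrame() does not have to.
iris2 <- iris
names(iris2) <- gsub(".", "_", names(iris2), fixed = TRUE)
df <- createDataFrame(iris2)
```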
and then run a simple filter:
```
head(filter(df, df$Sepal_Length > 0))
```
This fails with what essentially boils down to:
```
18/04/17 10:28:25 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, analytics1045.eqiad.wmnet, executor 2): java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory
```
(verbose output is suppressed here).
After searching for a possible cause for some time, I noted that the first piece of advice one typically finds is to ask whether R is installed on all worker nodes.
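That advice fits the symptom here. As far as I understand it, a DataFrame built with `createDataFrame()` from a local R data.frame is held as R-serialized partitions, so any action on it (even a JVM-side `filter`/`head`) requires the executors to spawn an `Rscript` worker to deserialize the data, and that spawn is exactly what fails. A direct probe of the workers, sketched with SparkR's `spark.lapply()` (which likewise launches `Rscript` on each executor), would be:
```
# spark.lapply() ships each element to an executor, which must start an
# Rscript worker process to evaluate the function. On a healthy cluster
# this returns worker hostnames; here it should fail with the same
# java.io.IOException: Cannot run program "Rscript".
nodes <- spark.lapply(1:16, function(i) Sys.info()[["nodename"]])
unique(unlist(nodes))
```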
**NOTE.** The same error occurs when starting a SparkR session as documented at [[ https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark | Analytics/Systems/Cluster/Spark ]]:
```
spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
```
Please advise.
**NOTE**: Since the problem described above has been resolved, this ticket will now be used to report the results of SparkR tests.