
java.io.IOException: Permission denied when trying to access the hadoop cluster
Closed, Resolved · Public

Description

When using pyspark3 --master yarn (or Jupyter, or spark3-submit) on stat1008 or stat1010 and running a simple query that accesses Hadoop:

import wmfdata as wmf

countries = wmf.spark.run("""
SELECT *
FROM canonical_data.countries
""")

It crashes with:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[2], line 1
----> 1 countries = wmf.spark.run("""
      2 SELECT *
      3 FROM canonical_data.countries
      4 """)

File /opt/conda-analytics/lib/python3.10/site-packages/wmfdata/spark.py:282, in run(commands)
    279 overall_result = None
    281 for cmd in commands:
--> 282     cmd_result = session.sql(cmd)
    283     # If the result has columns, the command was a query and therefore
    284     # results-producing. If not, it was a DDL or DML command and not
    285     # results-producing.
    286     if len(cmd_result.columns) > 0:

File /opt/conda-analytics/lib/python3.10/site-packages/pyspark/sql/session.py:723, in SparkSession.sql(self, sqlQuery)
    707 def sql(self, sqlQuery):
    708     """Returns a :class:`DataFrame` representing the result of the given query.
    709 
    710     .. versionadded:: 2.0.0
   (...)
    721     [Row(f1=1, f2='row1'), Row(f1=2, f2='row2'), Row(f1=3, f2='row3')]
    722     """
--> 723     return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)

File /opt/conda-analytics/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304, in JavaMember.__call__(self, *args)
   1298 command = proto.CALL_COMMAND_NAME +\
   1299     self.command_header +\
   1300     args_command +\
   1301     proto.END_COMMAND_PART
   1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
   1305     answer, self.gateway_client, self.target_id, self.name)
   1307 for temp_arg in temp_args:
   1308     temp_arg._detach()

File /opt/conda-analytics/lib/python3.10/site-packages/pyspark/sql/utils.py:117, in capture_sql_exception.<locals>.deco(*a, **kw)
    113 converted = convert_exception(e.java_exception)
    114 if not isinstance(converted, UnknownException):
    115     # Hide where the exception came from that shows a non-Pythonic
    116     # JVM exception message.
--> 117     raise converted from None
    118 else:
    119     raise

AnalysisException: java.lang.RuntimeException: java.io.IOException: Permission denied

Checking my user groups shows:

tchin@stat1008:~$ groups
wikidev render analytics-privatedata-users analytics-admins

The query succeeds on stat1009.

@gmodena has tried this and it works for him on stat1010 but not stat1008.

Details

Related Changes in GitLab:
Title: Update the guidance in the run_dev_instances.sh script
Reference: repos/data-engineering/airflow-dags!934
Author: btullis
Source branch: update_dev_docs
Dest branch: main

Event Timeline

I'm currently unable to reproduce this on stat1008.

I tried it with an existing conda environment in Jupyter, then I created a new cloned environment (as part of testing T380477) and both worked for me.

My groups only differ from yours in that I am also a member of adm and ops, but I doubt that this is the cause.

btullis@stat1008:~$ groups
wikidev adm render ops analytics-privatedata-users analytics-admins

I then tried pyspark3 --master yarn and that worked too.

[Screenshot attachment: image.png (705×913 px, 82 KB)]

I can repro by simply trying to run spark3-sql in the shell.

Fails on 1008 at application startup.

gmodena@stat1008:~$ spark3-sql 
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Permission denied
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:593)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:548)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: Permission denied
	at java.io.UnixFileSystem.createFileExclusively(Native Method)
	at java.io.File.createTempFile(File.java:2063)
	at org.apache.hadoop.hive.common.FileUtils.createTempFile(FileUtils.java:799)
	at org.apache.hadoop.hive.ql.session.SessionState.createTempFile(SessionState.java:957)
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:591)
	... 15 more

Works on 1010:

gmodena@stat1010:~$ spark3-sql 
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/21 18:03:43 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
24/11/21 18:03:43 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
24/11/21 18:03:43 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
24/11/21 18:03:43 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
24/11/21 18:03:43 WARN Utils: Service 'sparkDriver' could not bind on port 12003. Attempting port 12004.
24/11/21 18:03:43 WARN Utils: Service 'sparkDriver' could not bind on port 12004. Attempting port 12005.
24/11/21 18:03:43 WARN Utils: Service 'sparkDriver' could not bind on port 12005. Attempting port 12006.
24/11/21 18:03:44 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/11/21 18:03:44 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/11/21 18:03:44 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/11/21 18:03:44 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/11/21 18:03:44 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
24/11/21 18:03:44 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
24/11/21 18:03:44 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
24/11/21 18:03:44 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
24/11/21 18:03:44 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
24/11/21 18:03:44 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13003. Attempting port 13004.
24/11/21 18:03:44 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13004. Attempting port 13005.
24/11/21 18:03:44 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13005. Attempting port 13006.
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
24/11/21 18:03:46 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: local[*], Application Id: local-1732212224251
spark-sql (default)>

Weird. Once again, it works for me on stat1008. I think that this must be something to do with the contents of your home directory.

btullis@stat1008:~$ spark3-sql
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/21 18:10:08 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12003. Attempting port 12004.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12004. Attempting port 12005.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12005. Attempting port 12006.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12006. Attempting port 12007.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12007. Attempting port 12008.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12008. Attempting port 12009.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12009. Attempting port 12010.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12010. Attempting port 12011.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4048. Attempting port 4049.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4049. Attempting port 4050.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4050. Attempting port 4051.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13003. Attempting port 13004.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13004. Attempting port 13005.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13005. Attempting port 13006.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13006. Attempting port 13007.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13007. Attempting port 13008.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13008. Attempting port 13009.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13009. Attempting port 13010.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13010. Attempting port 13011.
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
24/11/21 18:10:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: local[*], Application Id: local-1732212609631
spark-sql (default)>

I've found out what is causing this. There is an easy enough short-term fix, plus a few options for how to stop it happening again.

The reason that Spark fails to run is that it tries to open the directory /tmp/$username/ as a local scratch directory with write permission (Hive's default local scratchdir is ${java.io.tmpdir}/${user.name}, i.e. /tmp/$username), which is reasonable enough.

Caused by: java.io.IOException: Permission denied
	at java.io.UnixFileSystem.createFileExclusively(Native Method)
	at java.io.File.createTempFile(File.java:2063)
	at org.apache.hadoop.hive.common.FileUtils.createTempFile(FileUtils.java:799)

However, in this case /tmp/gmodena and /tmp/tchin were already present and were owned by the analytics-privatedata user, whereas /tmp/btullis was owned by btullis.

gmodena@stat1008:~$ ls -ld /tmp/gmodena/
drwxr-xr-x 6 analytics-privatedata analytics-privatedata 4096 Nov 14 12:06 /tmp/gmodena/

gmodena@stat1008:~$ ls -ld /tmp/tchin/
drwxr-xr-x 6 analytics-privatedata analytics-privatedata 4096 Sep  3 18:48 /tmp/tchin/

gmodena@stat1008:~$ ls -ld /tmp/btullis/
drwx------ 3 btullis wikidev 4096 Nov 21 18:15 /tmp/btullis/

I believe that this is caused by a path clash with the run_dev_instance.sh script from the Airflow-DAGs repository.

If run_dev_instance.sh is executed with -m /tmp/my_username (as per the example given) before a user runs Spark for the first time, then the /tmp/my_username path will be created with mkdir -p and owned by the system user being used, thus breaking Spark for that user, as illustrated below.
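For illustration, the clash can be simulated by hand (a hypothetical sketch; my_username is a placeholder, and the sudo invocation assumes you can run commands as the system user):

# Create the directory as the system user before the real user has ever
# run Spark on this host, as run_dev_instance.sh -m /tmp/my_username would.
sudo -u analytics-privatedata mkdir -p /tmp/my_username
ls -ld /tmp/my_username
# The directory is now owned by analytics-privatedata, so my_username's
# Spark session can no longer create temp files in it.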

I was able to fix Spark in the short term with:

btullis@stat1008:~$ sudo chown gmodena /tmp/gmodena
btullis@stat1008:~$ sudo chown tchin /tmp/tchin

...but this won't stop it happening again in future.

I'm hopeful that we will be able to get rid of the run_dev_instance.sh script in the near future, as the work to migrate Airflow to Kubernetes reaches its conclusion. However, until then, it would be useful to have something to help avoid a repeat of this incident.

It's also worth noting that hive.exec.scratchdir is configurable: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27362070#AdminManualConfiguration-HiveConfigurationVariables
...so we could change that as well.
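For example, a per-user override along these lines might work (an untested sketch; note that the local temp file in the stack trace above is governed by hive.exec.local.scratchdir, the local-filesystem counterpart of hive.exec.scratchdir):

# Point the scratch directories at paths the invoking user owns.
spark3-sql \
    --hiveconf hive.exec.scratchdir=/tmp/hive-${USER} \
    --hiveconf hive.exec.local.scratchdir=/tmp/${USER}-scratch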

Spark is also about to be upgraded from version 3.1.2 to version 3.5.3, so this behaviour may also change as part of that upgrade.

Maybe just adding some guidance to https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide#Development_instance would be a good enough fix, for now.

Wow this is some amazing detective work! Thanks Ben!!