
spark3 in YARN master mode exhibits warnings when the HDFS namenodes are in the failed-over state
Open, LowPublic

Description

When the HDFS namenodes are in the failed-over state, a spark3 session emits warnings indicating that it is first attempting to connect to the standby namenode.

For example, here is the service state of the HDFS namenodes.

btullis@an-master1001:/var/log/hadoop-hdfs$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1001.eqiad.wmnet:8040                     standby   
an-master1002.eqiad.wmnet:8040                     active
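
The failed-over condition above can also be detected programmatically. A minimal sketch (a hypothetical helper, not part of any existing tooling) that parses the `-getAllServiceState` output and flags when an-master1001 is not the active namenode:

```python
def parse_service_states(output: str) -> dict:
    """Map each namenode host:port to its reported HA state."""
    states = {}
    for line in output.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:
            host, state = parts
            states[host] = state
    return states

# Sample output as shown by `hdfs haadmin -getAllServiceState` above.
sample = """\
an-master1001.eqiad.wmnet:8040                     standby
an-master1002.eqiad.wmnet:8040                     active
"""

states = parse_service_states(sample)
# The cluster is "failed over" when the usual primary is not active.
failed_over = states.get("an-master1001.eqiad.wmnet:8040") != "active"
```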

Here is a test spark3-shell session being started on a stat server.

btullis@stat1004:/etc/hadoop/conf$ spark3-shell --master yarn
Running /opt/conda-analytics/bin/spark-shell $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/05 11:03:20 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/06/05 11:03:22 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:03:22 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:03:23 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:03:28 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://stat1004.eqiad.wmnet:4040
Spark context available as 'sc' (master = yarn, app id = application_1678266962370_533858).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_362)
Type in expressions to have them evaluated.
Type :help for more information.

Note the lines:

23/06/05 11:03:22 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

These warnings are shown repeatedly to the user and also appear when the user views the logs of their application.
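A likely explanation (an assumption, not confirmed in this task): the HDFS client uses ConfiguredFailoverProxyProvider, which tries the namenodes in the order they are listed in hdfs-site.xml. While an-master1001 is standby, every fresh client connection first hits the standby, logs the StandbyException warning, and only then fails over to the active namenode. The relevant client configuration looks roughly like this (the nn1/nn2 namenode IDs below are illustrative placeholders; the nameservice "analytics-hadoop" is taken from the hdfs:// URIs in the logs):

```xml
<!-- hdfs-site.xml (sketch; nn1/nn2 IDs are hypothetical) -->
<property>
  <name>dfs.ha.namenodes.analytics-hadoop</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.analytics-hadoop</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

If the ordering is indeed the cause, switching the proxy provider to org.apache.hadoop.hdfs.server.namenode.ha.RequestHedgingProxyProvider (which contacts both namenodes concurrently) might avoid the warning, though that trade-off would need testing.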

btullis@stat1004:/etc/hadoop/conf$ yarn logs -applicationId application_1678266962370_533916
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8

End of LogType:prelaunch.err
******************************************************************************

Container: container_e75_1678266962370_533916_01_000001 on an-worker1147.eqiad.wmnet_8041_1685964080990
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Jun 05 11:21:21 +0000 2023
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e75_1678266962370_533916_01_000001 on an-worker1147.eqiad.wmnet_8041_1685964080990
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Jun 05 11:21:21 +0000 2023
LogLength:6340
LogContents:
23/06/05 11:17:25 INFO SignalUtils: Registering signal handler for TERM
23/06/05 11:17:25 INFO SignalUtils: Registering signal handler for HUP
23/06/05 11:17:25 INFO SignalUtils: Registering signal handler for INT
23/06/05 11:17:25 INFO SecurityManager: Changing view acls to: btullis
23/06/05 11:17:25 INFO SecurityManager: Changing modify acls to: btullis
23/06/05 11:17:25 INFO SecurityManager: Changing view acls groups to: 
23/06/05 11:17:25 INFO SecurityManager: Changing modify acls groups to: 
23/06/05 11:17:25 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(btullis); groups with view permissions: Set(); users  with modify permissions: Set(btullis); groups with modify permissions: Set()
23/06/05 11:17:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/05 11:17:25 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1678266962370_533916_000001
23/06/05 11:17:26 INFO YarnRMClient: Registering the ApplicationMaster
23/06/05 11:17:26 INFO TransportClientFactory: Successfully created connection to stat1004.eqiad.wmnet/10.64.5.104:12000 after 256 ms (193 ms spent in bootstraps)
23/06/05 11:17:26 INFO SparkHadoopUtil: Updating delegation tokens for current user.
23/06/05 11:17:26 INFO ApplicationMaster: Preparing Local resources
23/06/05 11:17:27 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:17:27 INFO ApplicationMaster: 
===============================================================================
Default YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
    SPARK_YARN_STAGING_DIR -> hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533916
    SPARK_USER -> btullis
    PYTHONPATH -> /opt/conda-analytics/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip:/opt/conda-analytics/lib/python3.10/site-packages/pyspark/python/::/srv/deployment/analytics/refinery/python<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip
    LD_LIBRARY_PATH -> /usr/lib/hadoop/lib/native
    REQUESTS_CA_BUNDLE -> /etc/ssl/certs/ca-certificates.crt

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx1024m \ 
      '-Djava.net.useSystemProxies=True' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.network.crypto.keyLength=256' \ 
      '-Dspark.network.crypto.enabled=true' \ 
      '-Dspark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256' \ 
      '-Dspark.driver.blockManager.port=13000' \ 
      '-Dspark.ui.port=4040' \ 
      '-Dspark.network.crypto.saslFallback=false' \ 
      '-Dspark.driver.port=12000' \ 
      '-Dspark.authenticate=true' \ 
      '-Dspark.port.maxRetries=100' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.YarnCoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@stat1004.eqiad.wmnet:12000 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1678266962370_533916 \ 
      --resourceProfileId \ 
      0 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    pyspark.zip -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533916/pyspark.zip" } size: 886596 timestamp: 1685963842629 type: FILE visibility: PRIVATE
    __spark_libs__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/spark/share/lib/spark-3.1.2-assembly.jar" } size: 255798851 timestamp: 1683815961934 type: ARCHIVE visibility: PUBLIC
    py4j-0.10.9-src.zip -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533916/py4j-0.10.9-src.zip" } size: 41587 timestamp: 1685963842717 type: FILE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533916/__spark_conf__.zip" } size: 203054 timestamp: 1685963842824 type: ARCHIVE visibility: PRIVATE

===============================================================================
23/06/05 11:17:27 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
23/06/05 11:17:27 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/06/05 11:17:27 INFO YarnAllocator: Resource profile 0 doesn't exist, adding it
23/06/05 11:17:27 INFO Configuration: resource-types.xml not found
23/06/05 11:17:27 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/06/05 11:17:27 INFO YarnAllocator: Resource profile 0 doesn't exist, adding it
23/06/05 11:17:27 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
23/06/05 11:21:19 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:21:19 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:21:19 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
23/06/05 11:21:19 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
23/06/05 11:21:19 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
23/06/05 11:21:19 INFO ApplicationMaster: Deleting staging directory hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533916
23/06/05 11:21:19 INFO ShutdownHookManager: Shutdown hook called

End of LogType:stderr
***********************************************************************


End of LogType:stdout
***********************************************************************

Container: container_e75_1678266962370_533916_01_000001 on an-worker1147.eqiad.wmnet_8041_1685964080990
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Mon Jun 05 11:21:21 +0000 2023
LogLength:4457
LogContents:
2023-06-05 11:17:23,651 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-06-05 11:17:24,378 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:374)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:833)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:829)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:829)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1621)
	at org.apache.hadoop.ipc.Client.call(Client.java:1450)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:800)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1680)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:366)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

End of LogType:container-localizer-syslog
*******************************************************************************************

Event Timeline

Note that a spark2-shell --master yarn session does not exhibit these warnings at the console.

btullis@stat1004:/etc/hadoop/conf$ spark2-shell --master yarn
PYSPARK_PYTHON=python3.7
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark2/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/05 11:24:58 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/06/05 11:25:05 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://stat1004.eqiad.wmnet:4040
Spark context available as 'sc' (master = yarn, app id = application_1678266962370_533928).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_362)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

However, the yarn logs for the spark2 application still show similar errors.

btullis@stat1004:/etc/hadoop/conf$ yarn logs -applicationId application_1678266962370_533928
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Container: container_e75_1678266962370_533928_01_000001 on an-worker1137.eqiad.wmnet_8041_1685964311498
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Mon Jun 05 11:25:11 +0000 2023
LogLength:4457
LogContents:
2023-06-05 11:25:01,714 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-06-05 11:25:02,466 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:374)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:833)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:829)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:829)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1621)
	at org.apache.hadoop.ipc.Client.call(Client.java:1450)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:800)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1680)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:366)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

End of LogType:container-localizer-syslog
*******************************************************************************************


End of LogType:prelaunch.err
******************************************************************************

Container: container_e75_1678266962370_533928_01_000001 on an-worker1137.eqiad.wmnet_8041_1685964311498
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Jun 05 11:25:11 +0000 2023
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e75_1678266962370_533928_01_000001 on an-worker1137.eqiad.wmnet_8041_1685964311498
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Jun 05 11:25:11 +0000 2023
LogLength:10406
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/var/lib/hadoop/data/c/yarn/local/filecache/27381/spark-2.4.4-assembly.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
23/06/05 11:25:03 INFO SignalUtils: Registered signal handler for TERM
23/06/05 11:25:03 INFO SignalUtils: Registered signal handler for HUP
23/06/05 11:25:03 INFO SignalUtils: Registered signal handler for INT
23/06/05 11:25:03 INFO SecurityManager: Changing view acls to: btullis
23/06/05 11:25:03 INFO SecurityManager: Changing modify acls to: btullis
23/06/05 11:25:03 INFO SecurityManager: Changing view acls groups to: 
23/06/05 11:25:03 INFO SecurityManager: Changing modify acls groups to: 
23/06/05 11:25:03 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(btullis); groups with view permissions: Set(); users  with modify permissions: Set(btullis); groups with modify permissions: Set()
23/06/05 11:25:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/05 11:25:03 INFO ApplicationMaster: Preparing Local resources
23/06/05 11:25:04 WARN Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:374)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:833)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:829)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:829)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1621)
	at org.apache.hadoop.ipc.Client.call(Client.java:1450)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
	at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:800)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1680)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$8$$anonfun$apply$3.apply(ApplicationMaster.scala:220)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$8$$anonfun$apply$3.apply(ApplicationMaster.scala:217)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$8.apply(ApplicationMaster.scala:217)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$8.apply(ApplicationMaster.scala:182)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
	at org.apache.spark.deploy.yarn.ApplicationMaster.<init>(ApplicationMaster.scala:182)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:802)
	at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:833)
	at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
23/06/05 11:25:04 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1678266962370_533928_000001
23/06/05 11:25:05 INFO YarnRMClient: Registering the ApplicationMaster
23/06/05 11:25:05 INFO TransportClientFactory: Successfully created connection to stat1004.eqiad.wmnet/10.64.5.104:12000 after 334 ms (250 ms spent in bootstraps)
23/06/05 11:25:05 INFO ApplicationMaster: 
===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/share/java/apache-log4j-extras.jar:<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
    SPARK_DIST_CLASSPATH -> /etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*:/usr/share/java/apache-log4j-extras.jar:
    SPARK_YARN_STAGING_DIR -> hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533928
    SPARK_USER -> btullis
    LD_LIBRARY_PATH -> /usr/lib/hadoop/lib/native
    REQUESTS_CA_BUNDLE -> /etc/ssl/certs/ca-certificates.crt
    ARROW_PRE_0_15_IPC_FORMAT -> 1

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx1024m \ 
      '-Djava.net.useSystemProxies=True' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.network.crypto.keyLength=256' \ 
      '-Dspark.network.crypto.enabled=true' \ 
      '-Dspark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256' \ 
      '-Dspark.driver.blockManager.port=13000' \ 
      '-Dspark.ui.port=4040' \ 
      '-Dspark.network.crypto.saslFallback=false' \ 
      '-Dspark.driver.port=12000' \ 
      '-Dspark.authenticate=true' \ 
      '-Dspark.port.maxRetries=100' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@stat1004.eqiad.wmnet:12000 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1678266962370_533928 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __spark_libs__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/spark/share/lib/spark-2.4.4-assembly.zip" } size: 195258057 timestamp: 1614003250646 type: ARCHIVE visibility: PUBLIC
    __spark_conf__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533928/__spark_conf__.zip" } size: 245305 timestamp: 1685964300754 type: ARCHIVE visibility: PRIVATE

===============================================================================
23/06/05 11:25:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
23/06/05 11:25:05 INFO Configuration: resource-types.xml not found
23/06/05 11:25:05 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/06/05 11:25:05 INFO ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
23/06/05 11:25:05 INFO ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
23/06/05 11:25:05 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
23/06/05 11:25:09 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:25:09 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:25:09 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
23/06/05 11:25:09 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
23/06/05 11:25:09 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
23/06/05 11:25:09 INFO ApplicationMaster: Deleting staging directory hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533928
23/06/05 11:25:10 INFO ShutdownHookManager: Shutdown hook called

End of LogType:stderr
***********************************************************************


End of LogType:stdout
***********************************************************************

So it looks as though this is related to YARN log aggregation.
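For reference, an HDFS HA client only follows a failover transparently when its configuration names both namenodes and a failover proxy provider; the `StandbyException` warnings above are what you see when a client keeps retrying whichever namenode it reached first. A minimal `hdfs-site.xml` sketch (property names are the stock Hadoop ones; the nameservice and namenode IDs are taken from the URIs and failover command in this task, so treat the exact values as illustrative):

```xml
<!-- HA client settings for the analytics-hadoop nameservice (sketch, not the live config). -->
<property>
  <name>dfs.nameservices</name>
  <value>analytics-hadoop</value>
</property>
<property>
  <name>dfs.ha.namenodes.analytics-hadoop</name>
  <value>an-master1001-eqiad-wmnet,an-master1002-eqiad-wmnet</value>
</property>
<property>
  <!-- Without a failover proxy provider, clients do not retry the other
       namenode and emit the StandbyException warnings seen above. -->
  <name>dfs.client.failover.proxy.provider.analytics-hadoop</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```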

I'm going to fail back to the primary namenode now, to see whether that restores normal behaviour.

I failed back to an-master1001 and the failover completed successfully:

btullis@an-master1001:/var/log/hadoop-hdfs$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful
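The state check and failover above can also be verified programmatically. A small sketch (a hypothetical helper, not part of any existing tooling) that parses the output of `hdfs haadmin -getAllServiceState` and returns the active namenode:

```python
def active_namenode(haadmin_output: str) -> str:
    """Return the host:port whose reported HA state is 'active'.

    Expects the text produced by `hdfs haadmin -getAllServiceState`,
    one "<host:port>   <state>" pair per line.
    """
    for line in haadmin_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "active":
            return parts[0]
    raise ValueError("no active namenode found")


sample = """an-master1001.eqiad.wmnet:8040                     active
an-master1002.eqiad.wmnet:8040                     standby"""
print(active_namenode(sample))  # an-master1001.eqiad.wmnet:8040
```

This could be fed from `subprocess.run([...], capture_output=True)` wrapping the kerberized command shown earlier, if one wanted to alert on an unexpected failed-over state.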

Immediately, the warnings from a spark3-shell session are gone:

btullis@stat1004:/etc/hadoop/conf$ spark3-shell --master yarn
Running /opt/conda-analytics/bin/spark-shell $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/05 11:44:34 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/06/05 11:44:42 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://stat1004.eqiad.wmnet:4040
Spark context available as 'sc' (master = yarn, app id = application_1678266962370_533978).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_362)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Similarly, the warnings and errors are gone from the yarn logs for the same spark session.

btullis@stat1004:/etc/hadoop/conf$ yarn logs -applicationId application_1678266962370_533978
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8

End of LogType:prelaunch.err
******************************************************************************

Container: container_e75_1678266962370_533978_01_000001 on an-worker1093.eqiad.wmnet_8041_1685965525731
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Jun 05 11:45:25 +0000 2023
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e75_1678266962370_533978_01_000001 on an-worker1093.eqiad.wmnet_8041_1685965525731
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Jun 05 11:45:25 +0000 2023
LogLength:5247
LogContents:
23/06/05 11:44:40 INFO SignalUtils: Registering signal handler for TERM
23/06/05 11:44:40 INFO SignalUtils: Registering signal handler for HUP
23/06/05 11:44:40 INFO SignalUtils: Registering signal handler for INT
23/06/05 11:44:41 INFO SecurityManager: Changing view acls to: btullis
23/06/05 11:44:41 INFO SecurityManager: Changing modify acls to: btullis
23/06/05 11:44:41 INFO SecurityManager: Changing view acls groups to: 
23/06/05 11:44:41 INFO SecurityManager: Changing modify acls groups to: 
23/06/05 11:44:41 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users  with view permissions: Set(btullis); groups with view permissions: Set(); users  with modify permissions: Set(btullis); groups with modify permissions: Set()
23/06/05 11:44:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/05 11:44:41 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1678266962370_533978_000001
23/06/05 11:44:41 INFO YarnRMClient: Registering the ApplicationMaster
23/06/05 11:44:42 INFO TransportClientFactory: Successfully created connection to stat1004.eqiad.wmnet/10.64.5.104:12000 after 430 ms (361 ms spent in bootstraps)
23/06/05 11:44:42 INFO SparkHadoopUtil: Updating delegation tokens for current user.
23/06/05 11:44:42 INFO ApplicationMaster: Preparing Local resources
23/06/05 11:44:43 INFO ApplicationMaster: 
===============================================================================
Default YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
    SPARK_YARN_STAGING_DIR -> hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533978
    SPARK_USER -> btullis
    LD_LIBRARY_PATH -> /usr/lib/hadoop/lib/native
    REQUESTS_CA_BUNDLE -> /etc/ssl/certs/ca-certificates.crt

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx1024m \ 
      '-Djava.net.useSystemProxies=True' \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.network.crypto.keyLength=256' \ 
      '-Dspark.network.crypto.enabled=true' \ 
      '-Dspark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256' \ 
      '-Dspark.driver.blockManager.port=13000' \ 
      '-Dspark.ui.port=4040' \ 
      '-Dspark.network.crypto.saslFallback=false' \ 
      '-Dspark.driver.port=12000' \ 
      '-Dspark.authenticate=true' \ 
      '-Dspark.port.maxRetries=100' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.YarnCoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@stat1004.eqiad.wmnet:12000 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1678266962370_533978 \ 
      --resourceProfileId \ 
      0 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __spark_libs__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/spark/share/lib/spark-3.1.2-assembly.jar" } size: 255798851 timestamp: 1683815961934 type: ARCHIVE visibility: PUBLIC
    __spark_conf__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533978/__spark_conf__.zip" } size: 202650 timestamp: 1685965477500 type: ARCHIVE visibility: PRIVATE

===============================================================================
23/06/05 11:44:43 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
23/06/05 11:44:43 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/06/05 11:44:43 INFO YarnAllocator: Resource profile 0 doesn't exist, adding it
23/06/05 11:44:43 INFO Configuration: resource-types.xml not found
23/06/05 11:44:43 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/06/05 11:44:43 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
23/06/05 11:45:24 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:45:24 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:45:24 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
23/06/05 11:45:24 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
23/06/05 11:45:24 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
23/06/05 11:45:24 INFO ApplicationMaster: Deleting staging directory hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533978
23/06/05 11:45:24 INFO ShutdownHookManager: Shutdown hook called

End of LogType:stderr
***********************************************************************


End of LogType:stdout
***********************************************************************

Container: container_e75_1678266962370_533978_01_000001 on an-worker1093.eqiad.wmnet_8041_1685965525731
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Mon Jun 05 11:45:25 +0000 2023
LogLength:184
LogContents:
2023-06-05 11:44:38,507 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.

End of LogType:container-localizer-syslog
*******************************************************************************************

I also experienced a similar issue to the one above.

I was running my Spark application on stat1004 via Airflow using the analytics-privatedata user.

Airflow logs:

[2023-07-10, 10:52:27 UTC] {skein.py:94} INFO - Constructing skein Client with kwargs: {'principal': 'analytics-privatedata/stat1004.eqiad.wmnet@WIKIMEDIA', 'keytab': '/etc/security/keytabs/analytics-privatedata/analytics-privatedata.keytab'}
[2023-07-10, 10:52:50 UTC] {skein.py:237} INFO - SkeinHook Airflow SparkSkeinSubmitHook skein launcher gdi_equity_landscape_csv__load_csv.load_country_meta_data__20230628 application_1688722260742_14796 status: RUNNING - Waiting until finished.
[2023-07-10, 10:53:05 UTC] {skein.py:271} INFO - SkeinHook Airflow SparkSkeinSubmitHook skein launcher gdi_equity_landscape_csv__load_csv.load_country_meta_data__20230628 application_1688722260742_14796 - YARN application log collection is disabled. To view logs for the YARN App Master, run the following command:
	sudo -u analytics-privatedata yarn logs -appOwner analytics-privatedata -applicationId application_1688722260742_14796
If your App Master launched other YARN applications (e.g. a Spark app), you will need to look at these logs and run a simliar command but with the appropriate YARN application_id.
[2023-07-10, 10:53:05 UTC] {taskinstance.py:1768} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/ntsako_airflow_home_new_251/.conda/envs/airflow_251/lib/python3.10/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
    self._hook.submit(self._application)
  File "/srv/home/ntsako/airflow-dags/wmf_airflow_common/hooks/spark.py", line 430, in submit
    return self._skein_hook.submit()
  File "/srv/home/ntsako/airflow-dags/wmf_airflow_common/hooks/skein.py", line 282, in submit
    raise AirflowException(str(self))
airflow.exceptions.AirflowException: SkeinHook Airflow SparkSkeinSubmitHook skein launcher gdi_equity_landscape_csv__load_csv.load_country_meta_data__20230628 application_1688722260742_14796
[2023-07-10, 10:53:05 UTC] {taskinstance.py:1318} INFO - Marking task as FAILED. dag_id=gdi_equity_landscape_csv, task_id=load_csv.load_country_meta_data, execution_date=20230628T101945, start_date=20230710T105226, end_date=20230710T105305
[2023-07-10, 10:53:06 UTC] {standard_task_runner.py:100} ERROR - Failed to execute job 196 for task load_csv.load_country_meta_data (SkeinHook Airflow SparkSkeinSubmitHook skein launcher gdi_equity_landscape_csv__load_csv.load_country_meta_data__20230628 application_1688722260742_14796; 6526)
[2023-07-10, 10:53:06 UTC] {local_task_job.py:208} INFO - Task exited with return code 1
[2023-07-10, 10:53:06 UTC] {taskinstance.py:2578} INFO - 0 downstream tasks scheduled from follow-on schedule check

yarn logs:

Container: container_e87_1688722260742_14796_01_000001 on analytics1075.eqiad.wmnet_8041_1688986373967
LogAggregationType: AGGREGATED
======================================================================================================
LogType:application.driver.log
LogLastModifiedTime:Mon Jul 10 10:52:53 +0000 2023
LogLength:684
LogContents:
Running /opt/conda-analytics/bin/spark-submit $@
SPARK_HOME: /usr/lib/spark3
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_DRIVER_PYTHON=venv/bin/python
PYSPARK_PYTHON=venv/bin/python
venv/bin/python: can't open file '/var/lib/hadoop/data/e/yarn/local/usercache/analytics-privatedata/appcache/application_1688722260742_14796/container_e87_1688722260742_14796_01_000001/venv/lib/python3.7/site-packages/gdi_source/equity_landscape/loaders/load_csv.py': [Errno 2] No such file or directory
23/07/10 10:52:52 INFO ShutdownHookManager: Shutdown hook called
23/07/10 10:52:52 INFO ShutdownHookManager: Deleting directory /tmp/spark-e35ac1ba-bbff-409e-9933-5e9ccc63d30f

End of LogType:application.driver.log
***************************************************************************************

Container: container_e87_1688722260742_14796_01_000001 on analytics1075.eqiad.wmnet_8041_1688986373967
LogAggregationType: AGGREGATED
======================================================================================================
LogType:application.master.log
LogLastModifiedTime:Mon Jul 10 10:52:53 +0000 2023
LogLength:5166
LogContents:
23/07/10 10:52:48 INFO skein.ApplicationMaster: Starting Skein version 0.8.2
23/07/10 10:52:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/07/10 10:52:48 INFO skein.ApplicationMaster: Running as user analytics-privatedata
23/07/10 10:52:48 INFO conf.Configuration: resource-types.xml not found
23/07/10 10:52:48 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/07/10 10:52:48 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
23/07/10 10:52:48 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
23/07/10 10:52:48 INFO skein.ApplicationMaster: Application specification successfully loaded
23/07/10 10:52:49 INFO skein.ApplicationMaster: gRPC server started at analytics1075.eqiad.wmnet:36149
23/07/10 10:52:49 INFO skein.ApplicationMaster: WebUI server started at analytics1075.eqiad.wmnet:44517
23/07/10 10:52:49 INFO skein.ApplicationMaster: Registering application with resource manager
23/07/10 10:52:50 INFO skein.ApplicationMaster: Starting application driver
23/07/10 10:52:52 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 2, see logs for more information.
23/07/10 10:52:52 INFO skein.ApplicationMaster: Unregistering application with status FAILED
23/07/10 10:52:52 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
23/07/10 10:52:52 WARN ipc.Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:374)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:833)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:829)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:829)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1621)
	at org.apache.hadoop.ipc.Client.call(Client.java:1450)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
	at com.sun.proxy.$Proxy10.delete(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete(ClientNamenodeProtocolTranslatorPB.java:572)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy11.delete(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1621)
	at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:882)
	at org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:879)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:889)
	at com.anaconda.skein.ApplicationMaster.runOnExit(ApplicationMaster.java:530)
	at com.anaconda.skein.ApplicationMaster.access$900(ApplicationMaster.java:73)
	at com.anaconda.skein.ApplicationMaster$5.run(ApplicationMaster.java:553)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
23/07/10 10:52:52 INFO skein.ApplicationMaster: Deleted application directory hdfs://analytics-hadoop/user/analytics-privatedata/.skein/application_1688722260742_14796
23/07/10 10:52:52 INFO skein.ApplicationMaster: WebUI server shut down
23/07/10 10:52:52 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************


End of LogType:prelaunch.err
******************************************************************************

Container: container_e87_1688722260742_14796_01_000001 on analytics1075.eqiad.wmnet_8041_1688986373967
LogAggregationType: AGGREGATED
======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Jul 10 10:52:53 +0000 2023
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e87_1688722260742_14796_01_000001 on analytics1075.eqiad.wmnet_8041_1688986373967
LogAggregationType: AGGREGATED
======================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Mon Jul 10 10:52:53 +0000 2023
LogLength:4457
LogContents:
2023-07-10 10:52:35,249 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-07-10 10:52:36,163 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:374)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:833)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:829)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:829)
	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1621)
	at org.apache.hadoop.ipc.Client.call(Client.java:1450)
	at org.apache.hadoop.ipc.Client.call(Client.java:1403)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:800)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
	at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1680)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
	at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
	at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
	at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:366)
	at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

End of LogType:container-localizer-syslog
*******************************************************************************************
Gehel triaged this task as Low priority. Oct 18 2023, 8:51 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.
BTullis removed a subscriber: ntsako.