When the HDFS namenodes are in a failed-over state, a Spark 3 session emits warnings indicating that it is attempting to connect to the standby namenode.
For example, here is the service state of the HDFS namenodes.
btullis@an-master1001:/var/log/hadoop-hdfs$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1001.eqiad.wmnet:8040                     standby
an-master1002.eqiad.wmnet:8040                     active
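This ordering is what produces the warnings: with the usual HA client configuration, the HDFS client tries the configured namenodes in order and fails over to the next one when it receives a StandbyException, logging a warning each time. A minimal Python sketch of that retry pattern, for illustration only (none of these class or function names come from the real Hadoop client API):

```python
# Illustrative simulation of HDFS client failover behaviour.
# All names here are hypothetical, not the Hadoop API.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("Client")


class StandbyException(Exception):
    """Raised by a namenode that is in standby state."""


def read_via_failover(namenodes, states):
    """Try each configured namenode in order. On StandbyException, log a
    warning and fail over to the next one, mirroring the repeated
    'Operation category READ is not supported in state standby' messages."""
    for nn in namenodes:
        try:
            if states[nn] == "standby":
                raise StandbyException(
                    "Operation category READ is not supported in state standby")
            return nn  # active namenode found; the operation would proceed here
        except StandbyException as exc:
            log.warning(
                "Exception encountered while connecting to the server : %s", exc)
    raise RuntimeError("no active namenode found")


# With an-master1001 in standby (as above), the client warns once,
# then succeeds against an-master1002.
active = read_via_failover(
    ["an-master1001.eqiad.wmnet", "an-master1002.eqiad.wmnet"],
    {"an-master1001.eqiad.wmnet": "standby",
     "an-master1002.eqiad.wmnet": "active"})
print(active)  # an-master1002.eqiad.wmnet
```

The warnings are therefore expected whenever the first-listed namenode is the standby; the operation still succeeds after the client fails over.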
Here is a test spark3-shell session being started on a stat server.
btullis@stat1004:/etc/hadoop/conf$ spark3-shell --master yarn
Running /opt/conda-analytics/bin/spark-shell $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/05 11:03:20 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
23/06/05 11:03:22 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:03:22 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:03:23 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:03:28 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
Spark context Web UI available at http://stat1004.eqiad.wmnet:4040
Spark context available as 'sc' (master = yarn, app id = application_1678266962370_533858).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_362)
Type in expressions to have them evaluated.
Type :help for more information.

Note the lines:
23/06/05 11:03:22 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
These warnings are shown repeatedly to the user and also appear when the user retrieves the aggregated logs of an application.
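Despite the warnings, the application itself completes normally (note the `Final app status: SUCCEEDED` line in the stderr log). An ad-hoc sketch of the check one might run over an aggregated log to confirm the warnings are benign; this is not part of any Hadoop or YARN tooling, and the marker strings are simply taken from the log excerpts here:

```python
# Ad-hoc log analysis: count standby warnings and check whether the
# application nonetheless finished with a SUCCEEDED status.
STANDBY_MARKER = "StandbyException"
SUCCESS_MARKER = "Final app status: SUCCEEDED"


def summarize(log_text: str) -> dict:
    """Return the number of standby-related warning lines and whether
    the application master reported a successful final status."""
    lines = log_text.splitlines()
    return {
        "standby_warnings": sum(STANDBY_MARKER in line for line in lines),
        "succeeded": any(SUCCESS_MARKER in line for line in lines),
    }


sample = (
    "23/06/05 11:17:27 WARN Client: ... StandbyException ...\n"
    "23/06/05 11:21:19 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0"
)
print(summarize(sample))  # {'standby_warnings': 1, 'succeeded': True}
```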
btullis@stat1004:/etc/hadoop/conf$ yarn logs -applicationId application_1678266962370_533916
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
End of LogType:prelaunch.err
******************************************************************************
Container: container_e75_1678266962370_533916_01_000001 on an-worker1147.eqiad.wmnet_8041_1685964080990
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Jun 05 11:21:21 +0000 2023
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container
End of LogType:prelaunch.out
******************************************************************************
Container: container_e75_1678266962370_533916_01_000001 on an-worker1147.eqiad.wmnet_8041_1685964080990
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Jun 05 11:21:21 +0000 2023
LogLength:6340
LogContents:
23/06/05 11:17:25 INFO SignalUtils: Registering signal handler for TERM
23/06/05 11:17:25 INFO SignalUtils: Registering signal handler for HUP
23/06/05 11:17:25 INFO SignalUtils: Registering signal handler for INT
23/06/05 11:17:25 INFO SecurityManager: Changing view acls to: btullis
23/06/05 11:17:25 INFO SecurityManager: Changing modify acls to: btullis
23/06/05 11:17:25 INFO SecurityManager: Changing view acls groups to:
23/06/05 11:17:25 INFO SecurityManager: Changing modify acls groups to:
23/06/05 11:17:25 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(btullis); groups with view permissions: Set(); users with modify permissions: Set(btullis); groups with modify permissions: Set()
23/06/05 11:17:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/06/05 11:17:25 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1678266962370_533916_000001
23/06/05 11:17:26 INFO YarnRMClient: Registering the ApplicationMaster
23/06/05 11:17:26 INFO TransportClientFactory: Successfully created connection to stat1004.eqiad.wmnet/10.64.5.104:12000 after 256 ms (193 ms spent in bootstraps)
23/06/05 11:17:26 INFO SparkHadoopUtil: Updating delegation tokens for current user.
23/06/05 11:17:26 INFO ApplicationMaster: Preparing Local resources
23/06/05 11:17:27 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
23/06/05 11:17:27 INFO ApplicationMaster:
===============================================================================
Default YARN executor launch context:
env:
CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
SPARK_YARN_STAGING_DIR -> hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533916
SPARK_USER -> btullis
PYTHONPATH -> /opt/conda-analytics/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip:/opt/conda-analytics/lib/python3.10/site-packages/pyspark/python/::/srv/deployment/analytics/refinery/python<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.9-src.zip
LD_LIBRARY_PATH -> /usr/lib/hadoop/lib/native
REQUESTS_CA_BUNDLE -> /etc/ssl/certs/ca-certificates.crt
command:
{{JAVA_HOME}}/bin/java \
-server \
-Xmx1024m \
'-Djava.net.useSystemProxies=True' \
-Djava.io.tmpdir={{PWD}}/tmp \
'-Dspark.network.crypto.keyLength=256' \
'-Dspark.network.crypto.enabled=true' \
'-Dspark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256' \
'-Dspark.driver.blockManager.port=13000' \
'-Dspark.ui.port=4040' \
'-Dspark.network.crypto.saslFallback=false' \
'-Dspark.driver.port=12000' \
'-Dspark.authenticate=true' \
'-Dspark.port.maxRetries=100' \
-Dspark.yarn.app.container.log.dir=<LOG_DIR> \
-XX:OnOutOfMemoryError='kill %p' \
org.apache.spark.executor.YarnCoarseGrainedExecutorBackend \
--driver-url \
spark://CoarseGrainedScheduler@stat1004.eqiad.wmnet:12000 \
--executor-id \
<executorId> \
--hostname \
<hostname> \
--cores \
1 \
--app-id \
application_1678266962370_533916 \
--resourceProfileId \
0 \
--user-class-path \
file:$PWD/__app__.jar \
1><LOG_DIR>/stdout \
2><LOG_DIR>/stderr
resources:
pyspark.zip -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533916/pyspark.zip" } size: 886596 timestamp: 1685963842629 type: FILE visibility: PRIVATE
__spark_libs__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/spark/share/lib/spark-3.1.2-assembly.jar" } size: 255798851 timestamp: 1683815961934 type: ARCHIVE visibility: PUBLIC
py4j-0.10.9-src.zip -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533916/py4j-0.10.9-src.zip" } size: 41587 timestamp: 1685963842717 type: FILE visibility: PRIVATE
__spark_conf__ -> resource { scheme: "hdfs" host: "analytics-hadoop" port: -1 file: "/user/btullis/.sparkStaging/application_1678266962370_533916/__spark_conf__.zip" } size: 203054 timestamp: 1685963842824 type: ARCHIVE visibility: PRIVATE
===============================================================================
23/06/05 11:17:27 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
23/06/05 11:17:27 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/06/05 11:17:27 INFO YarnAllocator: Resource profile 0 doesn't exist, adding it
23/06/05 11:17:27 INFO Configuration: resource-types.xml not found
23/06/05 11:17:27 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/06/05 11:17:27 INFO YarnAllocator: Resource profile 0 doesn't exist, adding it
23/06/05 11:17:27 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
23/06/05 11:21:19 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:21:19 INFO ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. stat1004.eqiad.wmnet:12000
23/06/05 11:21:19 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
23/06/05 11:21:19 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
23/06/05 11:21:19 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
23/06/05 11:21:19 INFO ApplicationMaster: Deleting staging directory hdfs://analytics-hadoop/user/btullis/.sparkStaging/application_1678266962370_533916
23/06/05 11:21:19 INFO ShutdownHookManager: Shutdown hook called
End of LogType:stderr
***********************************************************************
End of LogType:stdout
***********************************************************************
Container: container_e75_1678266962370_533916_01_000001 on an-worker1147.eqiad.wmnet_8041_1685964080990
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Mon Jun 05 11:21:21 +0000 2023
LogLength:4457
LogContents:
2023-06-05 11:17:23,651 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-06-05 11:17:24,378 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:374)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:833)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:829)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:829)
at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1621)
at org.apache.hadoop.ipc.Client.call(Client.java:1450)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:800)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1680)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1524)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1521)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1536)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:366)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:364)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
End of LogType:container-localizer-syslog
*******************************************************************************************