Paste P54524

Spark History Server can't parse large event log files

Authored by brouberol on Jan 5 2024, 9:21 AM.
I've noticed that sufficiently large Spark event log files (>40MB, though the threshold could be lower) aren't being parsed successfully by the Spark History Server:
```
# ~40MB
24/01/05 08:18:00 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_622877 for listing data...
# ~40MB
24/01/05 08:18:00 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_622893.lz4 for listing data...
# ~40KB
24/01/05 08:18:01 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_622725 for listing data...
24/01/05 08:18:01 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_622893.lz4
# ~40MB
24/01/05 08:18:01 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_622678.lz4 for listing data...
```
For the "large" files, we don't see the `Finished parsing` log, meaning that we're stuck somewhere in https://github.com/apache/spark/blob/fbbcf9434ac070dd4ced4fb9efe32899c6db12a9/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L788-L835
We see the following HDFS audit logs on an-master1001.eqiad.wmnet:
```
2024-01-05 08:20:27,081 INFO FSNamesystem.audit: allowed=true ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.67.24.93 cmd=open src=/var/log/spark/application_1695896957545_622877 dst=null perm=null proto=rpc
2024-01-05 08:20:27,082 INFO FSNamesystem.audit: allowed=true ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.67.24.93 cmd=open src=/var/log/spark/application_1695896957545_622893.lz4 dst=null perm=null proto=rpc
2024-01-05 08:20:27,550 INFO FSNamesystem.audit: allowed=true ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.67.24.93 cmd=open src=/var/log/spark/application_1695896957545_622725 dst=null perm=null proto=rpc
2024-01-05 08:20:27,555 INFO FSNamesystem.audit: allowed=true ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.67.24.93 cmd=open src=/var/log/spark/application_1695896957545_622893.lz4 dst=null perm=null proto=rpc
2024-01-05 08:20:27,605 INFO FSNamesystem.audit: allowed=true ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.67.24.93 cmd=open src=/var/log/spark/application_1695896957545_622678.lz4 dst=null perm=null proto=rpc
```
So this is not a permission issue. As we have a way to regenerate these "big" files by executing a Spark job from Jupyter, I deleted them all from HDFS, to see what the system looks like when it has nothing to do. I then took a thread dump by running `kill -3 <spark history server PID>` (SIGQUIT makes the JVM print a full thread dump to stdout without terminating) on the worker node the container was scheduled on.
```
24/01/05 08:27:53 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://spark-history-analytics-hadoop-57774468c-sn6vl:18080
2024-01-05 08:32:44
Full thread dump OpenJDK 64-Bit Server VM (11.0.21+9-post-Debian-1deb11u1 mixed mode, sharing):
Threads class SMR info:
_java_thread_list=0x00007f123c4f9650, length=23, elements={
0x00007f12f8019800, 0x00007f12f8b8e800, 0x00007f12f8b90800, 0x00007f12f8b97800,
0x00007f12f8b99800, 0x00007f12f8b9b800, 0x00007f12f8b9d800, 0x00007f12f8b9f800,
0x00007f12f8bef000, 0x00007f12faf2f800, 0x00007f12fb1c5000, 0x00007f12fb1c8000,
0x00007f12fb1ca000, 0x00007f12fb1cc000, 0x00007f12fb1cd800, 0x00007f12fb1d1000,
0x00007f12fb35f000, 0x00007f12fb368000, 0x00007f123005a800, 0x00007f1240005800,
0x00007f1248001800, 0x00007f1280268000, 0x00007f1248006000
}
"main" #1 prio=5 os_prio=0 cpu=2154.42ms elapsed=292.81s tid=0x00007f12f8019800 nid=0x1f waiting on condition [0x00007f12fdf44000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@11.0.21/Native Method)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:316)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=1.31ms elapsed=292.78s tid=0x00007f12f8b8e800 nid=0x26 waiting on condition [0x00007f12c3379000]
java.lang.Thread.State: RUNNABLE
at java.lang.ref.Reference.waitForReferencePendingList(java.base@11.0.21/Native Method)
at java.lang.ref.Reference.processPendingReferences(java.base@11.0.21/Reference.java:241)
at java.lang.ref.Reference$ReferenceHandler.run(java.base@11.0.21/Reference.java:213)
"Finalizer" #3 daemon prio=8 os_prio=0 cpu=1.31ms elapsed=292.78s tid=0x00007f12f8b90800 nid=0x27 in Object.wait() [0x00007f12c3278000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.21/Native Method)
- waiting on <0x00000007fa8440d8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@11.0.21/ReferenceQueue.java:155)
- waiting to re-lock in wait() <0x00000007fa8440d8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@11.0.21/ReferenceQueue.java:176)
at java.lang.ref.Finalizer$FinalizerThread.run(java.base@11.0.21/Finalizer.java:170)
"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.30ms elapsed=292.78s tid=0x00007f12f8b97800 nid=0x28 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.17ms elapsed=292.78s tid=0x00007f12f8b99800 nid=0x29 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" #6 daemon prio=9 os_prio=0 cpu=2210.44ms elapsed=292.78s tid=0x00007f12f8b9b800 nid=0x2a waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
"C1 CompilerThread0" #9 daemon prio=9 os_prio=0 cpu=1144.05ms elapsed=292.78s tid=0x00007f12f8b9d800 nid=0x2b waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
No compile task
"Sweeper thread" #10 daemon prio=9 os_prio=0 cpu=0.08ms elapsed=292.78s tid=0x00007f12f8b9f800 nid=0x2c runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Common-Cleaner" #11 daemon prio=8 os_prio=0 cpu=1.74ms elapsed=292.75s tid=0x00007f12f8bef000 nid=0x2e in Object.wait() [0x00007f12c29a9000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.21/Native Method)
- waiting on <0x00000007fa866368> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@11.0.21/ReferenceQueue.java:155)
- waiting to re-lock in wait() <0x00000007fa866368> (a java.lang.ref.ReferenceQueue$Lock)
at jdk.internal.ref.CleanerImpl.run(java.base@11.0.21/CleanerImpl.java:148)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
at jdk.internal.misc.InnocuousThread.run(java.base@11.0.21/InnocuousThread.java:161)
"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #15 daemon prio=5 os_prio=0 cpu=0.24ms elapsed=291.03s tid=0x00007f12faf2f800 nid=0x3c in Object.wait() [0x00007f12c0dd4000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.21/Native Method)
- waiting on <0x00000007fa888310> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@11.0.21/ReferenceQueue.java:155)
- waiting to re-lock in wait() <0x00000007fa888310> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(java.base@11.0.21/ReferenceQueue.java:176)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3712)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-17" #17 daemon prio=5 os_prio=0 cpu=19.29ms elapsed=290.70s tid=0x00007f12fb1c5000 nid=0x3d runnable [0x00007f12c06d3000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@11.0.21/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.21/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.21/SelectorImpl.java:124)
- locked <0x00000007fa866580> (a sun.nio.ch.Util$2)
- locked <0x00000007fa866528> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@11.0.21/SelectorImpl.java:141)
at org.sparkproject.jetty.io.ManagedSelector.nioSelect(ManagedSelector.java:183)
at org.sparkproject.jetty.io.ManagedSelector.select(ManagedSelector.java:190)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:606)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:543)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:362)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:186)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-19-acceptor-0@4594c658-ServerConnector@4b53209a{HTTP/1.1, (http/1.1)}{0.0.0.0:18080}" #19 daemon prio=3 os_prio=0 cpu=17.86ms elapsed=290.69s tid=0x00007f12fb1c8000 nid=0x3f runnable [0x00007f12c04d1000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.ServerSocketChannelImpl.accept0(java.base@11.0.21/Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(java.base@11.0.21/ServerSocketChannelImpl.java:533)
at sun.nio.ch.ServerSocketChannelImpl.accept(java.base@11.0.21/ServerSocketChannelImpl.java:285)
at org.sparkproject.jetty.server.ServerConnector.accept(ServerConnector.java:388)
at org.sparkproject.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:704)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-20" #20 daemon prio=5 os_prio=0 cpu=26.00ms elapsed=290.69s tid=0x00007f12fb1ca000 nid=0x40 runnable [0x00007f12c03d0000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@11.0.21/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.21/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.21/SelectorImpl.java:124)
- locked <0x00000007fa8443f0> (a sun.nio.ch.Util$2)
- locked <0x00000007fa844398> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@11.0.21/SelectorImpl.java:141)
at org.sparkproject.jetty.io.ManagedSelector.nioSelect(ManagedSelector.java:183)
at org.sparkproject.jetty.io.ManagedSelector.select(ManagedSelector.java:190)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:606)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:543)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:362)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:186)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-21" #21 daemon prio=5 os_prio=0 cpu=28.79ms elapsed=290.70s tid=0x00007f12fb1cc000 nid=0x41 waiting on condition [0x00007f12c02cf000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa888530> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.21/AbstractQueuedSynchronizer.java:2123)
at org.sparkproject.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:382)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.idleJobPoll(QueuedThreadPool.java:974)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1018)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-22" #22 daemon prio=5 os_prio=0 cpu=52.92ms elapsed=290.69s tid=0x00007f12fb1cd800 nid=0x42 waiting on condition [0x00007f12c01ce000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa823470> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.21/SynchronousQueue.java:462)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.21/SynchronousQueue.java:361)
at java.util.concurrent.SynchronousQueue.poll(java.base@11.0.21/SynchronousQueue.java:937)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.reservedWait(ReservedThreadExecutor.java:324)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:399)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-24" #24 daemon prio=5 os_prio=0 cpu=28.85ms elapsed=290.69s tid=0x00007f12fb1d1000 nid=0x44 waiting on condition [0x00007f1247efd000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa823470> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.21/SynchronousQueue.java:462)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.21/SynchronousQueue.java:361)
at java.util.concurrent.SynchronousQueue.poll(java.base@11.0.21/SynchronousQueue.java:937)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.reservedWait(ReservedThreadExecutor.java:324)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:399)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"IPC Parameter Sending Thread #0" #26 daemon prio=5 os_prio=0 cpu=60.27ms elapsed=290.51s tid=0x00007f12fb35f000 nid=0x46 waiting on condition [0x00007f12476d1000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa50d290> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.21/SynchronousQueue.java:462)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.21/SynchronousQueue.java:361)
at java.util.concurrent.SynchronousQueue.poll(java.base@11.0.21/SynchronousQueue.java:937)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.21/ThreadPoolExecutor.java:1053)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"spark-history-task-0" #27 daemon prio=5 os_prio=0 cpu=71.26ms elapsed=290.49s tid=0x00007f12fb368000 nid=0x47 waiting on condition [0x00007f12473d0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa6fc4a0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.21/AbstractQueuedSynchronizer.java:2123)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@11.0.21/ScheduledThreadPoolExecutor.java:1182)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@11.0.21/ScheduledThreadPoolExecutor.java:899)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.21/ThreadPoolExecutor.java:1054)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"IPC Client (137659163) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA" #28 daemon prio=5 os_prio=0 cpu=71.32ms elapsed=280.46s tid=0x00007f123005a800 nid=0x48 in Object.wait() [0x00007f12477d2000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.21/Native Method)
- waiting on <0x00000007ff36e8f0> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1053)
- waiting to re-lock in wait() <0x00000007ff36e8f0> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1097)
"HistoryServerUI-JettyScheduler-1" #29 daemon prio=5 os_prio=0 cpu=6.12ms elapsed=273.97s tid=0x00007f1240005800 nid=0x49 waiting on condition [0x00007f12c284c000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa823180> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.21/AbstractQueuedSynchronizer.java:2123)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@11.0.21/ScheduledThreadPoolExecutor.java:1182)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@11.0.21/ScheduledThreadPoolExecutor.java:899)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.21/ThreadPoolExecutor.java:1054)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-30" #30 daemon prio=5 os_prio=0 cpu=39.29ms elapsed=273.93s tid=0x00007f1248001800 nid=0x4a waiting on condition [0x00007f12c274b000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.21/Native Method)
- parking to wait for <0x00000007fa823470> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.21/LockSupport.java:234)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.base@11.0.21/SynchronousQueue.java:462)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@11.0.21/SynchronousQueue.java:361)
at java.util.concurrent.SynchronousQueue.poll(java.base@11.0.21/SynchronousQueue.java:937)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.reservedWait(ReservedThreadExecutor.java:324)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:399)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-31" #31 daemon prio=5 os_prio=0 cpu=38.29ms elapsed=273.93s tid=0x00007f1280268000 nid=0x4b runnable [0x00007f1247bfc000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@11.0.21/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.21/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.21/SelectorImpl.java:124)
- locked <0x00000007fa8229b0> (a sun.nio.ch.Util$2)
- locked <0x00000007fa822958> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@11.0.21/SelectorImpl.java:141)
at org.sparkproject.jetty.io.ManagedSelector.nioSelect(ManagedSelector.java:183)
at org.sparkproject.jetty.io.ManagedSelector.select(ManagedSelector.java:190)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:606)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:543)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:362)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:186)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"HistoryServerUI-34" #34 daemon prio=5 os_prio=0 cpu=7.23ms elapsed=63.98s tid=0x00007f1248006000 nid=0x51 runnable [0x00007f12c05d2000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base@11.0.21/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.21/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.21/SelectorImpl.java:124)
- locked <0x00000007fa8cc488> (a sun.nio.ch.Util$2)
- locked <0x00000007fa8cc430> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base@11.0.21/SelectorImpl.java:141)
at org.sparkproject.jetty.io.ManagedSelector.nioSelect(ManagedSelector.java:183)
at org.sparkproject.jetty.io.ManagedSelector.select(ManagedSelector.java:190)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:606)
at org.sparkproject.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:543)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:362)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:186)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
"VM Thread" os_prio=0 cpu=93.44ms elapsed=292.79s tid=0x00007f12f8b8b800 nid=0x25 runnable
"GC Thread#0" os_prio=0 cpu=55.00ms elapsed=292.81s tid=0x00007f12f8031000 nid=0x20 runnable
"GC Thread#1" os_prio=0 cpu=55.63ms elapsed=291.32s tid=0x00007f12b4001000 nid=0x34 runnable
"GC Thread#2" os_prio=0 cpu=46.13ms elapsed=291.32s tid=0x00007f12b4002000 nid=0x35 runnable
"GC Thread#3" os_prio=0 cpu=59.08ms elapsed=291.32s tid=0x00007f12b4003800 nid=0x36 runnable
"GC Thread#4" os_prio=0 cpu=58.77ms elapsed=291.32s tid=0x00007f12b4005800 nid=0x37 runnable
"GC Thread#5" os_prio=0 cpu=55.57ms elapsed=291.32s tid=0x00007f12b4007000 nid=0x38 runnable
"GC Thread#6" os_prio=0 cpu=55.33ms elapsed=291.32s tid=0x00007f12b4009000 nid=0x39 runnable
"GC Thread#7" os_prio=0 cpu=54.61ms elapsed=291.32s tid=0x00007f12b400a800 nid=0x3a runnable
"G1 Main Marker" os_prio=0 cpu=1.38ms elapsed=292.81s tid=0x00007f12f8053000 nid=0x21 runnable
"G1 Conc#0" os_prio=0 cpu=51.00ms elapsed=292.81s tid=0x00007f12f8055000 nid=0x22 runnable
"G1 Conc#1" os_prio=0 cpu=45.61ms elapsed=291.31s tid=0x00007f12c4001000 nid=0x3b runnable
"G1 Refine#0" os_prio=0 cpu=0.16ms elapsed=292.80s tid=0x00007f12f8b58000 nid=0x23 runnable
"G1 Young RemSet Sampling" os_prio=0 cpu=45.92ms elapsed=292.80s tid=0x00007f12f8b5a000 nid=0x24 runnable
"VM Periodic Task Thread" os_prio=0 cpu=199.62ms elapsed=292.76s tid=0x00007f12f8be8800 nid=0x2d waiting on condition
JNI global refs: 20, weak refs: 0
Heap
garbage-first heap total 8388608K, used 75734K [0x0000000600000000, 0x0000000800000000)
region size 4096K, 19 young (77824K), 6 survivors (24576K)
Metaspace used 39853K, capacity 40854K, committed 41216K, reserved 1085440K
class space used 5315K, capacity 5794K, committed 5888K, reserved 1048576K
```
We see the following thread:
```
"IPC Client (137659163) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA" #28 daemon prio=5 os_prio=0 cpu=71.32ms elapsed=280.46s tid=0x00007f123005a800 nid=0x48 in Object.wait() [0x00007f12477d2000]
```
I initially thought that this thread was the one blocking on the "large" files, but it seems that it's a long-running connection, not a data transfer thread.
I had already taken another thread dump of the Spark History Server (at a time when 2 large files were in HDFS), and wrote a small Python script that nicely formats the thread names and their associated states:
```python
import sys

logfile = sys.argv[1]
thread_states = {}
for line in open(logfile):
    # Thread headers look like: "thread-name" #N daemon prio=5 ...
    if line.startswith('"'):
        thread_name = line.split('"')[1]
        thread_states[thread_name] = None
    # State lines look like: java.lang.Thread.State: TIMED_WAITING (parking)
    elif line.strip().startswith('java.lang.Thread.State: '):
        thread_states[thread_name] = line.split(": ")[1].strip()
# Drop VM/GC threads, which report no java.lang.Thread.State
thread_states = {k: v for k, v in thread_states.items() if v is not None}
for thread, state in sorted(thread_states.items()):
    print(thread, state)
```
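Run against each dump (e.g. `python thread_states.py dump.txt` — the script and file names here are just placeholders), it prints one sorted `thread-name state` line per thread, so two dumps compare cleanly with a plain `diff`.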
This allowed me to print a diff of both JVM states:
```diff
--- thread-state-empty-hdfs.txt 2024-01-05 10:08:18
+++ thread-state-shs.txt 2024-01-05 10:08:27
@@ -2,22 +2,25 @@
C2 CompilerThread0 RUNNABLE
Common-Cleaner TIMED_WAITING (on object monitor)
Finalizer WAITING (on object monitor)
-HistoryServerUI-17 RUNNABLE
-HistoryServerUI-19-acceptor-0@4594c658-ServerConnector@4b53209a{HTTP/1.1, (http/1.1)}{0.0.0.0:18080} RUNNABLE
-HistoryServerUI-20 RUNNABLE
-HistoryServerUI-21 TIMED_WAITING (parking)
-HistoryServerUI-22 TIMED_WAITING (parking)
-HistoryServerUI-24 TIMED_WAITING (parking)
-HistoryServerUI-30 TIMED_WAITING (parking)
-HistoryServerUI-31 RUNNABLE
-HistoryServerUI-34 RUNNABLE
+HistoryServerUI-24-acceptor-0@49f6bda8-ServerConnector@47315ae6{HTTP/1.1, (http/1.1)}{0.0.0.0:18080} RUNNABLE
+HistoryServerUI-44 RUNNABLE
+HistoryServerUI-46 TIMED_WAITING (parking)
+HistoryServerUI-49 RUNNABLE
+HistoryServerUI-51 TIMED_WAITING (parking)
+HistoryServerUI-52 TIMED_WAITING (parking)
+HistoryServerUI-53 TIMED_WAITING (parking)
+HistoryServerUI-54 RUNNABLE
+HistoryServerUI-55 RUNNABLE
HistoryServerUI-JettyScheduler-1 TIMED_WAITING (parking)
-IPC Client (137659163) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA TIMED_WAITING (on object monitor)
+IPC Client (1196716338) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA TIMED_WAITING (on object monitor)
IPC Parameter Sending Thread #0 TIMED_WAITING (parking)
Reference Handler RUNNABLE
Service Thread RUNNABLE
Signal Dispatcher RUNNABLE
Sweeper thread RUNNABLE
+log-replay-executor-0 WAITING (parking)
+log-replay-executor-1 WAITING (parking)
main TIMED_WAITING (sleeping)
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner WAITING (on object monitor)
+org.apache.hadoop.hdfs.PeerCache@62a7c7ba TIMED_WAITING (sleeping)
spark-history-task-0 TIMED_WAITING (parking)
```
We see that two `log-replay-executor` threads are in the WAITING state:
```diff
+log-replay-executor-0 WAITING (parking)
+log-replay-executor-1 WAITING (parking)
```
I'm not sure whether they are blocked on receiving the file, or whether receiving the file failed and the threads are simply waiting for something else to do.

Event Timeline

I created a large Spark event log by submitting the following Spark SQL job:

```python
import wmfdata as wmf

ss = wmf.spark.create_custom_session(
    master="yarn",
    spark_config={
        "spark.driver.memory": "2g",
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "8g",
        "spark.executor.cores": 4,
        "spark.sql.shuffle.partitions": 256,
        "spark.eventLog.enabled": True,
        "spark.eventLog.dir": "hdfs:///var/log/spark",
        "spark.yarn.historyServer.address": "yarn.wikimedia.org",
        "spark.eventLog.compress": True,
    }
)
ss.sql("""SELECT count(1) as count from wmf.webrequest WHERE year=2024 AND month=1 AND day=4""").show(1)
ss.stop()
```

This created the following event log in HDFS, which is only 6MB compressed:

```
-rw-rw----   3 brouberol analytics-privatedata-users    6481357 2024-01-05 09:50 /var/log/spark/application_1695896957545_627058.lz4
```

I then see the following log entries in the spark history server logs:

```
24/01/05 09:49:36 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4.inprogress for listing data...
24/01/05 09:49:46 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4.inprogress for listing data...
24/01/05 09:49:56 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4.inprogress for listing data...
24/01/05 09:50:06 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4.inprogress for listing data...
24/01/05 09:50:16 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4.inprogress for listing data...
24/01/05 09:50:26 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4 for listing data...
```

Note: we don't see any `Finished parsing hdfs://...` log line at all.

We see the following HDFS audit events on the an-master1001.eqiad.wmnet node:

```
hadoop-hdfs-namenode-an-master1001.log:2024-01-05 09:49:17,092 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_2418027244_1344374340, replicas=10.64.142.2:50010, 10.64.142.3:50010, 10.64.143.2:50010 for /var/log/spark/application_1695896957545_627058.lz4.inprogress
hadoop-hdfs-namenode-an-master1001.log:2024-01-05 09:49:17,148 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync: /var/log/spark/application_1695896957545_627058.lz4.inprogress for DFSClient_NONMAPREDUCE_-820340752_28
hadoop-hdfs-namenode-an-master1001.log:2024-01-05 09:50:23,872 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /var/log/spark/application_1695896957545_627058.lz4.inprogress is closed by DFSClient_NONMAPREDUCE_-820340752_28
hdfs-audit.log:2024-01-05 09:49:17,067 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=create	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=brouberol:analytics-privatedata-users:rw-r-----	proto=rpc
hdfs-audit.log:2024-01-05 09:49:17,074 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=setPermission	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=brouberol:analytics-privatedata-users:rw-rw----	proto=rpc
hdfs-audit.log:2024-01-05 09:49:26,400 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:49:36,403 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:49:36,421 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:49:46,424 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:49:46,439 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:49:56,442 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:49:56,463 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:50:06,464 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:50:06,478 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:50:16,481 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:50:16,495 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:50:23,877 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=getfileinfo	src=/var/log/spark/application_1695896957545_627058.lz4	dst=null	perm=null	proto=rpc
hdfs-audit.log:2024-01-05 09:50:23,879 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=rename	src=/var/log/spark/application_1695896957545_627058.lz4.inprogress	dst=/var/log/spark/application_1695896957545_627058.lz4	perm=brouberol:analytics-privatedata-users:rw-rw----	proto=rpc
hdfs-audit.log:2024-01-05 09:50:23,882 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=setTimes	src=/var/log/spark/application_1695896957545_627058.lz4	dst=null	perm=brouberol:analytics-privatedata-users:rw-rw----	proto=rpc
hdfs-audit.log:2024-01-05 09:50:26,498 INFO FSNamesystem.audit: allowed=true	ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)	ip=/10.67.29.202	cmd=open	src=/var/log/spark/application_1695896957545_627058.lz4	dst=null	perm=null	proto=rpc
```

The last entry indicates that the Spark History Server attempted to open the file, which was allowed (log reformatted for convenience):

```
hdfs-audit.log:2024-01-05 09:50:26,498 INFO FSNamesystem.audit: 
  allowed=true
  ugi=spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS)
  ip=/10.67.29.202
  cmd=open
  src=/var/log/spark/application_1695896957545_627058.lz4
  dst=null
  perm=null
  proto=rpc
```

HDFS metadata does not seem to be the issue. Note, however, that the file transfer does not occur between the an-master host and the container: the HDFS client only gets block locations from the namenode over RPC, and the actual data is transferred directly from an an-worker datanode, on TCP port 50010.

I ran a "small" spark job to see what a successful transfer looked liked in a tcpdump.

When the SHS logged

```
24/01/05 10:09:57 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627122.lz4 for listing data...
24/01/05 10:09:57 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627122.lz4
```

the `tcpdump tcp and dst 10.67.29.202 and port 50010` command I had running on the dse-k8s-worker host on which the SHS container was running output the following packets at the same time:

```
10:09:57.222051 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [S.], seq 882725287, ack 3066083696, win 43440, options [mss 1460,sackOK,TS val 299926694 ecr 3251168128,nop,wscale 9], length 0
10:09:57.222591 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [.], ack 2, win 85, options [nop,nop,TS val 299926695 ecr 3251168128], length 0
10:09:57.222614 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [.], ack 3, win 85, options [nop,nop,TS val 299926695 ecr 3251168128], length 0
10:09:57.222626 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [.], ack 4, win 85, options [nop,nop,TS val 299926695 ecr 3251168128], length 0
10:09:57.222644 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [.], ack 5, win 85, options [nop,nop,TS val 299926695 ecr 3251168128], length 0
10:09:57.222663 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [.], ack 10, win 85, options [nop,nop,TS val 299926695 ecr 3251168128], length 0
10:09:57.222915 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 1:151, ack 10, win 85, options [nop,nop,TS val 299926695 ecr 3251168128], length 150
10:09:57.223851 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 151:317, ack 407, win 85, options [nop,nop,TS val 299926696 ecr 3251168129], length 166
10:09:57.225631 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 317:331, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168131], length 14
10:09:57.225713 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 331:7571, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168131], length 7240
10:09:57.225714 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 7571:8523, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168131], length 952
10:09:57.225717 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 8523:12867, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168131], length 4344
10:09:57.226037 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 12867:27347, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168132], length 14480
10:09:57.226041 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 27347:33139, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168132], length 5792
10:09:57.226045 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [P.], seq 33139:40746, ack 647, win 85, options [nop,nop,TS val 299926698 ecr 3251168132], length 7607
10:09:57.231328 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [S.], seq 1665626738, ack 823922232, win 43440, options [mss 1460,sackOK,TS val 3147456826 ecr 1844699159,nop,wscale 9], length 0
10:09:57.231757 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [.], ack 2, win 85, options [nop,nop,TS val 3147456826 ecr 1844699159], length 0
10:09:57.231780 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [.], ack 3, win 85, options [nop,nop,TS val 3147456826 ecr 1844699159], length 0
10:09:57.231802 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [.], ack 4, win 85, options [nop,nop,TS val 3147456826 ecr 1844699159], length 0
10:09:57.231802 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [.], ack 5, win 85, options [nop,nop,TS val 3147456826 ecr 1844699159], length 0
10:09:57.231818 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [.], ack 10, win 85, options [nop,nop,TS val 3147456826 ecr 1844699160], length 0
10:09:57.232229 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 1:151, ack 10, win 85, options [nop,nop,TS val 3147456827 ecr 1844699160], length 150
10:09:57.233563 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 151:317, ack 407, win 85, options [nop,nop,TS val 3147456828 ecr 1844699161], length 166
10:09:57.235620 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 317:331, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 14
10:09:57.235711 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 331:7571, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 7240
10:09:57.235713 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 7571:8523, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 952
10:09:57.235715 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 8523:12867, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 4344
10:09:57.235929 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 12867:27347, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699164], length 14480
10:09:57.235932 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 27347:33139, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699164], length 5792
10:09:57.235936 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 33139:40746, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699164], length 7607
10:09:57.267101 IP an-worker1138.eqiad.wmnet.50010 > 10.67.29.202.46452: Flags [.], ack 650, win 85, options [nop,nop,TS val 299926740 ecr 3251168132], length 0
10:09:57.279279 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [.], ack 650, win 85, options [nop,nop,TS val 3147456874 ecr 1844699164], length 0
```

The data transfer packets being:

```
10:09:57.232229 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 1:151, ack 10, win 85, options [nop,nop,TS val 3147456827 ecr 1844699160], length 150
10:09:57.233563 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 151:317, ack 407, win 85, options [nop,nop,TS val 3147456828 ecr 1844699161], length 166
10:09:57.235620 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 317:331, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 14
10:09:57.235711 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 331:7571, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 7240
10:09:57.235713 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 7571:8523, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 952
10:09:57.235715 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 8523:12867, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699163], length 4344
10:09:57.235929 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 12867:27347, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699164], length 14480
10:09:57.235932 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 27347:33139, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699164], length 5792
10:09:57.235936 IP an-worker1134.eqiad.wmnet.50010 > 10.67.29.202.47278: Flags [P.], seq 33139:40746, ack 647, win 85, options [nop,nop,TS val 3147456830 ecr 1844699164], length 7607
```
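
As a sanity check on volumes, here is a small sketch of mine (assuming the tcpdump text format shown above) that sums the TCP payload `length` fields per source host, so the bytes actually received can be compared with the event log file size:

```python
import re
import sys

# Sum TCP payload bytes per source host from tcpdump text lines like the
# ones above: "IP <src>.<port> > <dst>.<port>: Flags [...], ..., length <N>"
totals = {}
pattern = re.compile(r"IP (\S+?)\.\d+ > \S+: Flags .*length (\d+)$")
for line in sys.stdin:
    match = pattern.search(line.strip())
    if match:
        src, length = match.group(1), int(match.group(2))
        totals[src] = totals.get(src, 0) + length

for src, total in sorted(totals.items()):
    print(f"{src}: {total} bytes")
```

Feeding it the capture text of a stuck transfer and of a successful one should show how far each stream got before stalling.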

Let's now have a look at what's happening in the case of a "large" event log file.

Is it possible that we missed a NetworkPolicy from the list?
Maybe, instead of transferring the HDFS file from a single an-worker datanode, it is trying to fetch it in a distributed way from several hosts, but failing to connect to one or more of them?

If you run your tcpdump again and capture traffic on port 50010 to any host, what does this look like?

I'm now looking at what a tcpdump looks like for a large file:

```
10:38:58.274479 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [.], ack 10, win 85, options [nop,nop,TS val 901748873 ecr 116319192], length 0
10:38:58.274733 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1:151, ack 10, win 85, options [nop,nop,TS val 901748873 ecr 116319192], length 150
10:38:58.275997 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 151:317, ack 407, win 85, options [nop,nop,TS val 901748874 ecr 116319193], length 166
10:38:58.277593 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 317:331, ack 649, win 85, options [nop,nop,TS val 901748876 ecr 116319195], length 14
10:38:58.277806 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 331:7571, ack 649, win 85, options [nop,nop,TS val 901748876 ecr 116319195], length 7240
10:38:58.277807 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 7571:8523, ack 649, win 85, options [nop,nop,TS val 901748876 ecr 116319195], length 952
10:38:58.277838 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 8523:12867, ack 649, win 85, options [nop,nop,TS val 901748876 ecr 116319195], length 4344
10:38:58.278002 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 12867:24451, ack 649, win 85, options [nop,nop,TS val 901748876 ecr 116319196], length 11584
10:38:58.278027 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 24451:33139, ack 649, win 85, options [nop,nop,TS val 901748876 ecr 116319196], length 8688
10:38:58.278067 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 33139:41827, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8688
10:38:58.278117 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 41827:53411, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 11584
10:38:58.278179 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 53411:64995, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 11584
10:38:58.278182 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 64995:73683, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8688
10:38:58.278184 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 73683:82371, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8688
10:38:58.278188 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 82371:93955, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 11584
10:38:58.278190 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 93955:99747, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 5792
10:38:58.278222 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 99747:114227, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 14480
10:38:58.278225 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 114227:122915, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8688
10:38:58.278368 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 122915:163459, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 40544
10:38:58.278370 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 163459:172147, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8688
10:38:58.278437 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [.], seq 172147:219931, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 47784
10:38:58.278440 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 219931:229707, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 9776
10:38:58.278564 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 229707:261563, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 31856
10:38:58.278834 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [.], seq 261563:309347, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 47784
10:38:58.278840 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 309347:326723, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 17376
10:38:58.278846 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 326723:344395, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 17672
10:38:58.278849 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 344395:352587, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8192
10:38:58.278852 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 352587:360779, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8192
10:38:58.278896 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 360779:368971, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319196], length 8192
10:38:58.279036 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 368971:409931, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319197], length 40960
10:38:58.279039 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 409931:418123, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319197], length 8192
10:38:58.279098 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 418123:434507, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319197], length 16384
10:38:58.279100 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 434507:442699, ack 649, win 85, options [nop,nop,TS val 901748877 ecr 116319197], length 8192
...
10:38:58.280563 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1032523:1040715, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 8192
10:38:58.280598 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1040715:1048907, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 8192
10:38:58.280601 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1048907:1057099, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 8192
10:38:58.280682 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1057099:1065291, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 8192
10:38:58.280685 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1065291:1073483, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 8192
10:38:58.280853 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [.], seq 1073483:1121267, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 47784
10:38:58.280857 IP an-worker1140.eqiad.wmnet.50010 > 10.67.29.202.54266: Flags [P.], seq 1121267:1134299, ack 649, win 85, options [nop,nop,TS val 901748879 ecr 116319198], length 13032
```

We see that the last packet is a PUSH/ACK, and then nothing else, as if something were hanging. I now need to look at the datanode logs to see whether the transfer failed.

```
brouberol@an-master1001:~$ sudo kerberos-run-command hdfs hdfs fsck /var/log/spark/application_1695896957545_627204.lz4 -files -blocks -locations
Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&files=1&blocks=1&locations=1&path=%2Fvar%2Flog%2Fspark%2Fapplication_1695896957545_627204.lz4
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.64.5.26 for path /var/log/spark/application_1695896957545_627204.lz4 at Fri Jan 05 10:40:45 UTC 2024
/var/log/spark/application_1695896957545_627204.lz4 6271896 bytes, 1 block(s):  OK
0. BP-1552854784-10.64.21.110-1405114489661:blk_2418035632_1344382728 len=6271896 Live_repl=3 [DatanodeInfoWithStorage[10.64.36.108:50010,DS-44d55bb6-2ca5-4643-9ce7-560826a88f84,DISK], DatanodeInfoWithStorage[10.64.5.9:50010,DS-d133bd5b-cc20-492f-8a3d-dfd1e7cc9eb1,DISK], DatanodeInfoWithStorage[10.64.5.15:50010,DS-2be3a394-4702-4f35-8b5b-40dd5cd287c8,DISK]]

Status: HEALTHY
 Total size:	6271896 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 6271896 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		86
 Number of racks:		23
FSCK ended at Fri Jan 05 10:40:45 UTC 2024 in 1 milliseconds


The filesystem under path '/var/log/spark/application_1695896957545_627204.lz4' is HEALTHY
```

The file has a block ID of 2418035632_1344382728.

```
brouberol@an-worker1140:~$ grep 2418035632_1344382728 /var/log/hadoop-hdfs/*.{out,log,audit}
/var/log/hadoop-hdfs/hadoop-hdfs-datanode-an-worker1140.log:2024-01-05 10:37:21,654 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1552854784-10.64.21.110-1405114489661:blk_2418035632_1344382728 src: /10.64.53.30:45178 dest: /10.64.5.15:50010

/var/log/hadoop-hdfs/hadoop-hdfs-datanode-an-worker1140.log:2024-01-05 10:38:48,595 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.64.53.30:45178, dest: /10.64.5.15:50010, bytes: 6271896, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-820340752_28, offset: 0, srvID: 708d9cfd-63fc-44ec-ba52-86dc88f38fd6, blockid: BP-1552854784-10.64.21.110-1405114489661:blk_2418035632_1344382728, duration(ns): 86791139377

/var/log/hadoop-hdfs/hadoop-hdfs-datanode-an-worker1140.log:2024-01-05 10:38:48,595 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1552854784-10.64.21.110-1405114489661:blk_2418035632_1344382728, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[10.64.5.9:50010, 10.64.36.108:50010] terminating
```

The timing does not match though, as 10:38:48 lines up with

```
24/01/05 10:37:48 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627204.lz4.inprogress
```

which itself lines up with the following audit log entries:

```
hdfs-audit.log:2024-01-05 10:38:48,603 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=rename	src=/var/log/spark/application_1695896957545_627204.lz4.inprogress	dst=/var/log/spark/application_1695896957545_627204.lz4	perm=brouberol:analytics-privatedata-users:rw-rw----	proto=rpc
hdfs-audit.log:2024-01-05 10:38:48,607 INFO FSNamesystem.audit: allowed=true	ugi=brouberol@WIKIMEDIA (auth:TOKEN)	ip=/10.64.53.30	cmd=setTimes	src=/var/log/spark/application_1695896957545_627204.lz4	dst=null	perm=brouberol:analytics-privatedata-users:rw-rw----	proto=rpc
```

So, I _think_ what happened was:

  • the spark job started
  • the spark history server started to parse inprogress event log files
  • while the inprogress file was being parsed, the spark job completed, causing the streaming of the inprogress file to simply stop
  • we see no trace in the HDFS logs of a retrieval of the same (now renamed) file 10 seconds later.

In other words, this is a race condition.

What we could try is:

  • setting spark.history.fs.inProgressOptimization.enabled to false
  • and possibly, increasing the spark.history.fs.update.interval from 10s to, say, 5min.

I'm not sure the second is necessary, but the inProgressOptimization strategy coupled with fast polling of HDFS seems to lead to race conditions, AFAICS.
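
Concretely, that would look something like this in the history server's spark-defaults.conf (a sketch: both keys are standard Spark History Server settings, but the file path is an assumption about our deployment):

```
# sketch: append the two proposed settings to the SHS configuration
cat >> /opt/spark/conf/spark-defaults.conf <<'EOF'
spark.history.fs.inProgressOptimization.enabled  false
spark.history.fs.update.interval                 5m
EOF
```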

Actually, I seem to be wrong in my recommendations. I restarted the application, causing it to re-parse all event log files from HDFS, and I see:

24/01/05 11:04:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://spark-history-analytics-hadoop-57774468c-gw2t5:18080
24/01/05 11:04:11 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627204.lz4 for listing data...
24/01/05 11:04:11 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627243.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627122.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627155.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627079.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627122.lz4
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627058.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627055.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627055.lz4
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627045.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627079.lz4
24/01/05 11:04:12 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627042.lz4 for listing data...
24/01/05 11:04:12 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627045.lz4
24/01/05 11:04:12 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1695896957545_627042.lz4

Each small file has a corresponding `Finished parsing` log line, while each "large" one does not.
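
A quick way to confirm this systematically is to diff the set of files for which parsing started against the set for which it finished (a sketch: shs.log is an assumed dump of the history server logs):

```
# sketch: extract the event log paths from both log lines, then list the
# paths that started parsing but never got a "Finished parsing" line
grep -oP 'INFO FsHistoryProvider: Parsing \K\S+(?= for listing data)' shs.log | sort -u > started
grep -oP 'INFO FsHistoryProvider: Finished parsing \K\S+' shs.log | sort -u > finished
comm -23 started finished
```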

I have taken an strace and a tcpdump while the SHS was running, and focused on the following log line:

24/01/10 11:23:54 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5166.lz4 for listing data...
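
For reference, the captures were taken along these lines (a sketch: the flags, file names and target datanode IP are assumptions, not the exact commands used; the PID is the one observed further down):

```
# sketch: trace the SHS JVM and all its threads, with per-second timestamps
# and longer string dumps
sudo strace -f -t -s 300 -o shs-strace.log -p 3247327

# sketch: capture the traffic between the SHS and the datanode's
# data-transfer (50010) and IPC (50020) ports
sudo tcpdump -i any -w shs.pcap 'host 10.64.142.3 and (port 50010 or port 50020)'
```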

The tcpdump showed a Reset ([R.]) flag from the SHS IP to the an-worker host IP, ending the data stream the worker was sending to the SHS:

image.png (1×1 px, 810 KB)
Screenshot from 2024-01-10 14-17-42.png (2×1 px, 626 KB)
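
The RST segments can be isolated from the capture with a standard tcpdump flag filter (a sketch, assuming the shs.pcap file from the capture sketch above):

```
# sketch: read the capture back and keep only segments with the RST flag set
tcpdump -nn -r shs.pcap 'tcp[tcpflags] & tcp-rst != 0'
```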

When focusing on the strace output, we target the PID that wrote the 24/01/10 11:18:54 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5166.lz4.inprogress for listing data... log line, and see the following (truncated to 300 characters of width for readability):

3247648 11:18:54 write(2, "24/01/10 11:18:54 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5166.lz4.inprogress for listing data...\n", 154) = 154
3247648 11:18:54 socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 312
3247648 11:18:54 setsockopt(312, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
3247648 11:18:54 setsockopt(312, SOL_TCP, TCP_NODELAY, [1], 4) = 0
3247648 11:18:54 setsockopt(312, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
3247648 11:18:54 fcntl(312, F_GETFL)    = 0x2 (flags O_RDWR)
3247648 11:18:54 fcntl(312, F_SETFL, O_RDWR|O_NONBLOCK) = 0
3247648 11:18:54 connect(312, {sa_family=AF_INET6, sin6_port=htons(50020), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.64.142.3", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLOUT, {u32=312, u64=30064771384}}) = 0
3247648 11:18:54 epoll_wait(313, [{EPOLLOUT, {u32=312, u64=30064771384}}], 1024, 20000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b6cd4) = 0
3247648 11:18:54 epoll_wait(313, [], 1024, 0) = 0
3247648 11:18:54 poll([{fd=312, events=POLLOUT}], 1, 0) = 1 ([{fd=312, revents=POLLOUT}])
3247648 11:18:54 getsockopt(312, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
3247648 11:18:54 getsockname(312, {sa_family=AF_INET6, sin6_port=htons(54346), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.67.29.205", &sin6_addr), sin6_scope_id=0}, [28]) = 0
3247648 11:18:54 getsockname(312, {sa_family=AF_INET6, sin6_port=htons(54346), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.67.29.205", &sin6_addr), sin6_scope_id=0}, [28]) = 0
3247648 11:18:54 fcntl(312, F_GETFL)    = 0x802 (flags O_RDWR|O_NONBLOCK)
3247648 11:18:54 fcntl(312, F_SETFL, O_RDWR) = 0
3247648 11:18:54 fcntl(312, F_GETFL)    = 0x2 (flags O_RDWR)
3247648 11:18:54 fcntl(312, F_SETFL, O_RDWR|O_NONBLOCK) = 0
3247648 11:18:54 write(312, "hrpc\t\0\337\0\0\0\16\n\10\2\20\0\30A\"\0(\1\2\20\1", 25 <unfinished ...>
3247648 11:18:54 <... write resumed>)   = 25
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 <... read resumed>0x7f706801e120, 8192) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLIN, {u32=312, u64=30064771384}}) = 0
3247648 11:18:54 epoll_wait(313,  <unfinished ...>
3247648 11:18:54 <... epoll_wait resumed>[{EPOLLIN, {u32=312, u64=30064771384}}], 1024, 60000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b51e4) = 0
3247648 11:18:54 epoll_wait(313, [], 1024, 0) = 0
3247648 11:18:54 read(312, "\0\0\1\1\16\10\337\377\377\377\17\20\0\30\t:\0@\1\360\1\20\1\"\266\1\n\5TOKEN\22\nDIGEST-MD5\32\0\"\7default*\225\1realm=\"default\",nonce=\"weB1DnOCOqZTkotDzS9nIuKDuBKXCwckPs5zk6y9\",qop=\"auth-conf\",charset=utf-8,cipher=\"3des,rc4,des,rc4-56,rc4-40\",algorithm=md5-sess
3247648 11:18:54 write(312, "\0\0\1\233\n\10\2\20\0\30A\"\0(\1\216\3\20\2\32\351\2charset=utf-8,username=\"igGM9T4PqIwqHfnQAAAABXNwYXJrAAAAKEJQLTE1NTI4NTQ3ODQtMTAuNjQuMjEuMTEwLTE0MDUxMTQ0ODk2NjGMkDu8OwEEUkVBRA==\",realm=\"default\",nonce=\"weB1DnOCOqZTkotDzS9nIuKDuBKXCwckPs5zk6y9\",nc=00000001,cnonc
3247648 11:18:54 read(312, 0x7f706801e120, 8192) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLIN, {u32=312, u64=30064771384}}) = 0
3247648 11:18:54 epoll_wait(313, [{EPOLLIN, {u32=312, u64=30064771384}}], 1024, 60000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b51e4) = 0
3247648 11:18:54 epoll_wait(313, [], 1024, 0) = 0
3247648 11:18:54 read(312, "\0\0\0<\16\10\337\377\377\377\17\20\0\30\t:\0@\1,\20\0\32(rspauth=75ba505dd18e60e87776debf14e61697", 8192) = 64
3247648 11:18:54 clone(child_stack=0x7f7109dfcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[375], tls=0x7f7109dfd700, child_tidptr=0x7f7109dfd9d0) = 375
3395559 11:18:54 read(312, "\0\0\0L\10\10\337\377\377\377\17\20\0B\20\5\32>d\4\320j\31\35\331\342\324\257\270\324\302\231\372e\221U\241\372 \257\f\16\356%\20\3648\320`t\344u\212\177\336\23\327_%\226\31\260\366}b\337\33s\372@\20Q\326\311\0\1\0\0\0\0", 8192) = 80
3247648 11:18:54 socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 312
3247648 11:18:54 setsockopt(312, SOL_IPV6, IPV6_V6ONLY, [0], 4 <unfinished ...>
3247648 11:18:54 <... setsockopt resumed>) = 0
3247648 11:18:54 fcntl(312, F_GETFL <unfinished ...>
3247648 11:18:54 <... fcntl resumed>)   = 0x2 (flags O_RDWR)
3247648 11:18:54 fcntl(312, F_SETFL, O_RDWR|O_NONBLOCK) = 0
3247648 11:18:54 connect(312, {sa_family=AF_INET6, sin6_port=htons(50010), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.64.142.3", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLOUT, {u32=312, u64=30064771384}} <unfinished ...>
3247648 11:18:54 <... epoll_ctl resumed>) = 0
3247648 11:18:54 epoll_wait(313,  <unfinished ...>
3247648 11:18:54 <... epoll_wait resumed>[{EPOLLOUT, {u32=312, u64=30064771384}}], 1024, 60000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b6ac4) = 0
3247648 11:18:54 epoll_wait(313, [], 1024, 0) = 0
3247648 11:18:54 poll([{fd=312, events=POLLOUT}], 1, 0 <unfinished ...>
3247648 11:18:54 <... poll resumed>)    = 1 ([{fd=312, revents=POLLOUT}])
3247648 11:18:54 getsockopt(312, SOL_SOCKET, SO_ERROR,  <unfinished ...>
3247648 11:18:54 <... getsockopt resumed>[0], [4]) = 0
3247648 11:18:54 getsockname(312,  <unfinished ...>
3247648 11:18:54 <... getsockname resumed>{sa_family=AF_INET6, sin6_port=htons(49208), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.67.29.205", &sin6_addr), sin6_scope_id=0}, [28]) = 0
3247648 11:18:54 getsockname(312,  <unfinished ...>
3247648 11:18:54 <... getsockname resumed>{sa_family=AF_INET6, sin6_port=htons(49208), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.67.29.205", &sin6_addr), sin6_scope_id=0}, [28]) = 0
3247648 11:18:54 fcntl(312, F_GETFL <unfinished ...>
3247648 11:18:54 <... fcntl resumed>)   = 0x802 (flags O_RDWR|O_NONBLOCK)
3247648 11:18:54 fcntl(312, F_SETFL, O_RDWR <unfinished ...>
3247648 11:18:54 <... fcntl resumed>)   = 0
3247648 11:18:54 setsockopt(312, SOL_TCP, TCP_NODELAY, [1], 4 <unfinished ...>
3247648 11:18:54 <... setsockopt resumed>) = 0
3247648 11:18:54 fcntl(312, F_GETFL <unfinished ...>
3247648 11:18:54 <... fcntl resumed>)   = 0x2 (flags O_RDWR)
3247648 11:18:54 fcntl(312, F_SETFL, O_RDWR|O_NONBLOCK) = 0
3247648 11:18:54 write(312, "\336", 1 <unfinished ...>
3247648 11:18:54 <... write resumed>)   = 1
3247648 11:18:54 write(312, "\255", 1)  = 1
3247648 11:18:54 write(312, "\276", 1 <unfinished ...>
3247648 11:18:54 <... write resumed>)   = 1
3247648 11:18:54 write(312, "\357", 1 <unfinished ...>
3247648 11:18:54 <... write resumed>)   = 1
3247648 11:18:54 write(312, "\4\10\0\22\0", 5) = 5
3247648 11:18:54 read(312, 0x7f706801e120, 1) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLIN, {u32=312, u64=30064771384}}) = 0
3247648 11:18:54 epoll_wait(313,  <unfinished ...>
3247648 11:18:54 <... epoll_wait resumed>[{EPOLLIN, {u32=312, u64=30064771384}}], 1024, 60000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b6484) = 0
3247648 11:18:54 epoll_wait(313,  <unfinished ...>
3247648 11:18:54 <... epoll_wait resumed>[], 1024, 0) = 0
3247648 11:18:54 read(312, "\224", 1)   = 1
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 <... read resumed>"\1", 1) = 1
3247648 11:18:54 read(312, "\10\0\22\217\1realm=\"0\",nonce=\"GnNhXfiZ1SaohZmKo0rDBKIshaxo5och/+Rbze4e\",qop=\"auth-conf\",charset=utf-8,cipher=\"3des,rc4,des,rc4-56,rc4-40\",algorithm=md5-sess", 148) = 148
3247648 11:18:54 write(312, "\213\3\10\0\22\202\3charset=utf-8,username=\"706607567 BP-1552854784-10.64.21.110-1405114489661 GIdSpBBaxQk=\",realm=\"0\",nonce=\"GnNhXfiZ1SaohZmKo0rDBKIshaxo5och/+Rbze4e\",nc=00000001,cnonce=\"iy06yydy6NrfcZj4/jmtea6A0Ol6KupA/DProF/p\",digest-uri=\"hdfs/0\",maxbuf=6553
3247648 11:18:54 read(312, 0x7f706801e120, 1) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLIN, {u32=312, u64=140119013065016}}) = 0
3247648 11:18:54 epoll_wait(313, [{EPOLLIN, {u32=312, u64=140119013065016}}], 1024, 60000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b6464) = 0
3247648 11:18:54 epoll_wait(313, [], 1024, 0) = 0
3247648 11:18:54 read(312, "\244", 1)   = 1
3247648 11:18:54 read(312, "\1", 1)     = 1
3247648 11:18:54 read(312, "\10\0\22(rspauth=6c85e51905d890ad108e6521668fdc14\"v\10\2\22&^\3746\n\f5\5S\354G\215\21\21s\355\302{\33\251\315-\26\3020\255\276\235\233DqYR\0\1\0\0\0\0\32\20\306W\21\270\2649Cf\371\214\367\26\224\267\330|\"&\332Nk\265j!\330\371d\333\24\2001\274\222~\212\361V%M\236\22\251
3247648 11:18:54 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP <unfinished ...>
3247648 11:18:54 <... socket resumed>)  = 316
3247648 11:18:54 ioctl(316, SIOCGIFCONF, {ifc_len= <unfinished ...>
3247648 11:18:54 <... ioctl resumed>2 * sizeof(struct ifreq), ifc_buf=NULL}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFCONF, {ifc_len=2 * sizeof(struct ifreq), ifc_buf=[{ifr_name="lo", ifr_addr={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}}, {ifr_name="eth0", ifr_addr={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.67.29.205")}}]}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP|IFF_LOOPBACK|IFF_RUNNING}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFNETMASK, {ifr_name="lo", ifr_netmask={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("255.0.0.0")}}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFINDEX, {ifr_name="lo", }) = 0
3247648 11:18:54 ioctl(316, SIOCGIFFLAGS, {ifr_name="eth0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFBRDADDR, {ifr_name="eth0", ifr_broadaddr={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFNETMASK, {ifr_name="eth0", ifr_netmask={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("255.255.255.255")}}) = 0
3247648 11:18:54 ioctl(316, SIOCGIFINDEX, {ifr_name="eth0", }) = 0
3247648 11:18:54 close(316)             = 0
3247648 11:18:54 socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP <unfinished ...>
3247648 11:18:54 <... socket resumed>)  = 316
3247648 11:18:54 openat(AT_FDCWD, "/proc/net/if_inet6", O_RDONLY <unfinished ...>
3247648 11:18:54 <... openat resumed>)  = 321
3247648 11:18:54 fstat(321,  <unfinished ...>
3247648 11:18:54 <... fstat resumed>{st_mode=S_IFREG|0444, st_size=0, ...}) = 0
3247648 11:18:54 read(321,  <unfinished ...>
3247648 11:18:54 <... read resumed>"262000000861030222de3a32fa8abcec 03 80 00 80     eth0\n00000000000000000000000000000001 01 80 10 80       lo\nfe80000000000000a07cebfffe6df677 03 40 20 80     eth0\n", 1024) = 162
3247648 11:18:54 read(321,  <unfinished ...>
3247648 11:18:54 <... read resumed>"", 1024) = 0
3247648 11:18:54 close(321 <unfinished ...>
3247648 11:18:54 <... close resumed>)   = 0
3247648 11:18:54 close(316 <unfinished ...>
3247648 11:18:54 <... close resumed>)   = 0
3247648 11:18:54 write(312, "\33\0214\226\357p\317f\334R~\356\2078h\343\245\16\33:M\1M=\333t\254\22I\34\265U\220m&\233\346\351\233\210\r0Z1\22?:\314\355\34\244\34\36\6\20r\352\310y[eH\252[\\\246\343\37\271\330\304\245\263rA\316r(qm\3100\227\2667\31\266Q*\0-\236\0250B=Jj\254o\23DW3s\250U\263B\272\211
3247648 11:18:54 read(312, 0x7f7068069680, 8192) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_ADD, 312, {EPOLLIN, {u32=312, u64=140119013065016}}) = 0
3247648 11:18:54 epoll_wait(313, [{EPOLLIN, {u32=312, u64=140119013065016}}], 1024, 60000) = 1
3247648 11:18:54 epoll_ctl(313, EPOLL_CTL_DEL, 312, 0x7f70aa1b6944) = 0
3247648 11:18:54 epoll_wait(313, [], 1024, 0) = 0
3247648 11:18:54 read(312, "\230\3\2\20\214eD\236\260\ra\266\254 >\312~\377r5\205\271\332\364:\251\301e\213\255\260f\5\200d\211t\4\325\31\345\366\201\367\325\314'\311#\273s\351 e\200\334\226\34]\2%\32)5\33\245[\21\236=\\\322u{;\232\24\217\23IY\262\243S\303j\2173\340\335@\0363&?\16\2260\270\234\353\3
3247648 11:18:54 mprotect(0x7f706806e000, 131072, PROT_READ|PROT_WRITE <unfinished ...>
3247648 11:18:54 <... mprotect resumed>) = 0
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 <... read resumed>"6q\353L-\344'Bq\313\347$\246Vg\327\\\336*\231\36\233\220R\vf\223\250\372P\364w\241$\306\202\301\2757\242k\300\201\360\6o\376\350\216_3lX{\336I\333\265N\233K\221\"\3108\240\267{\302Jb\250k\211&\372\27594P\3\364\224\245'|)\325\331\255\213\228\330\2174\204\242\354\24
3247648 11:18:54 read(312, "\307\v\366\332=\6l\36\235G\276\230j\277O\260:h\360#/\261D]=\355\34\246\2\r\252\2309\333\245y\20\315\246\36\226\rUv\200\207\256\3252\26\202t\266os\371+:\350l\2\2157K\300\316;\345\264\241\226\25\2025m\24H\221L#4\202\214\207\331(]\213\303\26FI\243\241\324\261\250K\0ga\230\34
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 <... read resumed>"\1\251a\373s\247+\336e\336\25\234;f\204\21\252\341\317\330\10\341('\t\302\322\371\340/\213\273\334\306\5a\360\n\200\16(\277k\230\325\\\347\35f\360\352_\231u\304\361\266#q\360)zyp\320\346\374*K\17\333\305\273\275ip\235\37\335N\253\316\2542\221\202+\231,\337$R\227\3
3247648 11:18:54 close(312 <unfinished ...>
3247648 11:18:54 <... close resumed>)   = 0

See full log

We see that fd 312 maps to the sockets opened to 10.64.142.3:50010 and 10.64.142.3:50020 (respectively the DataNode data-transfer and metadata/IPC ports, according to https://ambari.apache.org/1.2.3/installing-hadoop-using-ambari/content/reference_chap2_1.html).

brouberol@dse-k8s-worker1002:~$ sudo ls -alh /proc/3247327/fd/312
lrwx------ 1 analytics-privatedata 900 64 Jan 10 09:45 /proc/3247327/fd/312 -> 'socket:[645069240]'
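
The socket inode from that /proc link can be mapped back to the live TCP connection with ss, as long as the socket is still open (a sketch: 645069240 is the inode shown above):

```
# sketch: -e prints extended socket info, including an "ino:<inode>" field
# that matches the inode from /proc/<pid>/fd/312
sudo ss -tpne | grep 645069240
```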

We filter the strace on that file descriptor, keeping only the connect, read and close operations.
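
Something like the following grep does the job (a sketch: shs-strace.log is the assumed capture file from above):

```
# sketch: keep only connect/read/close calls made on fd 312
grep -E ' (connect|read|close)\(312' shs-strace.log
```

This yields: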

3247648 11:18:54 connect(312, {sa_family=AF_INET6, sin6_port=htons(50020), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.64.142.3", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 read(312, "\0\0\1\1\16\10\337\377\377\377\17\20\0\30\t:\0@\1\360\1\20\1\"\266\1\n\5TOKEN\22\nDIGEST-MD5\32\0\"\7default*\225\1realm=\"default\",nonce=\"weB1DnOCOqZTkotDzS9nIuKDuBKXCwckPs5zk6y9\",qop=\"auth-conf\",charset=utf-8,cipher=\"3des,rc4,des,rc4-56,rc4-40\",algorithm=md5-sess
3247648 11:18:54 read(312, 0x7f706801e120, 8192) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 read(312, "\0\0\0<\16\10\337\377\377\377\17\20\0\30\t:\0@\1,\20\0\32(rspauth=75ba505dd18e60e87776debf14e61697", 8192) = 64
3395559 11:18:54 read(312, "\0\0\0L\10\10\337\377\377\377\17\20\0B\20\5\32>d\4\320j\31\35\331\342\324\257\270\324\302\231\372e\221U\241\372 \257\f\16\356%\20\3648\320`t\344u\212\177\336\23\327_%\226\31\260\366}b\337\33s\372@\20Q\326\311\0\1\0\0\0\0", 8192) = 80
3247648 11:18:54 connect(312, {sa_family=AF_INET6, sin6_port=htons(50010), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.64.142.3", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
3247648 11:18:54 read(312, 0x7f706801e120, 1) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 read(312, "\224", 1)   = 1
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 read(312, "\10\0\22\217\1realm=\"0\",nonce=\"GnNhXfiZ1SaohZmKo0rDBKIshaxo5och/+Rbze4e\",qop=\"auth-conf\",charset=utf-8,cipher=\"3des,rc4,des,rc4-56,rc4-40\",algorithm=md5-sess", 148) = 148
3247648 11:18:54 read(312, 0x7f706801e120, 1) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 read(312, "\244", 1)   = 1
3247648 11:18:54 read(312, "\1", 1)     = 1
3247648 11:18:54 read(312, "\10\0\22(rspauth=6c85e51905d890ad108e6521668fdc14\"v\10\2\22&^\3746\n\f5\5S\354G\215\21\21s\355\302{\33\251\315-\26\3020\255\276\235\233DqYR\0\1\0\0\0\0\32\20\306W\21\270\2649Cf\371\214\367\26\224\267\330|\"&\332Nk\265j!\330\371d\333\24\2001\274\222~\212\361V%M\236\22\251
3247648 11:18:54 read(312, 0x7f7068069680, 8192) = -1 EAGAIN (Resource temporarily unavailable)
3247648 11:18:54 read(312, "\230\3\2\20\214eD\236\260\ra\266\254 >\312~\377r5\205\271\332\364:\251\301e\213\255\260f\5\200d\211t\4\325\31\345\366\201\367\325\314'\311#\273s\351 e\200\334\226\34]\2%\32)5\33\245[\21\236=\\\322u{;\232\24\217\23IY\262\243S\303j\2173\340\335@\0363&?\16\2260\270\234\353\3
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 read(312, "\307\v\366\332=\6l\36\235G\276\230j\277O\260:h\360#/\261D]=\355\34\246\2\r\252\2309\333\245y\20\315\246\36\226\rUv\200\207\256\3252\26\202t\266os\371+:\350l\2\2157K\300\316;\345\264\241\226\25\2025m\24H\221L#4\202\214\207\331(]\213\303\26FI\243\241\324\261\250K\0ga\230\34
3247648 11:18:54 read(312,  <unfinished ...>
3247648 11:18:54 close(312 <unfinished ...>

What I'm not sure of at this point is whether the socket was closed before the read was finished and thus the TCP/IP stack sent the RST packet closing the transfer, or if the SHS closed the socket _because_ the RST packet was sent.

We lack microsecond precision here, which we'd need to correlate the events. I'll rerun the whole experiment with the -tt strace flag.
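
That is, rerunning roughly the same capture as before, but with microsecond timestamps (a sketch):

```
# sketch: -tt replaces the per-second timestamps with microsecond precision
sudo strace -f -tt -s 300 -o shs-strace-tt.log -p 3247327
```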

Rerunning the experiment with strace -tt on dse-k8s-worker1002, we see

3247648 13:48:54.609794 connect(312, {sa_family=AF_INET6, sin6_port=htons(50010), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:10.64.36.9", &sin6_addr), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
...
3247648 13:48:54.619782 close(312)      = 0

and

13:48:54.619852 IP 10.67.29.205.33474 > an-worker1133.eqiad.wmnet.50010: Flags [R.], seq 649, ack 534755, win 762, options [nop,nop,TS val 159471777 ecr 2742734176], length 0

So we now know that the close syscall caused the RST packet to be sent to an-worker1133.eqiad.wmnet:50010: the close at 13:48:54.619782 precedes the RST at 13:48:54.619852 by about 70µs, and closing a socket that still has unread data in its receive queue makes the kernel send an RST rather than a normal FIN.

At that point, our best bet would be to increase logging verbosity, to get better insight into what's happening in the spark history process. Looking at the code, I'm not seeing any logDebug call that would indicate that we're missing log data, but at this point, I'm really grasping at straws.

One thing that seems clear, though, is that this is not a network issue.
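
For the record, bumping the root logger to DEBUG amounts to something like this (a sketch: on log4j2-based Spark builds the file is log4j2.properties; the path, and appending rather than templating it through the deployment, are assumptions):

```
# sketch: set the root logger level to DEBUG and restart the history server
cat >> /opt/spark/conf/log4j2.properties <<'EOF'
rootLogger.level = debug
EOF
```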

This is what I'm seeing with the root logger level set to DEBUG for the .inprogress log file:

24/01/10 14:29:08 DEBUG FsHistoryProvider: Scanning hdfs:///var/log/spark with lastScanTime==1704896648829
24/01/10 14:29:08 DEBUG SaslRpcClient: reading next wrapped RPC packet
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA sending #141 org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing
24/01/10 14:29:08 DEBUG SaslRpcClient: wrapping token of length:116
24/01/10 14:29:08 DEBUG SaslRpcClient: Sending sasl message state: WRAP
token: "\005\004\002\377\000\000\000\000\000\000\000\000\001\001b\244\276\343\216\213\353\302\225F`\203\270\303\327\255f\037<6\360\331/\344H\016\203\026\337\277\243+\r|\355\311\"=F}x\'x\203\253ts\374\365\217\022\221\203\360\220\220\037\350C?\227\244?)\202\243M\212\254\311F\346\274b\305\331\000\032\356\242\340\304\247$\224\267\233\366F\207\321\267\005&m\231\265\273\214L\256\316B\205$\326X]\033\030\241\r\024\262\n6\022\364\324\v:\337\351\' -\266\036a\0025E\300r\335\"\b\271\206YB\001\271F\036\335\233i2pe\212]\030\345-\312\200\037a\036"

24/01/10 14:29:08 DEBUG SaslRpcClient: unwrapping token of length:1093
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA got value #141
24/01/10 14:29:08 DEBUG ProtobufRpcEngine: Call: getListing took 4ms
24/01/10 14:29:08 DEBUG FsHistoryProvider: New/updated attempts found: 1 ArrayBuffer(hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5735.lz4.inprogress)
24/01/10 14:29:08 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5735.lz4.inprogress for listing data...
24/01/10 14:29:08 DEBUG SaslRpcClient: reading next wrapped RPC packet
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA sending #142 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations
24/01/10 14:29:08 DEBUG SaslRpcClient: wrapping token of length:173
24/01/10 14:29:08 DEBUG SaslRpcClient: Sending sasl message state: WRAP
token: "\005\004\002\377\000\000\000\000\000\000\000\000\001\001b\245\220,)\271\376\033\356C6v\254\303\305L\314J\340\252t\020Q\271\a\375\351\017\257\374\3062\331L\261\033 \3500\346\267\204]\266\337c:\325T-=\344\330\001\360&\221bD\234\262\030?\333i\217Y\240\332\206=\004\257e\253\356\304\234\337\366\343=&g\267\274f\037\350\227\016s<\277\273\371N\n\304\306\267\305\343M\370\305\341=\304$\210\nBZr\253=$\255\353\373D?\242j4\"\262\030\264h\315B\a\004!\303\023m\270D\021\227@R\373\031\t\002\234\025\350Zn\243\267E\3576y\247/\211\317Ix~\006\"\271\203\023\3047\244=\023\211\200\024L\a\264\213\340\021}IH\242z\220{\022\277KqS9\022odo2\205\f!\365E-\2560\001\346\332r\'\377\032"

24/01/10 14:29:08 DEBUG SaslRpcClient: unwrapping token of length:1843
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA got value #142
24/01/10 14:29:08 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2ms
24/01/10 14:29:08 DEBUG DFSClient: newInfo = LocatedBlocks{
  fileLength=0
  underConstruction=true
  blocks=[LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=29962; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK], DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK]]}]
  lastLocatedBlock=LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=29962; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK], DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK]]}
  isLastBlockComplete=false}
24/01/10 14:29:08 DEBUG ClientDatanodeProtocolTranslatorPB: Connecting to datanode 10.64.53.9:50020 addr=/10.64.53.9:50020
24/01/10 14:29:08 DEBUG Client: getting client out of cache: org.apache.hadoop.ipc.Client@573284a5
24/01/10 14:29:08 DEBUG Client: The ping interval is 60000 ms.
24/01/10 14:29:08 DEBUG Client: Connecting to /10.64.53.9:50020
24/01/10 14:29:08 DEBUG Client: Setup connection to /10.64.53.9:50020
24/01/10 14:29:08 DEBUG UserGroupInformation: PrivilegedAction as:blk_2419869437_1346216541 (auth:SIMPLE) from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:828)
24/01/10 14:29:08 DEBUG SaslRpcClient: Sending sasl message state: NEGOTIATE

24/01/10 14:29:08 DEBUG SaslRpcClient: Get token info proto:interface org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolPB info:@org.apache.hadoop.security.token.TokenInfo(value=org.apache.hadoop.hdfs.security.token.block.BlockTokenSelector.class)
24/01/10 14:29:08 DEBUG SaslRpcClient: Creating SASL DIGEST-MD5(TOKEN)  client to authenticate to service at default
24/01/10 14:29:08 DEBUG SaslRpcClient: Use TOKEN authentication for protocol ClientDatanodeProtocolPB
24/01/10 14:29:08 DEBUG SaslRpcClient: SASL client callback: setting username: igGM9ew7a4wqHfnQAAAABXNwYXJrAAAAKEJQLTE1NTI4NTQ3ODQtMTAuNjQuMjEuMTEwLTE0MDUxMTQ0ODk2NjGMkDxG/QEEUkVBRA==
24/01/10 14:29:08 DEBUG SaslRpcClient: SASL client callback: setting userPassword
24/01/10 14:29:08 DEBUG SaslRpcClient: SASL client callback: setting realm: default
24/01/10 14:29:08 DEBUG SaslRpcClient: Sending sasl message state: INITIATE
token: "charset=utf-8,username=\"igGM9ew7a4wqHfnQAAAABXNwYXJrAAAAKEJQLTE1NTI4NTQ3ODQtMTAuNjQuMjEuMTEwLTE0MDUxMTQ0ODk2NjGMkDxG/QEEUkVBRA==\",realm=\"default\",nonce=\"P+owAuFKn58HDe9LnTWSvLnGuiC5V25zMcx9ARpS\",nc=00000001,cnonce=\"iy51DEQe1OnN0VNrNNZFEZ2V6uBEvJUlaTqt+uyV\",digest-uri=\"/default\",maxbuf=65536,response=951c8a49a134e8d369c4f1fcfba60d9c,qop=auth-conf,cipher=\"3des\""
auths {
  method: "TOKEN"
  mechanism: "DIGEST-MD5"
  protocol: ""
  serverId: "default"
}

24/01/10 14:29:08 DEBUG Client: Negotiated QOP is :auth-conf
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to /10.64.53.9:50020 from blk_2419869437_1346216541: starting, having connections 2
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to /10.64.53.9:50020 from blk_2419869437_1346216541 sending #143 org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol.getReplicaVisibleLength
24/01/10 14:29:08 DEBUG SaslRpcClient: reading next wrapped RPC packet
24/01/10 14:29:08 DEBUG SaslRpcClient: wrapping token of length:267
24/01/10 14:29:08 DEBUG SaslRpcClient: Sending sasl message state: WRAP
token: "L\025\022d\311\2149\273-\343\000\371\2416v}\004\311\356\005W{J\021\'b\225C\023W\341kq2Hz\322)\031\021fs\224\231\312e\202\200\233\245|\316@;j\214\n\034\240\265\324;\316\271j\371\271\236\234D\221\314\354\242vD \212t\235\340b\315d\326p\350\315\307\314\356\364\321{\313\247\342\310\2426/\320\200\r\300J\024\357\323\250\266I\222\025jNqq\336\344\254\314\242bpJ\205\225\201sp\365o\251\221\360q?D\225*\204T\205\256;r\025%\313sf7\a\362X\257\016\336\256\366K\301\354<.N,\016\304_\335\023?\240\2349\234\323E\3030\000$\'E(\341\373\370z\377S6\024kG\350-\301\245\363\350y%\006W\244\313\r\244|&j\237`z\220zX\330\224\235\"~\035\355\271\372\2641\345\353\323\363\376\346v\347&\317\3465\271]:\214B\363P\260\000/\'g\037\260\222 \277\371Y\322\310\215\336\366m\272\301\270\216J\307[g\201\225\273\000\001\000\000\000\000"

24/01/10 14:29:08 DEBUG SaslRpcClient: unwrapping token of length:62
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to /10.64.53.9:50020 from blk_2419869437_1346216541 got value #143
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to /10.64.53.9:50020 from blk_2419869437_1346216541: closed
24/01/10 14:29:08 DEBUG Client: IPC Client (57264571) connection to /10.64.53.9:50020 from blk_2419869437_1346216541: stopped, remaining connections 1
24/01/10 14:29:08 DEBUG ProtobufRpcEngine: Call: getReplicaVisibleLength took 26ms
24/01/10 14:29:08 DEBUG Client: stopping client from cache: org.apache.hadoop.ipc.Client@573284a5
24/01/10 14:29:08 DEBUG DFSClient: Connecting to datanode 10.64.53.9:50010
24/01/10 14:29:08 DEBUG SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
24/01/10 14:29:08 DEBUG SaslDataTransferClient: SASL client doing encrypted handshake for addr = /10.64.53.9, datanodeId = DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK]
24/01/10 14:29:08 DEBUG SaslDataTransferClient: Client using encryption algorithm null
24/01/10 14:29:08 DEBUG Token: Cannot find class for token kind HDFS_BLOCK_TOKEN
24/01/10 14:29:08 DEBUG SaslDataTransferClient: Block token id is null, sending without handshake secret.
24/01/10 14:29:08 DEBUG DataTransferSaslUtil: Verifying QOP, requested QOP = [auth-conf], negotiated QOP = auth-conf
24/01/10 14:29:08 DEBUG SaslDataTransferClient: Client using cipher suite AES/CTR/NoPadding with server /10.64.53.9
24/01/10 14:29:08 DEBUG DataTransferSaslUtil: Creating IOStreamPair of CryptoInputStream and CryptoOutputStream.
24/01/10 14:29:08 DEBUG PerformanceAdvisory: Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
24/01/10 14:29:08 DEBUG PerformanceAdvisory: Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
24/01/10 14:29:08 DEBUG DFSClient: DeadNode detection is not enabled or given block LocatedBlocks{
  fileLength=0
  underConstruction=true
  blocks=[LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=29962; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK], DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK]]}]
  lastLocatedBlock=LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=6119241; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK], DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK]]}
  isLastBlockComplete=false} is null, skip to remove node.

And we see the following log lines for the parsing once the spark session was closed (note that the file is now reported with fileLength=6124447 and underConstruction=false, whereas the .inprogress version had fileLength=0 and underConstruction=true):

24/01/10 14:34:08 DEBUG FsHistoryProvider: Scanning hdfs:///var/log/spark with lastScanTime==1704896948835
24/01/10 14:34:08 DEBUG SaslRpcClient: reading next wrapped RPC packet
24/01/10 14:34:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA sending #184 org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing
24/01/10 14:34:08 DEBUG SaslRpcClient: wrapping token of length:116
24/01/10 14:34:08 DEBUG SaslRpcClient: Sending sasl message state: WRAP
token: "\005\004\002\377\000\000\000\000\000\000\000\000\f\211\206`\320\b\222dI\221\337\361\257\247c\224\246\035\370\031\224f\244\200\355\004\263\a\330\376\373\350D\344\310Qo\210s\371b\340@\345\344\251\335\332^VI\030p\341^\023\224\275?\371\306\233\246V\315\376\300C\034\215\321b\205z\213B\374\370\241\250oB\332 M\243\253U\223\345~\207\270\322\263\"zJ\3271:\322\262\033\257\231\253\311\253\207/\273\231\034\243!b\350\023H\253F\237\216\"\354\336\315\355\v\353\202\023B\306\234\367o\353*\367CY&\004\330\370B\367+<\247\257\325\360\016\273.\376\024\314\313\274\305"

24/01/10 14:34:08 DEBUG SaslRpcClient: unwrapping token of length:1082
24/01/10 14:34:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA got value #184
24/01/10 14:34:08 DEBUG ProtobufRpcEngine: Call: getListing took 3ms
24/01/10 14:34:08 DEBUG FsHistoryProvider: New/updated attempts found: 1 ArrayBuffer(hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5735.lz4)
24/01/10 14:34:08 INFO FsHistoryProvider: Parsing hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5735.lz4 for listing data...
24/01/10 14:34:08 DEBUG SaslRpcClient: reading next wrapped RPC packet
24/01/10 14:34:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA sending #185 org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations
24/01/10 14:34:08 DEBUG SaslRpcClient: wrapping token of length:162
24/01/10 14:34:08 DEBUG SaslRpcClient: Sending sasl message state: WRAP
token: "\005\004\002\377\000\000\000\000\000\000\000\000\f\211\206a\035\234\031\254\256\256Na>(&R\210\377\277X8\034N\244\302\321l\342A\260\340\264n\262h 8\252\307/\345-\250\034;[B\260Mk#\206\3307\336\304\252\331\b)\026W\234\347sef\021}\377\200=yR\037\376\021\373\355\315\241>\363\346j\312+\237\030\034\313.\004P\r\034\233*8\313XX\033\351l\231\322@\203g\275t~\366\225;\335\337\252q\004ge>\253\2359^\362\373a\202M\003=>\257\251\035\026\372\224\030;G\243\330\246s\b\214\210r0\245f\253\024~\344\241H\362\016d\016/\037m\234\016i5y\247>\323\220\2204\217\275K\235\320\252\352\202\017\243\032\370/\311\304v\034\"\002kro5\223D*\356\255"

24/01/10 14:34:08 DEBUG SaslRpcClient: unwrapping token of length:1846
24/01/10 14:34:08 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA got value #185
24/01/10 14:34:08 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2ms
24/01/10 14:34:08 DEBUG DFSClient: newInfo = LocatedBlocks{
  fileLength=6124447
  underConstruction=false
  blocks=[LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=6124447; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK], DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK]]}]
  lastLocatedBlock=LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=6124447; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK], DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK]]}
  isLastBlockComplete=true}
24/01/10 14:34:08 DEBUG DFSClient: Connecting to datanode 10.64.36.109:50010
24/01/10 14:34:08 DEBUG SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
24/01/10 14:34:08 DEBUG SaslDataTransferClient: SASL client doing encrypted handshake for addr = /10.64.36.109, datanodeId = DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK]
24/01/10 14:34:08 DEBUG SaslDataTransferClient: Client using encryption algorithm null
24/01/10 14:34:08 DEBUG Token: Cannot find class for token kind HDFS_BLOCK_TOKEN
24/01/10 14:34:08 DEBUG SaslDataTransferClient: Block token id is null, sending without handshake secret.
24/01/10 14:34:08 DEBUG DataTransferSaslUtil: Verifying QOP, requested QOP = [auth-conf], negotiated QOP = auth-conf
24/01/10 14:34:08 DEBUG SaslDataTransferClient: Client using cipher suite AES/CTR/NoPadding with server /10.64.36.109
24/01/10 14:34:08 DEBUG DataTransferSaslUtil: Creating IOStreamPair of CryptoInputStream and CryptoOutputStream.
24/01/10 14:34:08 DEBUG PerformanceAdvisory: Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
24/01/10 14:34:08 DEBUG PerformanceAdvisory: Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
24/01/10 14:34:08 DEBUG DFSClient: DeadNode detection is not enabled or given block LocatedBlocks{
  fileLength=6124447
  underConstruction=false
  blocks=[LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=6124447; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK], DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK]]}]
  lastLocatedBlock=LocatedBlock{BP-1552854784-10.64.21.110-1405114489661:blk_2419869437_1346216541; getBlockSize()=6124447; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.64.53.9:50010,DS-8ca052a4-40a0-42a2-a21a-c3c42621d328,DISK], DatanodeInfoWithStorage[10.64.53.7:50010,DS-91201f6b-aeb0-4089-993d-046feab1ee4e,DISK], DatanodeInfoWithStorage[10.64.36.109:50010,DS-ad0ba592-b962-403b-b0dc-4157d5f3025b,DISK]]}
  isLastBlockComplete=true} is null, skip to remove node.
24/01/10 14:34:15 DEBUG SecurityManager: user=null aclsEnabled=false viewAcls=runuser,spark viewAclsGroups=
24/01/10 14:34:15 DEBUG SaslRpcClient: reading next wrapped RPC packet
24/01/10 14:34:15 DEBUG Client: IPC Client (57264571) connection to an-master1001.eqiad.wmnet/10.64.5.26:8020 from spark/spark-history.svc.eqiad.wmnet@WIKIMEDIA sending #186 org.apache.hadoop.hdfs.protocol.ClientProtocol.setSafeMode
24/01/10 14:34:15 DEBUG SaslRpcClient: wrapping token of length:101
24/01/10 14:34:15 DEBUG SaslRpcClient: Sending sasl message state: WRAP

I'm not seeing any particular issue or traceback here. I looked for tracebacks in the whole logset and found three kinds:

24/01/10 14:14:07 DEBUG Shell: Failed to detect a valid hadoop home directory
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
        at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:454)
        at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:425)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:502)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:78)
        at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1564)
        at org.apache.hadoop.security.SecurityUtil.setConfigurationInternal(SecurityUtil.java:96)
        at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:85)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:306)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:355)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:51)
        at org.apache.spark.deploy.SparkHadoopUtil$.instance$lzycompute(SparkHadoopUtil.scala:398)
        at org.apache.spark.deploy.SparkHadoopUtil$.instance(SparkHadoopUtil.scala:398)
        at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:471)
        at org.apache.spark.deploy.history.HistoryServer$.initSecurity(HistoryServer.scala:352)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:298)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)


24/01/10 14:14:09 DEBUG OpensslCipher: Failed to load OpenSSL Cipher.
java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl()'
        at org.apache.hadoop.util.NativeCodeLoader.buildSupportsOpenssl(Native Method)
        at org.apache.hadoop.crypto.OpensslCipher.<clinit>(OpensslCipher.java:84)
        at org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec.<init>(OpensslAesCtrCryptoCodec.java:50)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.hadoop.crypto.CryptoCodec.getInstance(CryptoCodec.java:69)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:387)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:606)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getEncryptedStreams(SaslDataTransferClient.java:342)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:276)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:245)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:227)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:170)
        at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:657)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2910)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:828)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
        at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
        at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:734)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:681)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:971)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1029)
        at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
        at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:292)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
        at net.jpountz.lz4.LZ4BlockInputStream.tryReadFully(LZ4BlockInputStream.java:271)
        at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:192)
        at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:159)
        at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
        at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
        at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
        at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
        at scala.collection.Iterator$$anon$20.hasNext(Iterator.scala:886)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
        at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:81)
        at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:59)
        at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$3(FsHistoryProvider.scala:1140)
        at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$3$adapted(FsHistoryProvider.scala:1138)
        at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2786)
        at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$1(FsHistoryProvider.scala:1138)
        at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$parseAppEventLogs$1$adapted(FsHistoryProvider.scala:1136)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.deploy.history.FsHistoryProvider.parseAppEventLogs(FsHistoryProvider.scala:1136)
        at org.apache.spark.deploy.history.FsHistoryProvider.doMergeApplicationListingInternal(FsHistoryProvider.scala:796)
        at org.apache.spark.deploy.history.FsHistoryProvider.doMergeApplicationListing(FsHistoryProvider.scala:765)
        at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:714)
        at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$18(FsHistoryProvider.scala:581)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)


24/01/10 14:14:09 INFO FsHistoryProvider: Finished parsing hdfs://analytics-hadoop/var/log/spark/application_1704805203730_5123.lz4
24/01/10 14:14:12 DEBUG PeerCache: Exception in closing EncryptedPeer(NioInetPeer(Socket[unconnected]))
java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
        at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40)
        at org.apache.hadoop.crypto.CryptoOutputStream.freeBuffers(CryptoOutputStream.java:311)
        at org.apache.hadoop.crypto.CryptoOutputStream.close(CryptoOutputStream.java:245)
        at org.apache.hadoop.hdfs.net.EncryptedPeer.close(EncryptedPeer.java:96)
        at org.apache.hadoop.hdfs.util.IOUtilsClient.cleanup(IOUtilsClient.java:36)
        at org.apache.hadoop.hdfs.PeerCache.evictExpired(PeerCache.java:224)
        at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:256)
        at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
        at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
        at java.base/java.lang.Thread.run(Thread.java:829)
24/01/10 14:14:15 DEBUG PeerCache: Exception in closing EncryptedPeer(NioInetPeer(Socket[unconnected]))
java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
        at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40)
        at org.apache.hadoop.crypto.CryptoOutputStream.freeBuffers(CryptoOutputStream.java:311)
        at org.apache.hadoop.crypto.CryptoOutputStream.close(CryptoOutputStream.java:245)
        at org.apache.hadoop.hdfs.net.EncryptedPeer.close(EncryptedPeer.java:96)
        at org.apache.hadoop.hdfs.util.IOUtilsClient.cleanup(IOUtilsClient.java:36)
        at org.apache.hadoop.hdfs.PeerCache.evictExpired(PeerCache.java:224)
        at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:256)
        at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
        at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
        at java.base/java.lang.Thread.run(Thread.java:829)
24/01/10 14:14:15 DEBUG PeerCache: Exception in closing EncryptedPeer(NioInetPeer(Socket[unconnected]))
java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
        at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40)
        at org.apache.hadoop.crypto.CryptoOutputStream.freeBuffers(CryptoOutputStream.java:311)
        at org.apache.hadoop.crypto.CryptoOutputStream.close(CryptoOutputStream.java:245)
        at org.apache.hadoop.hdfs.net.EncryptedPeer.close(EncryptedPeer.java:96)
        at org.apache.hadoop.hdfs.util.IOUtilsClient.cleanup(IOUtilsClient.java:36)
        at org.apache.hadoop.hdfs.PeerCache.evictExpired(PeerCache.java:224)
        at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:256)
        at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
        at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
        at java.base/java.lang.Thread.run(Thread.java:829)

I'm also seeing this in the startup logs:

24/01/10 15:07:11 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
24/01/10 15:07:11 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path: [/usr/java/packages/lib, /usr/lib/x86_64-linux-gnu/jni, /lib/x86_64-linux-gnu, /usr/lib/x86_64-linux-gnu, /usr/lib/jni, /lib, /usr/lib]
24/01/10 15:07:11 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
24/01/10 15:07:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/10 15:07:11 DEBUG PerformanceAdvisory: Falling back to shell based

I wonder whether the builtin-java implementations (Hadoop's fallback when the native library can't be loaded) struggle with "large" files.
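
If a hadoop CLI were available in the container, the native codec situation could be checked directly (a sketch: the SHS image may well not ship this binary):

```
# sketch: lists which native libraries (hadoop, zlib, zstd, openssl, ...)
# the client can actually load
hadoop checknative -a
```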

I actually found hadoop jar files in the spark container:

runuser@spark-history-analytics-hadoop-6fd7bd8c84-hspm2:/srv/app$ find / -name "hadoop*jar"
find: ‘/proc/tty/driver’: Permission denied
find: ‘/etc/ssl/private’: Permission denied
find: ‘/var/cache/apt/archives/partial’: Permission denied
find: ‘/var/cache/ldconfig’: Permission denied
find: ‘/root’: Permission denied
/opt/spark/jars/hadoop-hdfs-client-2.10.2.jar
/opt/spark/jars/hadoop-common-2.10.2.jar
/opt/spark/jars/hadoop-yarn-api-2.10.2.jar
/opt/spark/jars/hadoop-yarn-server-common-2.10.2.jar
/opt/spark/jars/hadoop-mapreduce-client-app-2.10.2.jar
/opt/spark/jars/hadoop-mapreduce-client-core-2.10.2.jar
/opt/spark/jars/hadoop-auth-2.10.2.jar
/opt/spark/jars/hadoop-yarn-client-2.10.2.jar
/opt/spark/jars/hadoop-yarn-server-web-proxy-2.10.2.jar
/opt/spark/jars/hadoop-mapreduce-client-jobclient-2.10.2.jar
/opt/spark/jars/hadoop-mapreduce-client-common-2.10.2.jar
/opt/spark/jars/hadoop-yarn-registry-2.10.2.jar
/opt/spark/jars/hadoop-yarn-common-2.10.2.jar
/opt/spark/jars/hadoop-client-2.10.2.jar
/opt/spark/jars/hadoop-annotations-2.10.2.jar
/opt/spark/jars/hadoop-mapreduce-client-shuffle-2.10.2.jar

Edit: nevermind, /opt/spark/jars is indeed added to the JVM's CLASSPATH:

root@deploy2002:/home/brouberol/spark-history# kubectl logs -l app=spark-history -c spark-history-analytics-hadoop --tail=10000 -f
+ '[' -z '' ']'
++ java -XshowSettings:properties -version
++ awk '/java.home/ {print $3}'
+ JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
...

Right now, I think the most promising lead is this traceback:

24/01/10 14:14:15 DEBUG PeerCache: Exception in closing EncryptedPeer(NioInetPeer(Socket[unconnected]))
java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
        at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40)
        at org.apache.hadoop.crypto.CryptoOutputStream.freeBuffers(CryptoOutputStream.java:311)
        at org.apache.hadoop.crypto.CryptoOutputStream.close(CryptoOutputStream.java:245)
        at org.apache.hadoop.hdfs.net.EncryptedPeer.close(EncryptedPeer.java:96)
        at org.apache.hadoop.hdfs.util.IOUtilsClient.cleanup(IOUtilsClient.java:36)
        at org.apache.hadoop.hdfs.PeerCache.evictExpired(PeerCache.java:224)
        at org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:256)
        at org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
        at org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
        at java.base/java.lang.Thread.run(Thread.java:829)

It involves an error closing a socket to an HDFS node, which has a nice smell to it. I found a couple of leads:

Looking at this traceback specifically, I found the offending line. This, alongside this issue and PR, seems to indicate that:

  • Hadoop 2.10 isn't compatible with JDK > 8
  • Hadoop 3.2.0 is compatible with JDK > 8
  • https://github.com/apache/hadoop/pull/4080 is still open to remove this error when using Hadoop 2.10 with JDK > 8, but I don't know if this is the only step to support these JDK versions with Hadoop 2.10

This is confirmed in https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions!

Apache Hadoop 3.3 and upper supports Java 8 and Java 11 (runtime only)
    Please compile Hadoop with Java 8. Compiling Hadoop with Java 11 is not supported: HADOOP-16795 (Java 11 compile support, still open)
Apache Hadoop from 3.0.x to 3.2.x now supports only Java 8
Apache Hadoop from 2.7.x to 2.10.x support both Java 7 and 8

We're indeed running Java 11 with Hadoop-HDFS 2.10:

runuser@spark-history-analytics-hadoop-5f9bfdb964-gcq4g:/srv/app$ java -version
openjdk version "11.0.21" 2023-10-17
OpenJDK Runtime Environment (build 11.0.21+9-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.21+9-post-Debian-1deb11u1, mixed mode, sharing)
runuser@spark-history-analytics-hadoop-5f9bfdb964-gcq4g:/srv/app$ ls /opt/spark/jars/hadoop-hdfs-*
/opt/spark/jars/hadoop-hdfs-client-2.10.2.jar
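
The signature change itself can be confirmed straight from the running JDK (a sketch: javap resolves the internal class from the runtime image):

```
# sketch: on JDK 11 this shows cleaner() returning jdk.internal.ref.Cleaner,
# whereas Hadoop 2.10 was compiled against the JDK 8 signature returning
# sun.misc.Cleaner, hence the NoSuchMethodError at link time
javap -p sun.nio.ch.DirectBuffer | grep -i cleaner
```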

So we either need to:

  • revert to using OpenJDK 8
  • upgrade to Hadoop 3.3 (can we?)

cc @BTullis