While working on T88640 we noticed that the JVM heap usage of the Yarn node manager daemons on all the Hadoop worker nodes shows a pattern of steady growth over time. In the best case this eventually leads to frequent Old Generation garbage collections, and in the worst case to OutOfMemory errors.
We collected a heap dump on analytics1034 (/tmp/an1034.hprof) and analyzed it with Eclipse MAT (thanks to @Gehel for the suggestion!). It seems that there are memory leaks in:
261,067 instances of "org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics", loaded by "sun.misc.Launcher$AppClassLoader @ 0x78001aa60" occupy 597,458,664 (38.89%) bytes. These instances are referenced from one instance of "java.util.HashMap$Entry[]", loaded by "<system class loader>".
Keywords:
- org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics
- java.util.HashMap$Entry[]
- sun.misc.Launcher$AppClassLoader @ 0x78001aa60
The profiler suggests the following chain of references holding these instances:
- java.util.TimerThread @ 0x7801c3ee8 Timer for 'NodeManager' metrics system Thread
- org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4 @ 0x7801c3eb0
- java.util.HashMap @ 0x7801bc1e8
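To illustrate the kind of pattern this reference chain hints at, here is a minimal sketch of a metrics registry backed by a HashMap that gains one entry per container. It is not the Hadoop code: `FakeContainerMetrics`, `MetricsRegistrySketch`, `register`, `unregister` and `simulate` are hypothetical names invented for the example, and the actual root cause in the node manager still has to be confirmed. If entries are never removed when a container finishes, the map grows by one object per container for the lifetime of the daemon.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-container metrics object (NOT the Hadoop class).
final class FakeContainerMetrics {
    final String containerId;
    final long[] samples = new long[64]; // simulates state retained per container

    FakeContainerMetrics(String containerId) {
        this.containerId = containerId;
    }
}

// Hypothetical registry: one long-lived map keyed by container id.
final class MetricsRegistrySketch {
    private final Map<String, FakeContainerMetrics> sources = new HashMap<>();

    FakeContainerMetrics register(String containerId) {
        return sources.computeIfAbsent(containerId, FakeContainerMetrics::new);
    }

    void unregister(String containerId) {
        sources.remove(containerId);
    }

    int size() {
        return sources.size();
    }

    // Runs `containers` short-lived containers through the registry and
    // returns how many metrics objects are still referenced afterwards.
    static int simulate(int containers, boolean unregisterOnFinish) {
        MetricsRegistrySketch registry = new MetricsRegistrySketch();
        for (int i = 0; i < containers; i++) {
            String id = "container_" + i;
            registry.register(id);
            // The container would run and finish here.
            if (unregisterOnFinish) {
                registry.unregister(id);
            }
        }
        return registry.size();
    }

    public static void main(String[] args) {
        System.out.println("without unregister: " + simulate(1000, false));
        System.out.println("with unregister:    " + simulate(1000, true));
    }
}
```

In the sketch the only difference between the leaking and the non-leaking run is whether the entry is removed when the container finishes, which is the kind of cleanup worth checking for in the real code path.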
And also:
36 instances of "io.netty.buffer.PoolChunk", loaded by "sun.misc.Launcher$AppClassLoader @ 0x78001aa60" occupy 302,621,616 (19.70%) bytes.
Biggest instances:
- io.netty.buffer.PoolChunk @ 0x790e9d1b8 - 16,794,176 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790f43888 - 16,794,176 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790f43918 - 16,794,176 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x781e08180 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e44de8 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e44e38 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e51288 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e54488 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e5a438 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e792f0 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790ed3b98 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790ee1d48 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790f1f9d8 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790fa6b20 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x793f35ca0 - 16,794,032 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e52ff0 - 16,793,888 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x790e775c8 - 16,793,888 (1.09%) bytes.
- io.netty.buffer.PoolChunk @ 0x793ef5bd8 - 16,793,888 (1.09%) bytes.
Keywords:
- io.netty.buffer.PoolChunk
- sun.misc.Launcher$AppClassLoader @ 0x78001aa60