
Yarn node manager JVM memory leaks
Closed, Resolved · Public

Description

While working on T88640 we noticed that the JVM heap usage of the Yarn node manager daemons on all the Hadoop worker nodes shows a pattern of steady memory allocation over time. Eventually this leads to frequent Old Generation GC collections in the best case, and OutOfMemory errors in the worst one.

We collected a heap dump on analytics1034 (/tmp/an1034.hprof) and analyzed it with Eclipse MAT (thanks to @Gehel for the suggestion!); it seems that there are memory leaks in:

261,067 instances of "org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics", loaded by "sun.misc.Launcher$AppClassLoader @ 0x78001aa60" occupy 597,458,664 (38.89%) bytes. These instances are referenced from one instance of "java.util.HashMap$Entry[]", loaded by "<system class loader>"

Keywords
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics
java.util.HashMap$Entry[]
sun.misc.Launcher$AppClassLoader @ 0x78001aa60

The profiler points to the following chain of references keeping these instances alive (a minimal sketch of this pattern follows the list):

  • java.util.TimerThread @ 0x7801c3ee8 Timer for 'NodeManager' metrics system Thread
  • org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4 @ 0x7801c3eb0
  • java.util.HashMap @ 0x7801bc1e8
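
To make the shape of the leak concrete, here is a minimal, hypothetical Java sketch (not Hadoop code; class and names are made up) of the retention pattern described above: a long-lived TimerTask captures the object that owns the source map, so entries that are added but never removed can never be garbage collected.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;

// Illustrative only: stands in for MetricsSystemImpl and its source map.
class LeakShapeExample {
    private final Map<String, Object> sources = new HashMap<>();
    private final Timer timer = new Timer("Timer for 'NodeManager' metrics system", true);

    void start() {
        // The anonymous TimerTask (analogous to MetricsSystemImpl$4) captures `this`,
        // so the whole source map stays reachable from the TimerThread GC root.
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                System.out.println("snapshotting " + sources.size() + " sources");
            }
        }, 0, 10_000);
    }

    void registerSource(String name, Object source) {
        // One entry is added per container; if nothing ever removes it,
        // the map (and every source in it) only grows.
        sources.put(name, source);
    }

    public static void main(String[] args) throws InterruptedException {
        LeakShapeExample ms = new LeakShapeExample();
        ms.start();
        for (int i = 0; i < 1000; i++) {
            ms.registerSource("container_" + i, new Object());
        }
        Thread.sleep(15_000); // let the timer fire once, then exit
    }
}
```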

And also:

36 instances of "io.netty.buffer.PoolChunk", loaded by "sun.misc.Launcher$AppClassLoader @ 0x78001aa60" occupy 302,621,616 (19.70%) bytes.

Biggest instances:

io.netty.buffer.PoolChunk @ 0x790e9d1b8 - 16,794,176 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790f43888 - 16,794,176 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790f43918 - 16,794,176 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x781e08180 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e44de8 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e44e38 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e51288 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e54488 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e5a438 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e792f0 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790ed3b98 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790ee1d48 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790f1f9d8 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790fa6b20 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x793f35ca0 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e52ff0 - 16,793,888 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e775c8 - 16,793,888 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x793ef5bd8 - 16,793,888 (1.09%) bytes.


Keywords
io.netty.buffer.PoolChunk
sun.misc.Launcher$AppClassLoader @ 0x78001aa60

Event Timeline

elukey triaged this task as Medium priority. Edited · Dec 22 2016, 4:14 PM
elukey added a subscriber: Ottomata.

15:51 <elukey> !log restarting the yarn node manager java daemons on all the Hadoop worker nodes due to suspect memory leak

node_managers_restart.png (831×1 px, 649 KB)

https://issues.apache.org/jira/browse/HADOOP-13362 (https://issues.apache.org/jira/browse/YARN-5482 has a simpler description and points to it) seems to be a likely candidate for the root cause of the issue. The affected version is listed as 2.7, but I read it as "up to 2.7" rather than "only on 2.7".

Anyhow, from http://www.cloudera.com/documentation/enterprise/release-notes/PDF/cloudera-releases.pdf it seems that the fix for HADOOP-13362 is shipped in CDH 5.9.

@Ottomata what do you think?

Nice find! Let's keep an eye on this and hope that they release something with Spark 2.0 soon so we can do an upgrade.

Also here: https://issues.apache.org/jira/browse/HADOOP-11105, regarding the class org.apache.hadoop.metrics2.impl.MetricsSystemImpl, which is retaining a lot of memory.

Screen Shot 2017-01-04 at 1.39.10 PM.png (1×1 px, 513 KB)

In 2.6 a new method called unregisterSource was added that basically removes callbacks from the array in which they are stored (see the sketch after the links below):

Base Class:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.6.0/org/apache/hadoop/metrics2/MetricsSystem.java#MetricsSystem.unregisterSource%28java.lang.String%29

Implementation:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.6.0/org/apache/hadoop/metrics2/impl/MetricsSystemImpl.java#MetricsSystemImpl.unregisterSource%28java.lang.String%29
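
For illustration, here is a minimal, hypothetical sketch of how that API is meant to be used (the source class and names below are made up; this is not the actual NodeManager code): a per-container metrics source is registered when the container starts and unregistered when it finishes, so the metrics system does not retain it forever.

```java
import org.apache.hadoop.metrics2.MetricsCollector;
import org.apache.hadoop.metrics2.MetricsSource;
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;

public class UnregisterSourceExample {

    // Hypothetical per-container source, standing in for ContainerMetrics.
    static class PerContainerMetrics implements MetricsSource {
        @Override
        public void getMetrics(MetricsCollector collector, boolean all) {
            collector.addRecord("PerContainerMetrics"); // hypothetical record name
        }
    }

    public static void main(String[] args) {
        MetricsSystem ms = DefaultMetricsSystem.initialize("NodeManager");

        // One source per container: without a matching unregisterSource() call,
        // every entry stays in the metrics system's internal source map for the
        // lifetime of the daemon -- the growth pattern seen in the heap dump.
        String sourceName = "ContainerResource_container_0001"; // hypothetical name
        ms.register(sourceName, "per-container metrics", new PerContainerMetrics());

        // ... container finishes ...

        // Available since Hadoop 2.6 (HADOOP-11105): removes the source and its
        // callbacks so the instance can be garbage collected.
        ms.unregisterSource(sourceName);

        ms.shutdown();
    }
}
```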

The fix for this issue is in 2.6 / CDH 5.3.8, so this one we should already have.


Need to look at this some more

This fix has been shipped in CDH 5.3.8 from what I can see in http://www.cloudera.com/documentation/enterprise/release-notes/PDF/cloudera-releases.pdf. Maybe something related but hidden in another bug report?

Ok, the fix for https://issues.apache.org/jira/browse/YARN-5482 is also in the same method, unregisterSource; that is why memory seems to leak from the MetricsSystemImpl class:
https://fisheye6.atlassian.com/changelog/hadoop?cs=6759cbc56a09c129f5902ddbbee7db665ca9c917

So what is happening is that the 1st fix (for https://issues.apache.org/jira/browse/HADOOP-11105) half-fixed the issue and (hopefully) the 2nd fix (https://issues.apache.org/jira/browse/YARN-5482) truly takes care of it.

Milimetric subscribed.

This will be resolved by upgrading CDH versions, which will happen soon. So... resolved in the future :)