
Yarn node manager JVM memory leaks
Closed, Resolved · Public

Description

While working on T88640 we noticed that the JVM heap usage of the Yarn node manager daemons on all the Hadoop worker nodes shows a pattern of steady memory allocation over time. Eventually this leads to frequent Old Generation GC collections in the best case, and OutOfMemory errors in the worst one.

We collected a heap dump on analytics1034 (/tmp/an1034.hprof) and analyzed it with Eclipse MAT (thanks to @Gehel for the suggestion!); it seems that there are memory leaks in:

261,067 instances of "org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics", loaded by "sun.misc.Launcher$AppClassLoader @ 0x78001aa60" occupy 597,458,664 (38.89%) bytes. These instances are referenced from one instance of "java.util.HashMap$Entry[]", loaded by "<system class loader>"

Keywords
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainerMetrics
java.util.HashMap$Entry[]
sun.misc.Launcher$AppClassLoader @ 0x78001aa60

The profiler points to the following chain of references keeping these instances alive (a minimal sketch of this pattern follows the list):

  • java.util.TimerThread @ 0x7801c3ee8 Timer for 'NodeManager' metrics system Thread
  • org.apache.hadoop.metrics2.impl.MetricsSystemImpl$4 @ 0x7801c3eb0
  • java.util.HashMap @ 0x7801bc1e8
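
To make the shape of the leak concrete, here is a minimal, hypothetical Java sketch (not Hadoop code; class and names are made up) of the retention pattern described above: a long-lived TimerTask captures the object that owns the source map, so entries that are added but never removed can never be garbage collected.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Timer;
import java.util.TimerTask;

// Illustrative only: stands in for MetricsSystemImpl and its source map.
class LeakShapeExample {
    private final Map<String, Object> sources = new HashMap<>();
    private final Timer timer = new Timer("Timer for 'NodeManager' metrics system", true);

    void start() {
        // The anonymous TimerTask (analogous to MetricsSystemImpl$4) captures `this`,
        // so the whole source map stays reachable from the TimerThread GC root.
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                System.out.println("snapshotting " + sources.size() + " sources");
            }
        }, 0, 10_000);
    }

    void registerSource(String name, Object source) {
        // One entry is added per container; if nothing ever removes it,
        // the map (and every source in it) only grows.
        sources.put(name, source);
    }

    public static void main(String[] args) throws InterruptedException {
        LeakShapeExample ms = new LeakShapeExample();
        ms.start();
        for (int i = 0; i < 1000; i++) {
            ms.registerSource("container_" + i, new Object());
        }
        Thread.sleep(15_000); // let the timer fire once, then exit
    }
}
```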

And also:

36 instances of "io.netty.buffer.PoolChunk", loaded by "sun.misc.Launcher$AppClassLoader @ 0x78001aa60" occupy 302,621,616 (19.70%) bytes.

Biggest instances:

io.netty.buffer.PoolChunk @ 0x790e9d1b8 - 16,794,176 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790f43888 - 16,794,176 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790f43918 - 16,794,176 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x781e08180 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e44de8 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e44e38 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e51288 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e54488 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e5a438 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e792f0 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790ed3b98 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790ee1d48 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790f1f9d8 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790fa6b20 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x793f35ca0 - 16,794,032 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e52ff0 - 16,793,888 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x790e775c8 - 16,793,888 (1.09%) bytes.
io.netty.buffer.PoolChunk @ 0x793ef5bd8 - 16,793,888 (1.09%) bytes.


Keywords
io.netty.buffer.PoolChunk
sun.misc.Launcher$AppClassLoader @ 0x78001aa60

Event Timeline

elukey triaged this task as Medium priority. Edited · Dec 22 2016, 4:14 PM
elukey added a subscriber: Ottomata.

15:51 <elukey> !log restarting the yarn node manager java daemons on all the Hadoop worker nodes due to suspect memory leak

node_managers_restart.png (831×1 px, 649 KB)

https://issues.apache.org/jira/browse/HADOOP-13362 (https://issues.apache.org/jira/browse/YARN-5482 has a simpler description and points to it) seems to be a likely candidate for the root cause of the issue. The affected version is listed as 2.7, but I read it as "up to 2.7" rather than "only on 2.7".

Anyhow, from http://www.cloudera.com/documentation/enterprise/release-notes/PDF/cloudera-releases.pdf it seems that the fix for HADOOP-13362 is shipped in CDH 5.9.

@Ottomata what do you think?

Nice find! Let's keep an eye on this and hope that they release something with Spark 2.0 soon so we can do an upgrade.

Also here: https://issues.apache.org/jira/browse/HADOOP-11105, regarding the class org.apache.hadoop.metrics2.impl.MetricsSystemImpl, which is retaining a lot of memory.

Screen Shot 2017-01-04 at 1.39.10 PM.png (1×1 px, 513 KB)

In 2.6 a new method called unregisterSource was added that basically removes callbacks from the array in which they are stored (see the sketch after the links below):

Base Class:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.6.0/org/apache/hadoop/metrics2/MetricsSystem.java#MetricsSystem.unregisterSource%28java.lang.String%29

Implementation:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-common/2.6.0/org/apache/hadoop/metrics2/impl/MetricsSystemImpl.java#MetricsSystemImpl.unregisterSource%28java.lang.String%29
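
For illustration, here is a minimal, hypothetical sketch of how that API is meant to be used (the source class and names below are made up; this is not the actual NodeManager code): a per-container metrics source is registered when the container starts and unregistered when it finishes, so the metrics system does not retain it forever.

```java
import org.apache.hadoop.metrics2.MetricsCollector;
import org.apache.hadoop.metrics2.MetricsSource;
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;

public class UnregisterSourceExample {

    // Hypothetical per-container source, standing in for ContainerMetrics.
    static class PerContainerMetrics implements MetricsSource {
        @Override
        public void getMetrics(MetricsCollector collector, boolean all) {
            collector.addRecord("PerContainerMetrics"); // hypothetical record name
        }
    }

    public static void main(String[] args) {
        MetricsSystem ms = DefaultMetricsSystem.initialize("NodeManager");

        // One source per container: without a matching unregisterSource() call,
        // every entry stays in the metrics system's internal source map for the
        // lifetime of the daemon -- the growth pattern seen in the heap dump.
        String sourceName = "ContainerResource_container_0001"; // hypothetical name
        ms.register(sourceName, "per-container metrics", new PerContainerMetrics());

        // ... container finishes ...

        // Available since Hadoop 2.6 (HADOOP-11105): removes the source and its
        // callbacks so the instance can be garbage collected.
        ms.unregisterSource(sourceName);

        ms.shutdown();
    }
}
```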

The fix for this issue is in 2.6 / CDH 5.3.8, so this one we should already have.


Need to look at this some more

This fix has been shipped in CDH 5.3.8 from what I can see in http://www.cloudera.com/documentation/enterprise/release-notes/PDF/cloudera-releases.pdf. Maybe something related but hidden in another bug report?

Ok, the fix for https://issues.apache.org/jira/browse/YARN-5482 is also in the same method, unregisterSource; that is why memory seems to leak from the MetricsSystemImpl class:
https://fisheye6.atlassian.com/changelog/hadoop?cs=6759cbc56a09c129f5902ddbbee7db665ca9c917

So what is happening is that the 1st fix (for https://issues.apache.org/jira/browse/HADOOP-11105) half-fixed the issue and (hopefully) the 2nd fix (https://issues.apache.org/jira/browse/YARN-5482) truly takes care of it.

Milimetric subscribed.

This will be resolved by upgrading CDH versions, which will happen soon. So... resolved in the future :)