
Hadoop Yarn Nodemanagers occasionally die with OOM / GC errors
Closed, ResolvedPublic

Description

Occasionally a NodeManager process dies. This seems to happen more often when the cluster is very busy:

2015-06-18 10:36:11,004 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.BufferedReader.<init>(BufferedReader.java:98)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:525)
        at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:223)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:439)

I'm not 100% sure whether this is the NodeManager process itself dying or just a container. However, the *.out file has:

Caused by: java.lang.InterruptedException
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
	at java.util.concurrent.FutureTask.get(FutureTask.java:187)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1046)
	at org.apache.hadoop.ipc.Client.call(Client.java:1441)
	... 8 more
Halting due to Out Of Memory Error...

and the process does die. Puppet starts it back up again a few minutes later. Hm.
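For reference, a quick way to double-check that it really is the NodeManager JVM restarting rather than just a container (the .out path below is an assumption; adjust to wherever the NodeManager writes its stdout file):

ps -o pid,etime,args -C java | grep -i nodemanager    # a short ETIME means the JVM restarted recently
grep -i 'Halting due to Out Of Memory' /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.out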

Event Timeline

Ottomata claimed this task.
Ottomata raised the priority of this task to Needs Triage.
Ottomata updated the task description.
Ottomata added a project: Analytics-Clusters.
Ottomata added subscribers: Ottomata, JAllemandou, Gage.

Just got some more logs from one of these crashes:

2015-06-18 16:04:23,865 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
2015-06-18 16:04:23,415 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_e07_1430945266892_75677_01_000037
2015-06-18 16:04:20,110 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ResponseProcessor for block BP-1552854784-10.64.21.110-1405114489661:blk_1099476362_25737245,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:04:03,830 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_e07_1430945266892_75177_01_007685
2015-06-18 16:05:19,626 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs	IP=10.64.53.17	OPERATION=Stop Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1430945266892_75677	CONTAINERID=container_e07_1430945266892_75677_01_000037
2015-06-18 16:04:56,574 ERROR org.mortbay.log: handle failed
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:04:25,278 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e07_1430945266892_75526_01_001633. Current good log dirs are /var/lib/hadoop/data/b/yarn/logs,/var/lib/hadoop/data/c/yarn/logs,/var/lib/hadoop/data/d/yarn/logs,/var/lib/hadoop/data/e/yarn/logs,/var/lib/hadoop/data/f/yarn/logs,/var/lib/hadoop/data/g/yarn/logs,/var/lib/hadoop/data/h/yarn/logs,/var/lib/hadoop/data/i/yarn/logs,/var/lib/hadoop/data/j/yarn/logs,/var/lib/hadoop/data/k/yarn/logs,/var/lib/hadoop/data/l/yarn/logs,/var/lib/hadoop/data/m/yarn/logs
2015-06-18 16:05:55,864 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:05:56,807 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[DiskHealthMonitor-Timer,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:05:54,481 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:05:54,481 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=yurik	IP=10.64.36.130	OPERATION=Stop Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1430945266892_75177	CONTAINERID=container_e07_1430945266892_75177_01_007685
2015-06-18 16:05:59,801 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[RMI TCP Connection(idle),5,RMI Runtime] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:05:54,481 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:06:02,686 INFO org.apache.hadoop.io.retry.RetryInvocationHandler: Exception while invoking renewLease of class ClientNamenodeProtocolTranslatorPB over analytics1001.eqiad.wmnet/10.64.36.118:8020. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "analytics1029/10.64.36.129"; destination host is: "analytics1001.eqiad.wmnet":8020;
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy32.renewLease(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
	at sun.reflect.GeneratedMethodAccessor157.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy33.renewLease(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:876)
	at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
	at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
	at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
	at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Couldn't set up IO streams
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:786)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
	at org.apache.hadoop.ipc.Client.call(Client.java:1438)
	... 16 more
Caused by: java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:06:01,287 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[RMI TCP Connection(idle),5,RMI Runtime] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:06:01,286 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[RMI TCP Connection(idle),5,RMI Runtime] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:05:59,801 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[RMI TCP Connection(idle),5,RMI Runtime] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
2015-06-18 16:06:04,050 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException
2015-06-18 16:05:58,873 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1430945266892_75321_000001 (auth:SIMPLE)
2015-06-18 16:06:04,050 INFO org.apache.hadoop.mapred.ShuffleHandler: Setting connection close header...

I opened https://phabricator.wikimedia.org/T132559 before checking Phabricator for an existing task; today multiple hosts showed this behavior and I had to restart Yarn manually.

elukey triaged this task as High priority. Apr 13 2016, 1:32 PM
elukey added a project: Analytics-Kanban.

This clearly doesn't happen often. Since we don't know what causes it, it wouldn't hurt to increase NodeManager memory a little, eh?
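For the record, a minimal sketch of what that bump could look like, assuming the heap is controlled by YARN_NODEMANAGER_HEAPSIZE in yarn-env.sh (our puppetized config may well set this elsewhere):

# /etc/hadoop/conf/yarn-env.sh (path assumed)
export YARN_NODEMANAGER_HEAPSIZE=2048    # in MB
# or, equivalently, via explicit JVM flags:
export YARN_NODEMANAGER_OPTS="-Xms2048m -Xmx2048m"

Either way the NodeManager needs a restart to pick up the new value.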

Seeing some interesting logs on an56, which just had a NodeManager die. Not sure if they're related (see the note after these logs):

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /var/log/hadoop-yarn/apps/hdfs/logs is exceeded: limit=1048576 items=1048576
...
2016-04-13 13:26:30,996 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Log Aggregation service failed to initialize, there will be no logs for this application
...
2016-04-13 13:28:22,163 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
...
2016-04-13 13:28:24,210 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: Log aggregation is not initialized for container_e17_1458129362008_86595_01_001879, did it fail to start?
...
2016-04-13 13:30:41,023 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[RMI TCP Connection(idle),5,RMI Runtime] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
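Side note on the MaxDirectoryItemsExceededException above: the aggregated-log directory in HDFS appears to have hit its 1,048,576-children limit, which would explain log aggregation failing to initialize. A hedged way to confirm, plus the knobs involved (property names are from upstream Hadoop docs, not checked against our config):

hdfs dfs -ls /var/log/hadoop-yarn/apps/hdfs/logs | wc -l    # roughly the number of direct children vs. limit=1048576
# dfs.namenode.fs-limits.max-directory-items in hdfs-site.xml controls that limit;
# cleaning up old aggregated app logs (e.g. via yarn.log-aggregation.retain-seconds) is the usual fix.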
Ottomata set Security to None.

Removing this from Analytics-Kanban. We upped the JVM heap for these processes to 2048m. Will have to wait and see if this happens again.
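One quick sanity check that the new heap is actually in effect on a worker (this assumes the NodeManager runs as a plain java process with -Xmx on its command line):

ps -o args -C java | grep -i nodemanager | tr ' ' '\n' | grep -i '^-xmx'    # expect -Xmx2048m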

The new heap size seems to have solved the problem completely, so I'm closing the task.