Occasionally the NodeManager process dies. This seems to happen more often when the cluster is very busy:
2015-06-18 10:36:11,004 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.BufferedReader.<init>(BufferedReader.java:98)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:525)
        at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:223)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:439)
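For context on where that allocation pressure comes from: the stack trace shows the container monitor thread rebuilding its process tree by opening a fresh BufferedReader over a /proc stat file for each process it tracks, every monitoring interval. Here's a minimal illustrative sketch of that access pattern, not Hadoop's actual ProcfsBasedProcessTree code; it reads /proc/self/stat so it runs anywhere on Linux, whereas the real monitor walks each container's pid tree:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ProcScanSketch {
    // One short-lived BufferedReader per process per monitoring interval --
    // the same allocation pattern the OOM stack trace above is sitting in.
    static String readProcStat(String pid) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new FileReader("/proc/" + pid + "/stat"))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "self" keeps the sketch runnable; the real monitor iterates the
        // pids of every container process on the node, every few seconds.
        List<String> pids = Arrays.asList("self");
        for (String pid : pids) {
            System.out.println(readProcStat(pid));
        }
    }
}

With many containers on a busy node, that's a steady churn of short-lived objects, which is exactly the situation where "GC overhead limit exceeded" fires on a small heap.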
I'm not 100% sure whether this is the NodeManager process itself dying or just a container. However, the *.out file has:
Caused by: java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
        at java.util.concurrent.FutureTask.get(FutureTask.java:187)
        at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1046)
        at org.apache.hadoop.ipc.Client.call(Client.java:1441)
        ... 8 more
Halting due to Out Of Memory Error...
and the process does die. Puppet starts it back up again a few minutes later. Hm.
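If this really is just the NodeManager heap being too small for the per-container monitoring load, bumping it in yarn-env.sh might be worth a try. A hedged sketch, assuming the stock Hadoop 2.x yarn-env.sh conventions; the 2048 value (in MB) is a guess, not something we've validated:

# yarn-env.sh -- illustrative values, not our current config
export YARN_NODEMANAGER_HEAPSIZE=2048

# Also dump the heap on OOM so the next death leaves evidence behind
# (the dump path here is hypothetical):
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hadoop-yarn"

At minimum the heap dump would tell us definitively whether it's the NodeManager's own heap filling up, rather than a container's.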