Over the past few days I noticed four failures of the YARN NodeManagers, all with the same signature:
```
elukey@an-worker1123:~$ sudo grep yarn /var/log/syslog.1
May 3 21:59:14 an-worker1123 systemd[1]: hadoop-yarn-nodemanager.service: Main process exited, code=exited, status=255/EXCEPTION <==========
May 3 21:59:14 an-worker1123 systemd[1]: hadoop-yarn-nodemanager.service: Failed with result 'exit-code'.
elukey@an-worker1123:~$ grep java.lang.OutOfMemory /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-worker1123.log -B 10 -A 10 --color
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing pid file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to tmp file /var/lib/hadoop/data/h/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001813/container_e11_1619507802557_26647_01_001813.pid.tmp
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating local dirs...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Launching container...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,736 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1.
2021-05-03 21:59:08,738 ERROR org.apache.hadoop.util.Shell: Caught java.lang.OutOfMemoryError: unable to create new native thread. One possible reason is that ulimit setting of 'max user processes' is too low. If so, do 'ulimit -u <largerNum>' and try again. <===================
2021-05-03 21:59:08,749 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 1. Privileged Execution Operation Stderr:
Stdout: main : command provided 1
main : run as user is jiawang
main : requested yarn user is jiawang
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /var/lib/hadoop/data/b/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001809/container_e11_1619507802557_26647_01_001809.pid.tmp
Writing to cgroup task files...
--
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing pid file...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to tmp file /var/lib/hadoop/data/j/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001806/container_e11_1619507802557_26647_01_001806.pid.tmp
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating local dirs...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Launching container...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,753 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1.
2021-05-03 21:59:08,758 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread <===================
java.lang.OutOfMemoryError: unable to create new native thread <===================
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:975)
at org.apache.hadoop.util.Shell.run(Shell.java:884)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1216)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:147)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:138)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:687)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:697)
```
In this case, as far as I can see, the YARN application that may have caused this is https://yarn.wikimedia.org/jobhistory/job/job_1619507802557_26647 (a big query over pageview data).
The other failures happened some days ago, so there seems to be no history left for them (we may need to increase retention in ZooKeeper).
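On the retention side, assuming the ResourceManager state-store is ZooKeeper-backed, the number of completed applications kept around is bounded by the following `yarn-site.xml` properties; the values below are illustrative, not a recommendation for this cluster:

```xml
<!-- yarn-site.xml: how many completed applications the RM retains -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>10000</value>
</property>
<property>
  <!-- the subset of the above persisted in the ZK state-store -->
  <name>yarn.resourcemanager.state-store.max-completed-applications</name>
  <value>10000</value>
</property>
```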
It is not clear to me why this happens, since there seems to be plenty of headroom for threads in the YARN NodeManager's cgroup:
```
elukey@an-worker1131:~$ sudo systemctl status hadoop-yarn-nodemanager.service | grep Tasks
Tasks: 707 (limit: 11059)
```
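The 11059 limit above is the systemd cgroup task bound, but the JVM error points at the `max user processes` ulimit instead, which is a different ceiling. A quick way to compare the relevant limits (a sketch, shown here against the current shell; on an affected node one would point it at the NodeManager process instead) is:

```shell
#!/bin/sh
# Compare the limits that can trigger "unable to create new native thread".
# Shown for the current process; on the node, replace /proc/self with
# /proc/<NodeManager pid> (e.g. found via pgrep -f nodemanager).

# Per-process "Max processes" limit (this is what ulimit -u / nproc enforces):
grep 'Max processes' /proc/self/limits

# Soft nproc limit for the current user:
ulimit -u

# Kernel-wide ceiling on threads:
cat /proc/sys/kernel/threads-max

# On the node itself, the systemd task limit can be inspected with:
#   systemctl show hadoop-yarn-nodemanager.service -p TasksMax
```

If the per-process "Max processes" value is much lower than the cgroup task limit, the ulimit would be the binding constraint.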
I suspect a heavy job is hitting this resource constraint, but in theory YARN should prevent that from happening in the first place (maybe we are missing some config?).
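If the `yarn` user's nproc limit turns out to be the culprit, note that systemd services do not go through PAM, so `/etc/security/limits.d` entries would not apply; the limit would need to be raised on the unit itself. A hypothetical drop-in (path and values are assumptions, not a tested fix):

```ini
# /etc/systemd/system/hadoop-yarn-nodemanager.service.d/limits.conf (hypothetical)
[Service]
# Raise the per-process "max user processes" ulimit seen by the NM JVM
LimitNPROC=65536
# Optionally raise the cgroup task limit as well (it counts threads too)
TasksMax=32768
```

This would be followed by `systemctl daemon-reload` and a restart of the unit.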