I noticed four failures of the YARN NodeManagers over the past few days, all with the same signature:
elukey@an-worker1123:~$ sudo grep yarn /var/log/syslog.1
May  3 21:59:14 an-worker1123 systemd[1]: hadoop-yarn-nodemanager.service: Main process exited, code=exited, status=255/EXCEPTION   <==========
May  3 21:59:14 an-worker1123 systemd[1]: hadoop-yarn-nodemanager.service: Failed with result 'exit-code'.

elukey@an-worker1123:~$ grep java.lang.OutOfMemory /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-worker1123.log -B 10 -A 10 --color
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing pid file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to tmp file /var/lib/hadoop/data/h/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001813/container_e11_1619507802557_26647_01_001813.pid.tmp
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating local dirs...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Launching container...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,736 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1.
2021-05-03 21:59:08,738 ERROR org.apache.hadoop.util.Shell: Caught java.lang.OutOfMemoryError: unable to create new native thread. One possible reason is that ulimit setting of 'max user processes' is too low. If so, do 'ulimit -u <largerNum>' and try again.   <===================
2021-05-03 21:59:08,749 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 1. Privileged Execution Operation Stderr:
Stdout: main : command provided 1
main : run as user is jiawang
main : requested yarn user is jiawang
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /var/lib/hadoop/data/b/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001809/container_e11_1619507802557_26647_01_001809.pid.tmp
Writing to cgroup task files...
--
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing pid file...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to tmp file /var/lib/hadoop/data/j/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001806/container_e11_1619507802557_26647_01_001806.pid.tmp
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating local dirs...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Launching container...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,753 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1.
2021-05-03 21:59:08,758 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread   <===================
java.lang.OutOfMemoryError: unable to create new native thread   <===================
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:717)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:975)
    at org.apache.hadoop.util.Shell.run(Shell.java:884)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1216)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:147)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:138)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:687)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:697)
In this case, afaics, the YARN app that may have caused it is: https://yarn.wikimedia.org/jobhistory/job/job_1619507802557_26647 (a big query over pageview data).
The other failures happened some days ago, so there seems to be no history left for them (we may need to increase retention in Zookeeper).
It is not clear to me why this problem happens, since there seems to be plenty of headroom for threads in the YARN NM's cgroup:
elukey@an-worker1131:~$ sudo systemctl status hadoop-yarn-nodemanager.service | grep Tasks
    Tasks: 707 (limit: 11059)
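(For reference: if that systemd Tasks limit ever turns out to be the bottleneck, it can be inspected and raised with a drop-in override. This is only a sketch of the idea, with a placeholder value, not something applied anywhere yet:)

# Sketch: check the current TasksMax of the unit, and raise it via a drop-in if needed.
sudo systemctl show hadoop-yarn-nodemanager.service -p TasksMax
# sudo systemctl edit hadoop-yarn-nodemanager.service
#     [Service]
#     TasksMax=infinity    # placeholder value, not a recommendation
# sudo systemctl restart hadoop-yarn-nodemanager.service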
Also:
elukey@an-worker1131:~$ cat /proc/$(pgrep -f nodemanager)/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             65536                65536                processes
Max open files            32768                32768                files
Max locked memory         67108864             67108864             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       771839               771839               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
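One thing I'd like to double check on a busy worker is how many threads the yarn user actually has under load, to see how close we get to the 65536 'max processes' limit (or to the systemd Tasks limit). Something along these lines (example commands, not output from the affected hosts):

# Total threads currently owned by the yarn user (compare against Max processes = 65536)
sudo ps -L -u yarn --no-headers | wc -l
# Threads of the NodeManager JVM itself
grep Threads /proc/$(pgrep -f nodemanager)/status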
I suspect that a heavy job hits this resource constraint, but in theory YARN should prevent this from happening in the first place (maybe we are missing some config?).
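To verify the "heavy job" theory, one option is to count tasks per container cgroup on a node while a suspect job is running. This is only a sketch and the path is an assumption (cgroup v1 with the default 'hadoop-yarn' hierarchy); it may well differ on our hosts:

# Rough idea only: spot containers spawning an unusually large number of threads.
for c in /sys/fs/cgroup/cpu/hadoop-yarn/container_*; do
    echo "$(basename "$c"): $(wc -l < "$c/tasks") tasks"
done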