
Yarn NM stopping due to failures while creating native threads
Closed, Resolved · Public

Description

I noticed four failures of the Yarn Node Managers over the past few days, all with the same signature:

elukey@an-worker1123:~$ sudo grep yarn /var/log/syslog.1
May  3 21:59:14 an-worker1123 systemd[1]: hadoop-yarn-nodemanager.service: Main process exited, code=exited, status=255/EXCEPTION   <==========
May  3 21:59:14 an-worker1123 systemd[1]: hadoop-yarn-nodemanager.service: Failed with result 'exit-code'.

elukey@an-worker1123:~$ grep java.lang.OutOfMemory /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-worker1123.log -B 10 -A 10 --color
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing pid file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to tmp file /var/lib/hadoop/data/h/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001813/container_e11_1619507802557_26647_01_001813.pid.tmp
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating local dirs...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Launching container...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,736 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,736 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1. 

2021-05-03 21:59:08,738 ERROR org.apache.hadoop.util.Shell: Caught java.lang.OutOfMemoryError: unable to create new native thread. One possible reason is that ulimit setting of 'max user processes' is too low. If so, do 'ulimit -u <largerNum>' and try again. <===================

2021-05-03 21:59:08,749 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 1. Privileged Execution Operation Stderr: 

Stdout: main : command provided 1
main : run as user is jiawang
main : requested yarn user is jiawang
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /var/lib/hadoop/data/b/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001809/container_e11_1619507802557_26647_01_001809.pid.tmp
Writing to cgroup task files...
--
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing pid file...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to tmp file /var/lib/hadoop/data/j/yarn/local/nmPrivate/application_1619507802557_26647/container_e11_1619507802557_26647_01_001806/container_e11_1619507802557_26647_01_001806.pid.tmp
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating local dirs...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Launching container...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Getting exit code file...
2021-05-03 21:59:08,753 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Creating script paths...
2021-05-03 21:59:08,753 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container launch failed : Container exited with a non-zero exit code 1. 

2021-05-03 21:59:08,758 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread  <===================

java.lang.OutOfMemoryError: unable to create new native thread  <===================
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:975)
	at org.apache.hadoop.util.Shell.run(Shell.java:884)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1216)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:152)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.signalContainer(DefaultLinuxContainerRuntime.java:147)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:138)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:687)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:697)

In this case, AFAICS, the Yarn application that may have caused this is https://yarn.wikimedia.org/jobhistory/job/job_1619507802557_26647 (a big query over pageview data).

The other failures happened some days ago, so there seems to be no history left for them (we may need to increase retention in Zookeeper).

It is not clear to me why this problem happens; there seems to be plenty of headroom for threads in the Yarn NM's cgroup:

elukey@an-worker1131:~$ sudo systemctl status hadoop-yarn-nodemanager.service  | grep Tasks
    Tasks: 707 (limit: 11059)

Also:

elukey@an-worker1131:~$ cat /proc/$(pgrep -f nodemanager)/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             65536                65536                processes 
Max open files            32768                32768                files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       771839               771839               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

I suspect a heavy job is hitting this resource constraint, but in theory Yarn should prevent that from happening in the first place (maybe we are missing some config?).
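
For reference, a few illustrative commands that could help confirm where the limit is being hit, i.e. whether it is the kernel-wide thread/pid ceilings or the NM unit's cgroup task accounting (sketches only, not output from the affected hosts):

# Kernel-wide ceilings on threads/pids
cat /proc/sys/kernel/threads-max
cat /proc/sys/kernel/pid_max

# Threads currently owned by the yarn user on the node
ps -eLf | awk '$1 == "yarn"' | wc -l

# Task accounting of the NM unit, same data as the systemctl status above
systemctl show hadoop-yarn-nodemanager.service -p TasksCurrent -p TasksMax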

Event Timeline

Change 685314 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] bigtop::hadoop::nodemanager: apply systemd override to service

https://gerrit.wikimedia.org/r/685314
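
For context, such an override is typically a systemd drop-in raising the unit's task (pid) accounting limit; a minimal sketch along these lines (the exact directive and value applied are in the Gerrit change above):

# /etc/systemd/system/hadoop-yarn-nodemanager.service.d/override.conf (sketch only)
[Service]
TasksMax=infinity

After installing a drop-in like this, a systemctl daemon-reload and a restart of hadoop-yarn-nodemanager.service are needed for the new limit to take effect.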

Change 685762 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: force Yarn to use DominantResourceCalculator

https://gerrit.wikimedia.org/r/685762
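
For reference, with the Capacity Scheduler the calculator is switched via a property along these lines (a sketch; the exact file and value applied in production are in the Gerrit patch above):

<!-- capacity-scheduler.xml (sketch only) -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

With the default DefaultResourceCalculator only memory is considered when sizing containers; the DominantResourceCalculator also takes vcores into account.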

Change 685762 merged by Elukey:

[operations/puppet@production] hadoop: force Yarn to use DominantResourceCalculator

https://gerrit.wikimedia.org/r/685762

Mentioned in SAL (#wikimedia-analytics) [2021-05-06T12:39:28Z] <elukey> restart Yarn RMs to apply the dominant resource calculator setting - T281792

Change 685314 merged by Elukey:

[operations/puppet@production] bigtop::hadoop::nodemanager: apply systemd override to service

https://gerrit.wikimedia.org/r/685314

After a chat with Joseph we decided to proceed one change at a time:

We are using https://hue.wikimedia.org/hue/jobbrowser/#!id=0011324-210426062240701-oozie-oozi-C as a quick test to see whether we have solved the problem. The issue with that coordinator is that some containers end up in a java.lang.OutOfMemoryError (no more native threads available) state, which is similar to the one highlighted in the description but not the same (there the error happens at the Yarn NM process level, not at the container level).
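
A simple way to keep an eye on recurrences is to grep the NM log for the native-thread error on the worker nodes, e.g. (illustrative command):

sudo grep -c "unable to create new native thread" /var/log/hadoop-yarn/yarn-yarn-nodemanager-$(hostname).log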

No native-thread errors have been registered in the past few hours; it looks like we are out of the woods, but I'll wait until Monday before declaring victory.

elukey triaged this task as High priority.
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Second day without any error!

No recurrences, closing!