When training MLR models I'm finding we have to massively over-allocate memory to the containers, far beyond what they actually need. Some custom monitoring (https://gerrit.wikimedia.org/r/395915 and https://gerrit.wikimedia.org/r/395592) shows that RSS as measured by the more accurate smaps-based algorithm holds steady between 6 and 7GB, but the default algorithm, which reads only the standard RSS value from /proc/$pid/stat, keeps climbing until YARN kills containers unless we over-allocate memory by almost double (13GB is a reasonable number to keep containers from dying).
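For context, the default measurement amounts to nothing more than reading the rss field out of /proc/$pid/stat. A minimal Python illustration (equivalent in spirit, not YARN's actual code):

import os

# Illustration only: read the rss field (24th field of /proc/$pid/stat,
# counted in pages) and convert to bytes. This is the single number the
# default algorithm relies on.
def rss_from_stat(pid):
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # comm (field 2) may contain spaces and parens, so split after the
    # last ')'; rss is then field 24 overall, index 21 of the remainder.
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[21]) * os.sysconf("SC_PAGE_SIZE")

That single number lumps anonymous and file-backed pages together, which is exactly why it balloons below.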
Best I can tell from tracking, RssFile (from /proc/$pid/status) typically sits around 50MB, but at times it grows at 50-100MB/s up to 5GB or so. Because these pages are backed by files they are not really taking memory, at least not memory that is unavailable to Linux; the kernel can reclaim these pages at any moment it finds convenient.
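The breakdown I'm logging comes straight from /proc/$pid/status. A quick sketch of the read (these fields exist on reasonably recent kernels; 4.5+, if I remember right):

# Sketch: pull the RSS breakdown out of /proc/$pid/status.
def rss_breakdown(pid):
    wanted = ("RssAnon", "RssFile", "RssShmem")
    out = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in wanted:
                out[key] = int(value.split()[0])  # values are "<n> kB"
    return out

Of the three, only RssFile is cheap for the kernel to reclaim, which is the point above.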
Example log from a Spark executor running XGBoost training: P6440
NodeManager log (filtered for the container in question): P6441
Most relevant is probably the message the NodeManager logs when it kills the container:
2017-12-07 04:28:29,257 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=25864,containerID=container_e54_1512469367986_4908_01_000007] is running beyond physical memory limits. Current usage: 10.2 GB of 10 GB physical memory used; 12.6 GB of 21 GB virtual memory used. Killing container.
Most recent log lines from my custom resource monitor running in the executor:
17/12/07 04:28:20 INFO ResourceMonitorThread: RssAnon: 6245908 kB, RssFile: 4076692 kB, RssShmem: 0 kB
17/12/07 04:28:20 INFO ResourceMonitorThread: init = 1054087168(1029382K) used = 2255993464(2203118K) committed = 3975675904(3882496K) max = 3975675904(3882496K)
17/12/07 04:28:20 INFO ResourceMonitorThread: smapinfo mem: 6257072 size: 11873596 rss: 10340516 length: 1171
Here RssAnon, RssFile and RssShmem come from /proc/$pid/status. The second line with init=... is the JVM's own report of its memory usage; basically it's saying it has a 4GB heap (and probably a little more off-heap). The final line is calculated the same way as YARN's smaps-based calculation, entirely from /proc/$pid/smaps: mem: 6257072 is the number of kB the more accurate count expects is actually in use, size and rss are the less accurate counts, and length is the number of maps the process has open.
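For anyone curious what that smaps-based calculation looks like: as I read Hadoop's ProcfsBasedProcessTree, it walks every mapping in /proc/$pid/smaps, skips read-only shared mappings, and per mapping sums min(Shared_Dirty, Pss) + Private_Dirty + Private_Clean. A rough Python equivalent (my reading of the Hadoop source, so treat it as a sketch rather than the exact YARN code):

# Rough equivalent of YARN's smaps-based accounting: per mapping, count
#   min(Shared_Dirty, Pss) + Private_Dirty + Private_Clean
# and skip read-only shared mappings (r--s, r-xs). Sketch only.
def smaps_based_rss_kb(pid):
    total = 0
    cur = {}    # field -> kB for the mapping being parsed
    perms = ""  # permission string of that mapping

    def flush():
        nonlocal total
        if cur and perms not in ("r--s", "r-xs"):
            total += (min(cur.get("Shared_Dirty", 0), cur.get("Pss", 0))
                      + cur.get("Private_Dirty", 0)
                      + cur.get("Private_Clean", 0))

    with open("/proc/%d/smaps" % pid) as f:
        for line in f:
            parts = line.split()
            if parts[0].endswith(":"):
                # "Field:  <n> kB" line; skip non-numeric ones (VmFlags)
                if len(parts) >= 2 and parts[1].isdigit():
                    cur[parts[0][:-1]] = int(parts[1])
            else:
                # header of a new mapping: "addr-addr perms offset dev inode"
                flush()
                cur, perms = {}, parts[1]
    flush()
    return total

In the smapinfo line above, mem: 6257072 is this sum; size and rss are presumably just the raw Size and Rss fields totaled across all mappings.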