
Enable more accurate smaps based RSS tracking by yarn nodemanager
Closed, Resolved · Public · Estimated Story Points: 3

Description

When training MLR models I'm finding we have to massively over-allocate memory to the containers, much more than they need. Some custom monitoring (https://gerrit.wikimedia.org/r/395915 and https://gerrit.wikimedia.org/r/395592) shows that RSS as measured by the more accurate algorithm holds a steady state between 6-7GB, but the default algorithm, which reads only the standard RSS value from /proc/$pid/stat, keeps increasing until YARN kills the containers unless we over-allocate memory by almost double (13GB is a reasonable number to keep containers from dying).
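For reference, that coarse number is just field 24 of /proc/$pid/stat, the resident set size in pages, with no distinction between anonymous and file-backed memory. A minimal sketch of reading it the same way (the StatRss class name and the 4 KiB page size are my assumptions, not the actual nodemanager code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Reads the coarse resident set size: field 24 of /proc/<pid>/stat, in pages.
public class StatRss {
    public static long rssBytes(long pid) throws IOException {
        String stat = new String(Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")));
        // The process name (field 2) can contain spaces, so split after the
        // closing ')' that terminates it; the rest is space separated.
        String afterComm = stat.substring(stat.lastIndexOf(')') + 2);
        String[] fields = afterComm.split(" ");
        // afterComm starts at field 3 (state), so rss (field 24) is index 21.
        long rssPages = Long.parseLong(fields[21]);
        return rssPages * 4096L;   // assumes 4 KiB pages
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rssBytes(Long.parseLong(args[0])) / 1024 + " kB resident");
    }
}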

Best I can tell from tracking, RssFile (/proc/$pid/status) typically sits around 50MB, but at times it grows at 50-100MB/s up to 5GB or so. Because these pages are backed by files they are not really consuming memory, at least not memory that is unavailable to Linux: the kernel can reclaim these pages at any moment it finds convenient.
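The breakdown that makes this visible is available directly from /proc/$pid/status on recent kernels. A minimal sketch along the lines of what my custom monitoring logs (the RssBreakdown class name is mine, not code from the patches above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Pulls the per-type RSS breakdown (RssAnon, RssFile, RssShmem, in kB)
// that /proc/<pid>/status exposes on recent kernels.
public class RssBreakdown {
    public static Map<String, Long> read(long pid) throws IOException {
        Map<String, Long> out = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/status"))) {
            // Lines look like: "RssFile:         4076692 kB"
            if (line.startsWith("RssAnon:") || line.startsWith("RssFile:") || line.startsWith("RssShmem:")) {
                String[] parts = line.split("\\s+");
                out.put(parts[0].replace(":", ""), Long.parseLong(parts[1]));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> rss = read(Long.parseLong(args[0]));
        // RssAnon (plus RssShmem) is what the process really holds on to;
        // RssFile is page cache the kernel can reclaim under pressure.
        System.out.printf("anon=%d kB file=%d kB shmem=%d kB%n",
                rss.getOrDefault("RssAnon", 0L),
                rss.getOrDefault("RssFile", 0L),
                rss.getOrDefault("RssShmem", 0L));
    }
}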

Example log from spark executor running xgboost to do training: P6440
Nodemanager log (filtered for the container in question): P6441

Most relevant is probably the message from nodemanager when it kills the container:

2017-12-07 04:28:29,257 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=25864,containerID=container_e54_1512469367986_4908_01_000007] is running beyond physical memory limits. Current usage: 10.2 GB of 10 GB physical memory used; 12.6 GB of 21 GB virtual memory used. Killing container.

Most recent log from my custom resource monitor running in the executor:

17/12/07 04:28:20 INFO ResourceMonitorThread: RssAnon:	 6245908 kB, RssFile:	 4076692 kB, RssShmem:	       0 kB
17/12/07 04:28:20 INFO ResourceMonitorThread: init = 1054087168(1029382K) used = 2255993464(2203118K) committed = 3975675904(3882496K) max = 3975675904(3882496K)
17/12/07 04:28:20 INFO ResourceMonitorThread: smapinfo mem: 6257072 size: 11873596 rss: 10340516 length: 1171

Here RssAnon, RssFile and RssShmem come from /proc/$pid/status. The second line (init = ...) is the JVM's own report of its memory usage; it basically says the process has a 4GB heap (plus a little more off-heap). The final line is calculated the same way as YARN's smaps-based accounting, entirely from /proc/$pid/smaps: mem: 6257072 is the number of kB the more accurate count believes is used, size and rss are the less accurate counts, and length is the number of mappings the process has open.
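As I understand the smaps-based accounting the nodemanager uses when the option is enabled, it walks every mapping in /proc/$pid/smaps, skips read-only shared file mappings, and charges min(Shared_Dirty, Pss) + Private_Dirty + Private_Clean for the rest. A simplified sketch of that calculation (the SmapsRssEstimate class name is mine, and it elides corner cases the real implementation handles):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified smaps-based RSS estimate, in kB: per mapping, skip read-only
// shared file mappings and sum min(Shared_Dirty, Pss) + Private_Dirty + Private_Clean.
public class SmapsRssEstimate {
    // Mapping header: "7f0e8c000000-7f0e8c021000 rw-p 00000000 00:00 0 ..."
    private static final Pattern HEADER =
            Pattern.compile("^([0-9a-f]+)-([0-9a-f]+)\\s+(\\S+)\\s+.*");

    public static long estimateKb(long pid) throws IOException {
        long total = 0;
        long sharedDirty = 0, pss = 0, privateDirty = 0, privateClean = 0;
        boolean skipCurrent = false;
        boolean inMapping = false;
        try (BufferedReader r = new BufferedReader(new FileReader("/proc/" + pid + "/smaps"))) {
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m = HEADER.matcher(line);
                if (m.matches()) {
                    // Close out the previous mapping before starting a new one.
                    if (inMapping && !skipCurrent) {
                        total += Math.min(sharedDirty, pss) + privateDirty + privateClean;
                    }
                    String perm = m.group(3);
                    // Read-only shared file mappings (e.g. "r--s", "r-xs") are pure
                    // page cache and are not charged to the process.
                    skipCurrent = perm.equals("r--s") || perm.equals("r-xs");
                    sharedDirty = pss = privateDirty = privateClean = 0;
                    inMapping = true;
                } else if (line.startsWith("Shared_Dirty:")) {
                    sharedDirty = kb(line);
                } else if (line.startsWith("Pss:")) {
                    pss = kb(line);
                } else if (line.startsWith("Private_Dirty:")) {
                    privateDirty = kb(line);
                } else if (line.startsWith("Private_Clean:")) {
                    privateClean = kb(line);
                }
            }
            if (inMapping && !skipCurrent) {
                total += Math.min(sharedDirty, pss) + privateDirty + privateClean;
            }
        }
        return total;
    }

    private static long kb(String line) {
        return Long.parseLong(line.split("\\s+")[1]);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(estimateKb(Long.parseLong(args[0])) + " kB (smaps-based estimate)");
    }
}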

Event Timeline

Change 395923 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet/cdh@master] Enable more accurate smaps based rss checking

https://gerrit.wikimedia.org/r/395923

For a little more background on what I think is happening:

  • The executors that die seem to be the ones caching the training RDDs
  • We run 30 to 100 model trainings in parallel, meaning the couple of executors initially caching the RDDs are asked for the same data by many other executors
  • These RDDs are a bit big and don't fit in memory, so most commonly they are serialized and stored on disk

My theory:

  • Each time Spark is asked to serve up a cached RDD from disk it memory-maps the block and streams it out to the other executor (see the sketch after this list)
  • The reason we typically see executors dying shortly after training starts is that, since they have the data locally, they start training first; while training they are also asked to ship the data to other nodes
  • RssFile memory usage grows by a multiple of the data size and the number of executors requesting the data.
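A small standalone demonstration of the mechanism theorized above: touching a large memory-mapped file drives RssFile up even though the pages are clean page cache the kernel can drop at will. This is not Spark's code, just a sketch (the MmapRssDemo class name, the 1GB mapping window and the 4KiB stride are mine); point it at a multi-GB file and watch RssFile in /proc/$pid/status:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Maps a large file read-only and faults every page in, inflating RssFile
// for this process even though the pages remain reclaimable page cache.
public class MmapRssDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel ch = raf.getChannel()) {
            long size = ch.size();
            long sum = 0;
            // Map and touch the file in ~2GB windows (the map() size limit),
            // mimicking a cached, disk-persisted block being mapped and streamed out.
            for (long off = 0; off < size; off += Integer.MAX_VALUE) {
                long len = Math.min(Integer.MAX_VALUE, size - off);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, off, len);
                for (long i = 0; i < len; i += 4096) {
                    sum += buf.get((int) i);   // fault each page in, growing RssFile
                }
            }
            System.out.println("checksum=" + sum + "; now check RssFile in /proc/<pid>/status");
            Thread.sleep(60_000);   // leave time to inspect the process
        }
    }
}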

Change 395923 merged by Elukey:
[operations/puppet/cdh@master] Enable more accurate smaps based rss checking

https://gerrit.wikimedia.org/r/395923

Mentioned in SAL (#wikimedia-operations) [2017-12-13T14:01:17Z] <elukey> restart Yarn nodemanagers on analytics102[8,9] to apply new settings - T182276

elukey triaged this task as Medium priority. Dec 13 2017, 2:19 PM

Mentioned in SAL (#wikimedia-operations) [2017-12-15T10:31:17Z] <elukey> rolling restart of yarn nodemanagers on an103* to apply new config - T182276

Mentioned in SAL (#wikimedia-operations) [2017-12-18T08:57:24Z] <elukey> rolling restart of the Yarn nodemanagers (hadoop) on analytics10[456]* to pick up new settings - T182276

@EBernhardson the config should now be live everywhere. I keep checking metrics in https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1 but I haven't seen any weird change in behavior for the Yarn node managers. Let me know if you see anything different or if anything is missing before closing!

elukey set the point value for this task to 3. Dec 18 2017, 2:27 PM

Thanks! I'll try this out this week and see how things go.