
Enable more accurate smaps based RSS tracking by yarn nodemanager
Closed, Resolved · Public · Estimated Story Points: 3

Description

When training MLR models I'm finding we have to massively over-allocate memory to the containers, much more than they need. Some custom monitoring (https://gerrit.wikimedia.org/r/395915 and https://gerrit.wikimedia.org/r/395592) shows that RSS as measured by the more accurate algorithm holds a steady state between 6-7GB, but the default algorithm, which reads only the standard RSS value from /proc/$pid/stat, keeps increasing until YARN kills the containers unless we over-allocate memory by almost double (13GB is a reasonable number to keep containers from dying).
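For reference, that coarse number is just field 24 of /proc/$pid/stat, the resident set size in pages, with no distinction between anonymous and file-backed memory. A minimal sketch of reading it the same way (the StatRss class name and the 4 KiB page size are my assumptions, not the actual nodemanager code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Reads the coarse resident set size: field 24 of /proc/<pid>/stat, in pages.
public class StatRss {
    public static long rssBytes(long pid) throws IOException {
        String stat = new String(Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")));
        // The process name (field 2) can contain spaces, so split after the
        // closing ')' that terminates it; the rest is space separated.
        String afterComm = stat.substring(stat.lastIndexOf(')') + 2);
        String[] fields = afterComm.split(" ");
        // afterComm starts at field 3 (state), so rss (field 24) is index 21.
        long rssPages = Long.parseLong(fields[21]);
        return rssPages * 4096L;   // assumes 4 KiB pages
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rssBytes(Long.parseLong(args[0])) / 1024 + " kB resident");
    }
}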

Best I can tell from tracking, RssFile (/proc/$pid/status) typically sits around 50MB, but at times it grows at 50-100MB/s up to 5GB or so. Because these pages are backed by files they are not really consuming memory, at least not memory that is unavailable to Linux: the kernel can reclaim these pages at any moment it finds convenient.
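The breakdown that makes this visible is available directly from /proc/$pid/status on recent kernels. A minimal sketch along the lines of what my custom monitoring logs (the RssBreakdown class name is mine, not code from the patches above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Pulls the per-type RSS breakdown (RssAnon, RssFile, RssShmem, in kB)
// that /proc/<pid>/status exposes on recent kernels.
public class RssBreakdown {
    public static Map<String, Long> read(long pid) throws IOException {
        Map<String, Long> out = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/status"))) {
            // Lines look like: "RssFile:         4076692 kB"
            if (line.startsWith("RssAnon:") || line.startsWith("RssFile:") || line.startsWith("RssShmem:")) {
                String[] parts = line.split("\\s+");
                out.put(parts[0].replace(":", ""), Long.parseLong(parts[1]));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> rss = read(Long.parseLong(args[0]));
        // RssAnon (plus RssShmem) is what the process really holds on to;
        // RssFile is page cache the kernel can reclaim under pressure.
        System.out.printf("anon=%d kB file=%d kB shmem=%d kB%n",
                rss.getOrDefault("RssAnon", 0L),
                rss.getOrDefault("RssFile", 0L),
                rss.getOrDefault("RssShmem", 0L));
    }
}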

Example log from spark executor running xgboost to do training: P6440
Nodemanager log (filtered for the container in question): P6441

Most relevant is probably the message from nodemanager when it kills the container:

2017-12-07 04:28:29,257 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=25864,containerID=container_e54_1512469367986_4908_01_000007] is running beyond physical memory limits. Current usage: 10.2 GB of 10 GB physical memory used; 12.6 GB of 21 GB virtual memory used. Killing container.

Most recent log from my custom resource monitor running in the executor:

17/12/07 04:28:20 INFO ResourceMonitorThread: RssAnon:	 6245908 kB, RssFile:	 4076692 kB, RssShmem:	       0 kB
17/12/07 04:28:20 INFO ResourceMonitorThread: init = 1054087168(1029382K) used = 2255993464(2203118K) committed = 3975675904(3882496K) max = 3975675904(3882496K)
17/12/07 04:28:20 INFO ResourceMonitorThread: smapinfo mem: 6257072 size: 11873596 rss: 10340516 length: 1171

Here RssAnon, RssFile and RssShmem come from /proc/$pid/status. The second line (init = ...) is the JVM's own report of its memory usage; it basically says the process has a 4GB heap (plus a little more off-heap). The final line is calculated the same way as YARN's smaps-based accounting, entirely from /proc/$pid/smaps: mem: 6257072 is the number of kB the more accurate count believes is used, size and rss are the less accurate counts, and length is the number of mappings the process has open.
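As I understand the smaps-based accounting the nodemanager uses when the option is enabled, it walks every mapping in /proc/$pid/smaps, skips read-only shared file mappings, and charges min(Shared_Dirty, Pss) + Private_Dirty + Private_Clean for the rest. A simplified sketch of that calculation (the SmapsRssEstimate class name is mine, and it elides corner cases the real implementation handles):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified smaps-based RSS estimate, in kB: per mapping, skip read-only
// shared file mappings and sum min(Shared_Dirty, Pss) + Private_Dirty + Private_Clean.
public class SmapsRssEstimate {
    // Mapping header: "7f0e8c000000-7f0e8c021000 rw-p 00000000 00:00 0 ..."
    private static final Pattern HEADER =
            Pattern.compile("^([0-9a-f]+)-([0-9a-f]+)\\s+(\\S+)\\s+.*");

    public static long estimateKb(long pid) throws IOException {
        long total = 0;
        long sharedDirty = 0, pss = 0, privateDirty = 0, privateClean = 0;
        boolean skipCurrent = false;
        boolean inMapping = false;
        try (BufferedReader r = new BufferedReader(new FileReader("/proc/" + pid + "/smaps"))) {
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m = HEADER.matcher(line);
                if (m.matches()) {
                    // Close out the previous mapping before starting a new one.
                    if (inMapping && !skipCurrent) {
                        total += Math.min(sharedDirty, pss) + privateDirty + privateClean;
                    }
                    String perm = m.group(3);
                    // Read-only shared file mappings (e.g. "r--s", "r-xs") are pure
                    // page cache and are not charged to the process.
                    skipCurrent = perm.equals("r--s") || perm.equals("r-xs");
                    sharedDirty = pss = privateDirty = privateClean = 0;
                    inMapping = true;
                } else if (line.startsWith("Shared_Dirty:")) {
                    sharedDirty = kb(line);
                } else if (line.startsWith("Pss:")) {
                    pss = kb(line);
                } else if (line.startsWith("Private_Dirty:")) {
                    privateDirty = kb(line);
                } else if (line.startsWith("Private_Clean:")) {
                    privateClean = kb(line);
                }
            }
            if (inMapping && !skipCurrent) {
                total += Math.min(sharedDirty, pss) + privateDirty + privateClean;
            }
        }
        return total;
    }

    private static long kb(String line) {
        return Long.parseLong(line.split("\\s+")[1]);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(estimateKb(Long.parseLong(args[0])) + " kB (smaps-based estimate)");
    }
}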

Event Timeline

Change 395923 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet/cdh@master] Enable more accurate smaps based rss checking

https://gerrit.wikimedia.org/r/395923

For a little more background on what I think is happening:

  • The executors that die seem to be the ones caching the training RDDs
  • We run 30 to 100 model trainings in parallel, meaning the couple of executors initially caching the RDDs are asked for the same data by many other executors
  • These RDDs are a bit big and don't fit in memory, so most commonly they are serialized and stored on disk

My theory:

  • Each time Spark is asked to serve up a cached RDD from disk it memory-maps the block and streams it out to the other executor (see the sketch after this list)
  • The reason we typically see executors dying shortly after training starts is that, since they have the data locally, they start training first; while training they are also asked to ship the data to other nodes
  • RssFile memory usage grows by a multiple of the data size and the number of executors requesting the data.
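A small standalone demonstration of the mechanism theorized above: touching a large memory-mapped file drives RssFile up even though the pages are clean page cache the kernel can drop at will. This is not Spark's code, just a sketch (the MmapRssDemo class name, the 1GB mapping window and the 4KiB stride are mine); point it at a multi-GB file and watch RssFile in /proc/$pid/status:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Maps a large file read-only and faults every page in, inflating RssFile
// for this process even though the pages remain reclaimable page cache.
public class MmapRssDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel ch = raf.getChannel()) {
            long size = ch.size();
            long sum = 0;
            // Map and touch the file in ~2GB windows (the map() size limit),
            // mimicking a cached, disk-persisted block being mapped and streamed out.
            for (long off = 0; off < size; off += Integer.MAX_VALUE) {
                long len = Math.min(Integer.MAX_VALUE, size - off);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, off, len);
                for (long i = 0; i < len; i += 4096) {
                    sum += buf.get((int) i);   // fault each page in, growing RssFile
                }
            }
            System.out.println("checksum=" + sum + "; now check RssFile in /proc/<pid>/status");
            Thread.sleep(60_000);   // leave time to inspect the process
        }
    }
}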

Change 395923 merged by Elukey:
[operations/puppet/cdh@master] Enable more accurate smaps based rss checking

https://gerrit.wikimedia.org/r/395923

Mentioned in SAL (#wikimedia-operations) [2017-12-13T14:01:17Z] <elukey> restart Yarn nodemanagers on analytics102[8,9] to apply new settings - T182276

elukey triaged this task as Medium priority. Dec 13 2017, 2:19 PM

Mentioned in SAL (#wikimedia-operations) [2017-12-15T10:31:17Z] <elukey> rolling restart of yarn nodemanagers on an103* to apply new config - T182276

Mentioned in SAL (#wikimedia-operations) [2017-12-18T08:57:24Z] <elukey> rolling restart of the Yarn nodemanagers (hadoop) on analytics10[456]* to pick up new settings - T182276

@EBernhardson the config should now be live everywhere. I keep checking metrics in https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1 but I haven't seen any weird change in behavior for the Yarn node managers. Let me know if you see anything different or if anything is missing before closing!

elukey set the point value for this task to 3. Dec 18 2017, 2:27 PM

Thanks! I'll try this out this week and see how things go.