
Evaluate training speed and accuracy for 1M and 30M sample training sets with different worker counts
Closed, Resolved (Public)

Description

To efficiently use our training resources we need to know how much parallelism to use for each portion of the training. There are a few levels of parallelism:

  • The number of workers used to train a single model
  • The number of models trained in parallel for cross-validation
  • The number of cross-validations run in parallel

Primarily this ticket is concerned with the first type, although some information about how things scale when more models are trained in parallel could also be useful.
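To make the three levels concrete, here is a rough Python sketch of how they nest. The function names (train_one_model, cross_validate, tune_params) are hypothetical placeholders rather than mjolnir APIs, and the thread pools only illustrate where each level of parallelism would apply.

```python
from concurrent.futures import ThreadPoolExecutor

def train_one_model(params, fold, n_workers):
    # Level 1: a single model, trained by n_workers Spark tasks.
    # Placeholder body; real code would launch a distributed xgboost job.
    return 0.0  # hypothetical ndcg@10 for this fold

def cross_validate(params, folds, n_workers, models_in_parallel):
    # Level 2: the fold-models of one cross-validation, trained concurrently.
    with ThreadPoolExecutor(max_workers=models_in_parallel) as pool:
        scores = list(pool.map(
            lambda fold: train_one_model(params, fold, n_workers), folds))
    return sum(scores) / len(scores)

def tune_params(param_grid, folds, n_workers, models_in_parallel, cvs_in_parallel):
    # Level 3: several cross-validations (one per parameter set) run in parallel.
    with ThreadPoolExecutor(max_workers=cvs_in_parallel) as pool:
        return list(pool.map(
            lambda params: cross_validate(
                params, folds, n_workers, models_in_parallel),
            param_grid))
```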

Event Timeline

With 9 features, a 1.1M sample training set, and a 10M sample test set:

n_workers  took (s)  cpu took (s)  holdout-test ndcg@10
1          134.8     134.8         0.8274
2           92.5     185.0         0.8283
3           84.8     254.4         0.8275
4           79.3     317.2         0.8280
5           91.4     457.0         0.8277

At least for this sample it looks like 1 or 2 workers is all we need. A single worker is the most efficient use of CPU, while 4 workers gives the fastest wall-clock time. Going from 1 to 4 workers costs roughly 2.4x the CPU time to do the same thing, though. Going from 1 to 2 workers is about 1.4x less efficient in CPU terms, but finishes roughly 1.5x sooner.
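For reference, those ratios can be recomputed directly from the table above; this is just a quick Python check with the table rows re-entered.

```python
# (wall-clock seconds, total cpu seconds) per worker count, from the table above
timings = {
    1: (134.8, 134.8),
    2: (92.5, 185.0),
    3: (84.8, 254.4),
    4: (79.3, 317.2),
    5: (91.4, 457.0),
}

base_wall, base_cpu = timings[1]
for n, (wall, cpu) in sorted(timings.items()):
    print(f"{n} workers: {cpu / base_cpu:.2f}x cpu cost, "
          f"{base_wall / wall:.2f}x speedup vs 1 worker")
# 2 workers: ~1.37x the cpu, ~1.46x faster; 4 workers: ~2.35x the cpu, ~1.70x faster.
```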

Running into Hadoop limitations while testing the 30M sample set; moving this to the backlog for now.

I seem to have misplaced the raw data used for this ... but to summarize:

If the data all fits on a single executor, that is the most efficient use of cluster resources. This may not be the fastest way to train an individual model, but when doing hyperparameter tuning we are generally training many models in parallel, and the lowest total CPU time per model comes from using a single executor.

Training speed vs. core count stays relatively flat up to about 6 cores. Less parallelism again makes more efficient use of total cluster CPU, but up to 6 cores the efficiency penalty is minimal. Beyond 6 cores the efficiency loss grows at an increasing rate.

Overall suggestions:

  • Train models with 4 or 6 cores per executor.
  • Aim for a single executor if reasonable.
  • Limitation: the cluster has ~2GB of memory per core, so the training data (with duplicates, due to Spark storage, task data, and the xgboost DMatrix copy in C++) needs to fit in 4*2 or 6*2 GB of memory. This is quite reasonable with our current feature count, but may need to be revisited if we dramatically increase the number of features used.
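As a rough illustration only, these suggestions could translate into Spark executor settings along these lines; the app name is a placeholder and the memory figure simply assumes the ~2GB-per-core budget mentioned above.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mjolnir-training")                # placeholder name
    .config("spark.executor.instances", "1")    # aim for a single executor
    .config("spark.executor.cores", "6")        # 4-6 cores per executor
    .config("spark.executor.memory", "12g")     # 6 cores * ~2 GB per core
    .getOrCreate()
)
```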

Other:

  • A memory allocation that is sufficient for training a single model will regularly be overrun when used for training in mjolnir. We need to over-provision memory relative to what it takes to spin up a Spark instance and train a single model. Perhaps this is some sort of leak, or late de-allocation, in xgboost; unsure.
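One hedged guess at how that over-provisioning might be expressed in configuration: since the xgboost DMatrix copy lives outside the JVM heap, the YARN executor memory overhead setting is a plausible knob (the property name varies by Spark version). The value below is illustrative, not measured.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mjolnir-training")                            # placeholder name
    .config("spark.yarn.executor.memoryOverhead", "4096")   # MB of extra off-heap headroom; assumed value
    .getOrCreate()
)
```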

We should document this info somewhere... but no more work needs to be done on this at this time.

Change 387442 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/MjoLniR@master] Update resource usage docs

https://gerrit.wikimedia.org/r/387442

Change 387442 merged by DCausse:
[search/MjoLniR@master] Update resource usage docs

https://gerrit.wikimedia.org/r/387442