
Evaluate training speed and accuracy for 1M and 30M sample training sets with different worker counts
Closed, Resolved (Public)

Description

To efficiently use our training resources we need to know how much parallelism to use for each portion of the training. There are a few levels of parallelism:

  • The number of workers used to train a single model
  • The number of models trained in parallel for cross-validation
  • The number of cross-validations run in parallel

Primarily this ticket is concerned with the first type, although some information about how things scale when more models are trained in parallel could also be useful.
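To make the three levels concrete, here is a rough Python sketch of how they nest. The function names (train_one_model, cross_validate, tune_params) are hypothetical placeholders rather than mjolnir APIs, and the thread pools only illustrate where each level of parallelism would apply.

```python
from concurrent.futures import ThreadPoolExecutor

def train_one_model(params, fold, n_workers):
    # Level 1: a single model, trained by n_workers Spark tasks.
    # Placeholder body; real code would launch a distributed xgboost job.
    return 0.0  # hypothetical ndcg@10 for this fold

def cross_validate(params, folds, n_workers, models_in_parallel):
    # Level 2: the fold-models of one cross-validation, trained concurrently.
    with ThreadPoolExecutor(max_workers=models_in_parallel) as pool:
        scores = list(pool.map(
            lambda fold: train_one_model(params, fold, n_workers), folds))
    return sum(scores) / len(scores)

def tune_params(param_grid, folds, n_workers, models_in_parallel, cvs_in_parallel):
    # Level 3: several cross-validations (one per parameter set) run in parallel.
    with ThreadPoolExecutor(max_workers=cvs_in_parallel) as pool:
        return list(pool.map(
            lambda params: cross_validate(
                params, folds, n_workers, models_in_parallel),
            param_grid))
```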

Event Timeline

With 9 features, a 1.1M sample training set, and a 10M sample test set:

n_workers  took (s)  cpu took (s)  holdout-test ndcg@10
1          134.8     134.8         0.8274
2           92.5     185.0         0.8283
3           84.8     254.4         0.8275
4           79.3     317.2         0.8280
5           91.4     457.0         0.8277

At least for this sample it looks like 1 or 2 workers is all we need. A single worker is the most efficient use of CPU, while 4 workers gives the fastest wall-clock time. Going from 1 to 4 workers costs roughly 2.4x the CPU time to do the same thing, though. Going from 1 to 2 workers is about 1.4x less efficient in CPU terms, but finishes roughly 1.5x sooner.
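For reference, those ratios can be recomputed directly from the table above; this is just a quick Python check with the table rows re-entered.

```python
# (wall-clock seconds, total cpu seconds) per worker count, from the table above
timings = {
    1: (134.8, 134.8),
    2: (92.5, 185.0),
    3: (84.8, 254.4),
    4: (79.3, 317.2),
    5: (91.4, 457.0),
}

base_wall, base_cpu = timings[1]
for n, (wall, cpu) in sorted(timings.items()):
    print(f"{n} workers: {cpu / base_cpu:.2f}x cpu cost, "
          f"{base_wall / wall:.2f}x speedup vs 1 worker")
# 2 workers: ~1.37x the cpu, ~1.46x faster; 4 workers: ~2.35x the cpu, ~1.70x faster.
```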

Running into Hadoop limitations while testing the 30M sample set; moving this to the backlog for now.

I seem to have misplaced the raw data used for this ... but to summarize:

If the data all fits on a single executor, that is the most efficient use of cluster resources. This may not be the fastest way to train an individual model, but when doing hyperparameter tuning we are generally training many models in parallel, and the lowest total CPU time per model comes from using a single executor.

Training speed vs. core count stays relatively flat up to about 6 cores. Less parallelism again makes more efficient use of total cluster CPU, but up to 6 cores the efficiency penalty is minimal. Beyond 6 cores the efficiency loss grows at an increasing rate.

Overall suggestions:

  • Train models with 4 or 6 cores per executor.
  • Aim for a single executor if reasonable.
  • Limitation: the cluster has ~2GB of memory per core, so the training data (with duplicates, due to Spark storage, task data, and the xgboost DMatrix copy in C++) needs to fit in 4*2 or 6*2 GB of memory. This is quite reasonable with our current feature count, but may need to be revisited if we dramatically increase the number of features used.
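As a rough illustration only, these suggestions could translate into Spark executor settings along these lines; the app name is a placeholder and the memory figure simply assumes the ~2GB-per-core budget mentioned above.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mjolnir-training")                # placeholder name
    .config("spark.executor.instances", "1")    # aim for a single executor
    .config("spark.executor.cores", "6")        # 4-6 cores per executor
    .config("spark.executor.memory", "12g")     # 6 cores * ~2 GB per core
    .getOrCreate()
)
```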

Other:

  • A memory allocation that is sufficient for training a single model will regularly be overrun when used for training in mjolnir. We need to over-provision memory relative to what it takes to spin up a Spark instance and train a single model. Perhaps this is some sort of leak, or late de-allocation, in xgboost; unsure.
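One hedged guess at how that over-provisioning might be expressed in configuration: since the xgboost DMatrix copy lives outside the JVM heap, the YARN executor memory overhead setting is a plausible knob (the property name varies by Spark version). The value below is illustrative, not measured.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("mjolnir-training")                            # placeholder name
    .config("spark.yarn.executor.memoryOverhead", "4096")   # MB of extra off-heap headroom; assumed value
    .getOrCreate()
)
```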

We should document this info somewhere... but no more work needs to be done on this at this time.

Change 387442 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/MjoLniR@master] Update resource usage docs

https://gerrit.wikimedia.org/r/387442

Change 387442 merged by DCausse:
[search/MjoLniR@master] Update resource usage docs

https://gerrit.wikimedia.org/r/387442