[Investigate] Hadoop integration for ORES training
Open, Low, Public

Description

See also T168913. There might be some big gains if we're able to efficiently use CPUs when running multiple training jobs, for example when we have to rebuild models for all languages after introducing breaking changes to model serialization or other parameters.

@EBernhardson seems to be using xgboost to train models, which includes Hadoop integration. He might have some experience to share with us.

It's possible to distribute our existing framework across Hadoop using pure Python; see PySpark and also https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/

Finally, we might be able to train as before, but export the trained scikit-learn models as PMML and do testing steps on Hadoop.

awight created this task. Jul 14 2017, 12:21 AM
Restricted Application added a subscriber: Aklapper. Jul 14 2017, 12:21 AM
awight updated the task description. Jul 14 2017, 4:36 AM
EBernhardson added a comment. Edited Jul 15 2017, 1:55 AM

I toyed around a bit with adjusting revscoring tune to use Spark for the parallelization. Running against the enwiki damaging model on a mostly idle 40-core server, I get a runtime of 4m39s without Spark. With some hacked-in Spark integration that drops to 1m51s. Really, though, starting from about 5 minutes we're not going to get much more improvement, due to Amdahl's law (spinning up Spark and building the observation set take about half this time).
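
To make the Amdahl's law point concrete, here is a minimal sketch in Python that plugs in the two runtimes reported above. The serial portion is an assumption: it reads "about half this time" as roughly half of the 1m51s Spark run, and it treats that portion as fixed no matter how many workers are added. The function and variable names are made up for illustration and are not part of revscoring.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: overall speedup when only the non-serial part
    of a job is spread evenly across `workers`."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Runtimes reported in the comment above.
baseline_s = 4 * 60 + 39   # 279 s without Spark
spark_s = 1 * 60 + 51      # 111 s with the hacked-in Spark integration

# Speedup actually observed, roughly 2.5x.
observed_speedup = baseline_s / spark_s

# ASSUMPTION: about half of the Spark run (Spark startup plus building
# the observation set) is serial and stays constant.
serial_s = spark_s / 2
serial_fraction = serial_s / baseline_s

# With unlimited workers the parallel part shrinks toward zero, so the
# runtime can never drop below serial_s and the speedup is capped.
max_speedup = 1.0 / serial_fraction

print(f"observed speedup: {observed_speedup:.2f}x")
print(f"assumed serial time: {serial_s:.1f} s")
print(f"upper bound on speedup: {max_speedup:.2f}x")
print(f"predicted speedup on 40 workers: {amdahl_speedup(serial_fraction, 40):.2f}x")
```

Under that assumption the ceiling is only about twice the speedup already observed, which is why a single 5-minute tuning run is a poor target. Batches of many runs, such as rebuilding the models for every language, have much more parallel work to spread out.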

Looking at some results of the run, it also appears the 60s CV timeout doesn't work in the Spark version for some reason. I tried lowering it to 15s as a test, but some jobs still ran for more than 60s.

Halfak triaged this task as Low priority. Jul 20 2017, 2:51 PM
Halfak moved this task from Backlog to New development on the Scoring-platform-team board.
fdans moved this task from Incoming to Radar on the Analytics board. Jul 24 2017, 3:55 PM
Sumit added a subscriber: Sumit. Jul 24 2017, 4:38 PM