
[Investigate] Hadoop integration for ORES training
Closed, Resolved · Public

Description

See also T168913. There might be big gains if we're able to use CPUs efficiently when running multiple training jobs, for example when we have to rebuild models for all languages after introducing breaking changes to model serialization or other parameters.

@EBernhardson seems to be using xgboost, which includes Hadoop integration, to train models. He might have some experience to share with us.

It's possible to distribute our existing framework across Hadoop using pure Python; see PySpark and also https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/
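As a rough illustration of that option, here is a minimal sketch of fanning a cross-validation parameter sweep out over PySpark instead of a local process pool. The parameter grid, the classifier choice, and the synthetic data are stand-in assumptions, not revscoring's actual tuning code.

```python
# Hedged sketch (not the actual revscoring code): distribute CV over Spark.
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def evaluate(params, X, y):
    """Run 5-fold CV for one parameter set and return (params, mean score)."""
    model = GradientBoostingClassifier(**params)
    return params, cross_val_score(model, X, y, cv=5).mean()


if __name__ == "__main__":
    sc = SparkContext(appName="revscoring-tune-sketch")

    # Stand-in for the real feature observations.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_bc, y_bc = sc.broadcast(X), sc.broadcast(y)

    # Hypothetical parameter grid; the real tune config would come from a YAML.
    param_grid = [{"n_estimators": n, "max_depth": d}
                  for n in (100, 300, 500) for d in (3, 5, 7)]

    # Each executor trains/scores one parameter set; the driver keeps the best.
    results = (sc.parallelize(param_grid, len(param_grid))
                 .map(lambda p: evaluate(p, X_bc.value, y_bc.value))
                 .collect())
    best_params, best_score = max(results, key=lambda r: r[1])
    print(best_params, best_score)
```

Broadcasting the observation matrix once keeps each executor from re-reading it for every parameter set, which matters more as the grid grows.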

Finally, we might be able to train as before, but export the trained scikit-learn models as PMML and run the testing steps on Hadoop.
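One commonly used route for that export is the third-party sklearn2pmml package; the sketch below is an assumption about how it could look, not something ORES currently does, and the file name is made up.

```python
# Hedged sketch: train locally, serialize the fitted model to PMML so the
# evaluation/testing side can run it on Hadoop with a JVM PMML evaluator.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Stand-in data; in practice this would be the extracted feature observations.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipeline = PMMLPipeline([("classifier", GradientBoostingClassifier())])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "damaging.enwiki.pmml")
```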

Event Timeline

I toyed around a bit with adjusting revscoring tune to use Spark for the parallelization. Running against the enwiki damaging model on a mostly idle 40-core server, I get a runtime of 4m39s without Spark. With some hacked-in Spark integration that drops to 1m51s. Realistically, though, starting from about 5 minutes we're not going to squeeze out much more improvement, due to Amdahl's law: spinning up Spark and building the observation set takes about half of that remaining time.
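A quick back-of-the-envelope check using the numbers above (the "half of 1m51s is serial" estimate is taken from the comment, everything else follows from it):

```python
# Rough Amdahl's-law check on the numbers above (times in seconds).
baseline = 4 * 60 + 39           # 279s without Spark
with_spark = 1 * 60 + 51         # 111s with Spark on ~40 cores
serial = with_spark / 2          # ~55s of Spark startup + observation loading

floor = serial                   # runtime floor even with unlimited parallelism
max_speedup = baseline / serial  # ~5x is the most this approach can ever gain
print(f"floor ~ {floor:.0f}s, max speedup ~ {max_speedup:.1f}x")
```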

Looking at some results of the run, it also appears the 60s CV timeout doesn't work in the Spark version for some reason. I tried lowering it to 15s as a test, but some jobs still ran for >60s.
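If the in-process timeout really is being ignored inside Spark executors, one possible workaround (purely an assumption, not tied to how revscoring implements its timeout) is to enforce a hard limit from outside by running each CV job in a child process and killing it when the deadline passes:

```python
# Hypothetical workaround sketch: hard per-job timeout via a child process.
import multiprocessing as mp


def _run(func, args, queue):
    try:
        queue.put(("ok", func(*args)))
    except Exception as exc:      # report failures instead of hanging the parent
        queue.put(("error", exc))


def with_timeout(func, args, seconds):
    """Return func(*args), or None if it doesn't finish within `seconds`."""
    queue = mp.Queue()
    proc = mp.Process(target=_run, args=(func, args, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()          # hard kill: treat the job as timed out
        proc.join()
        return None
    status, value = queue.get()
    return value if status == "ok" else None
```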