See also T168913. There could be big gains from using CPUs efficiently when running multiple training jobs in parallel, for example when we have to rebuild models for all languages after introducing breaking changes to model serialization or other parameters.
@EBernhardson seems to be using XGBoost, which includes Hadoop integration, to train models. He might have experience to share with us.
It's possible to distribute our existing framework across Hadoop using pure Python; see PySpark, and also https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/ on shipping a virtualenv to the cluster. A rough sketch of that approach follows below.
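A minimal sketch of what the PySpark route could look like, assuming one independent scikit-learn model per language. The `load_training_data()` helper and the `LANGUAGES` list are placeholders, not existing code, and in practice our framework's virtualenv would need to be shipped to the executors as described in the post above:

```
# Hypothetical sketch: fan one scikit-learn training job per language out
# across a Hadoop/YARN cluster with PySpark.
import pickle

from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

LANGUAGES = ["en", "de", "fr", "es"]  # placeholder list

def load_training_data(lang):
    # Placeholder: stands in for loading the real per-language
    # features and labels, e.g. from HDFS.
    return make_classification(n_samples=1000, n_features=20,
                               random_state=hash(lang) % 2**32)

def train_one(lang):
    # Runs on an executor: train a single per-language model.
    X, y = load_training_data(lang)
    model = LogisticRegression()
    model.fit(X, y)
    # Pickle so the driver can collect and persist the trained model.
    return lang, pickle.dumps(model)

sc = SparkContext(appName="per-language-training")
# One partition per language, so each model trains on its own executor.
models = dict(sc.parallelize(LANGUAGES, len(LANGUAGES)).map(train_one).collect())
sc.stop()
```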
Finally, we might be able to train as before, but export the trained scikit-learn models as PMML and run the testing/scoring steps on Hadoop (rough sketch below).
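As a rough illustration of that export step, assuming the sklearn2pmml package (which we haven't evaluated; it shells out to the JPMML converter, so a Java runtime is required). The resulting .pmml file could then be scored on Hadoop by a JPMML-based evaluator:

```
# Hypothetical sketch: train locally as we do today, then export to PMML.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Placeholder data standing in for our real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The estimator is wrapped in a PMMLPipeline but trained as usual.
pipeline = PMMLPipeline([("classifier", LogisticRegression())])
pipeline.fit(X, y)

# Write a PMML file that JPMML-based tooling on Hadoop can score.
sklearn2pmml(pipeline, "model.pmml")
```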