
[Investigate] Hadoop integration for ORES training
Closed, Resolved · Public

Description

See also T168913. There might be big gains if we're able to use CPUs efficiently when running multiple training jobs, for example when we have to rebuild models for all languages after introducing breaking changes to model serialization or other parameters.

@EBernhardson seems to be using xgboost, which includes Hadoop integration, to train models. He might have some experience to share with us.

It's possible to distribute our existing framework across Hadoop using pure Python; see PySpark and also https://ihadanny.wordpress.com/2014/12/01/python-virtualenv-with-pig-streaming/
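As a rough illustration of that option, here is a minimal sketch of fanning a cross-validation parameter sweep out over PySpark instead of a local process pool. The parameter grid, the classifier choice, and the synthetic data are stand-in assumptions, not revscoring's actual tuning code.

```python
# Hedged sketch (not the actual revscoring code): distribute CV over Spark.
from pyspark import SparkContext
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def evaluate(params, X, y):
    """Run 5-fold CV for one parameter set and return (params, mean score)."""
    model = GradientBoostingClassifier(**params)
    return params, cross_val_score(model, X, y, cv=5).mean()


if __name__ == "__main__":
    sc = SparkContext(appName="revscoring-tune-sketch")

    # Stand-in for the real feature observations.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_bc, y_bc = sc.broadcast(X), sc.broadcast(y)

    # Hypothetical parameter grid; the real tune config would come from a YAML.
    param_grid = [{"n_estimators": n, "max_depth": d}
                  for n in (100, 300, 500) for d in (3, 5, 7)]

    # Each executor trains/scores one parameter set; the driver keeps the best.
    results = (sc.parallelize(param_grid, len(param_grid))
                 .map(lambda p: evaluate(p, X_bc.value, y_bc.value))
                 .collect())
    best_params, best_score = max(results, key=lambda r: r[1])
    print(best_params, best_score)
```

Broadcasting the observation matrix once keeps each executor from re-reading it for every parameter set, which matters more as the grid grows.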

Finally, we might be able to train as before, but export the trained scikit-learn models as PMML and run the testing steps on Hadoop.
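One commonly used route for that export is the third-party sklearn2pmml package; the sketch below is an assumption about how it could look, not something ORES currently does, and the file name is made up.

```python
# Hedged sketch: train locally, serialize the fitted model to PMML so the
# evaluation/testing side can run it on Hadoop with a JVM PMML evaluator.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Stand-in data; in practice this would be the extracted feature observations.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

pipeline = PMMLPipeline([("classifier", GradientBoostingClassifier())])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "damaging.enwiki.pmml")
```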

Event Timeline

I toyed around a bit with adjusting revscoring tune to use Spark for the parallelization. Running against the enwiki damaging model on a mostly idle 40-core server, I get a runtime of 4m39s without Spark. With some hacked-in Spark integration that drops to 1m51s. Realistically, though, starting from about 5 minutes we're not going to squeeze out much more improvement, due to Amdahl's law: spinning up Spark and building the observation set takes about half of that remaining time.
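A quick back-of-the-envelope check using the numbers above (the "half of 1m51s is serial" estimate is taken from the comment, everything else follows from it):

```python
# Rough Amdahl's-law check on the numbers above (times in seconds).
baseline = 4 * 60 + 39           # 279s without Spark
with_spark = 1 * 60 + 51         # 111s with Spark on ~40 cores
serial = with_spark / 2          # ~55s of Spark startup + observation loading

floor = serial                   # runtime floor even with unlimited parallelism
max_speedup = baseline / serial  # ~5x is the most this approach can ever gain
print(f"floor ~ {floor:.0f}s, max speedup ~ {max_speedup:.1f}x")
```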

Looking at some results of the run, it also appears the 60s CV timeout doesn't work in the Spark version for some reason. I tried lowering it to 15s as a test, but some jobs still ran for >60s.
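If the in-process timeout really is being ignored inside Spark executors, one possible workaround (purely an assumption, not tied to how revscoring implements its timeout) is to enforce a hard limit from outside by running each CV job in a child process and killing it when the deadline passes:

```python
# Hypothetical workaround sketch: hard per-job timeout via a child process.
import multiprocessing as mp


def _run(func, args, queue):
    try:
        queue.put(("ok", func(*args)))
    except Exception as exc:      # report failures instead of hanging the parent
        queue.put(("error", exc))


def with_timeout(func, args, seconds):
    """Return func(*args), or None if it doesn't finish within `seconds`."""
    queue = mp.Queue()
    proc = mp.Process(target=_run, args=(func, args, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()          # hard kill: treat the job as timed out
        proc.join()
        return None
    status, value = queue.get()
    return value if status == "ok" else None
```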