
Evaluate training set sizes for LTR models
Closed, ResolvedPublic

Description

Train models with the default feature set and varied training set sizes to determine how much data we should use for generating production models.

Event Timeline

debt lowered the priority of this task from High to Medium. Jun 22 2017, 5:11 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

I started by creating a complete set of training data with no sampling; this is all the data we could possibly use from the last 90 days. With the new query normalization this comes out to ~420k normalized queries and 43M (query, hit_page_id) pairs. I split off 25% of these into a test set, then made multiple training sets, each using 33% of the previous training set.
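
For illustration, here is a minimal PySpark sketch of that splitting scheme, assuming the training data is a DataFrame keyed by a norm_query column (the names and the query-level split are my assumptions, not the actual mjolnir code):

```python
def split_by_query(df, test_fraction=0.25, seed=42):
    """Hold out a fraction of normalized queries, so every (query, hit_page_id)
    pair for a given query lands on the same side of the split."""
    queries = df.select('norm_query').distinct()
    train_q, test_q = queries.randomSplit([1.0 - test_fraction, test_fraction], seed=seed)
    return df.join(train_q, 'norm_query'), df.join(test_q, 'norm_query')


def nested_training_sets(train_df, levels=6, fraction=1.0 / 3, seed=42):
    """Return [full, ~1/3, ~1/9, ...] training sets, each sampling roughly a
    third of the queries from the previous one."""
    sets = [train_df]
    for i in range(levels):
        queries = (sets[-1].select('norm_query').distinct()
                   .sample(False, fraction, seed + i))
        sets.append(sets[-1].join(queries, 'norm_query'))
    return sets
```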

The baseline ndcg@10, for the results we historically displayed to users, was 0.8222.
Columns:

  • # of samples - The size of the training set
  • test-ndcg@10 - The ndcg@10 against the 10M sample hold-out set
  • diff previous - The improvement in test-ndcg@10 from the next smaller training set
  • improvement over baseline - The difference between historical ndcg and test-ndcg@10
  • improvement % - The improvement divided by (1 - baseline); basically, the fraction of the remaining gap between the baseline and a perfect score that the improvement closes (see the sketch after this list)
  • diff - The difference in improvement % from the next smaller training set
  • cv-ndcg@10 - The ndcg@10 reported by cross validation of the best parameters
  • cv-diff - The difference between the cv-ndcg@10 and test-ndcg@10. Positive means CV was over-optimistic
  • train time - How long it took to train the model. These numbers are not set in stone; they vary due to choices about how many CPUs to use when training. Also note that because of hyperparameter tuning and cross validation, this is actually the time to build 650 models.
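
To make the derived columns concrete, here is a small Python sketch of how they appear to be computed, using the 44k-sample row below (the table itself was built with a less-rounded baseline, so its percentages differ slightly):

```python
baseline = 0.8222        # historical ndcg@10 shown to users (rounded)
test_ndcg = 0.8265       # test-ndcg@10 for the 44k training set
cv_ndcg = 0.84579        # cv-ndcg@10 for the same run

improvement = test_ndcg - baseline              # improvement over baseline
improvement_pct = improvement / (1 - baseline)  # share of the remaining headroom to a perfect score
cv_diff = cv_ndcg - test_ndcg                   # positive means CV was over-optimistic

print(f"improvement={improvement:.4f} "
      f"improvement%={improvement_pct:.2%} "
      f"cv-diff={cv_diff:.4f}")
# improvement=0.0043 improvement%=2.42% cv-diff=0.0193
```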

Raw data:

| # of samples | test-ndcg@10 | diff previous | improvement | improvement % | diff | cv-ndcg@10 | cv-diff | train time |
|---|---|---|---|---|---|---|---|---|
| 44000 | 0.8265 | | 0.004255129328 | 2.39% | | 0.84579 | 0.01929 | 1:11:17 |
| 132000 | 0.83739734 | 0.01089734 | 0.01515246933 | 8.52% | 6.13% | 0.840388 | 0.00299066 | 0:56:40 |
| 396000 | 0.8440638 | 0.00666646 | 0.02181892933 | 12.27% | 3.75% | 0.847715 | 0.0036512 | 1:59:42 |
| 1193000 | 0.84886889 | 0.00480509 | 0.02662401933 | 14.98% | 2.70% | 0.8479 | -0.00096889 | 2:09:50 |
| 3574000 | 0.8514084 | 0.00253951 | 0.02916352933 | 16.41% | 1.43% | 0.849212 | -0.0021964 | 3:25:34 |
| 10718000 | 0.8535884 | 0.00218 | 0.03134352933 | 17.63% | 1.23% | 0.852435 | -0.0011534 | 6:54:17 |
| 32153000 | 0.8560576 | 0.0024692 | 0.03381272933 | 19.02% | 1.39% | 0.8549244 | -0.0011332 | 10:38:08 |

Graph:

  • This is using a log scale on the X axis
  • Image: mjolnir training set size comparison (480×640 px)

Conclusions:

  • Not sure. The training time on 32M samples is frankly pretty excessive; it took half the Hadoop cluster to achieve the almost-11-hour runtime. In total it used about 209 days of CPU time.

I was curious, so I ran an analysis of the final models for ndcg@3 as well.

baseline ndcg@3: 0.7757236139

| # of samples | ndcg@3 | diff | improvement | improvement % | diff |
|---|---|---|---|---|---|
| 44000 | 0.773285 | | -0.002438613888 | -1.09% | |
| 132000 | 0.789166 | 0.015881 | 0.01344238611 | 5.99% | 7.08% |
| 396000 | 0.797937 | 0.008771 | 0.02221338611 | 9.90% | 3.91% |
| 1193000 | 0.804357 | 0.00642 | 0.02863338611 | 12.77% | 2.86% |
| 3574000 | 0.807802 | 0.003445 | 0.03207838611 | 14.30% | 1.54% |
| 10718000 | 0.810789 | 0.002987 | 0.03506538611 | 15.63% | 1.33% |
| 32153000 | 0.813966 | 0.003177 | 0.03824238611 | 17.05% | 1.42% |

Thanks for all this data, @EBernhardson!

Search improvement is generally a game of inches, and we have millions of users, so every 1% improvement really does matter. We know that ndcg@3 might be a better metric for what users are going to care about, while ndcg@10 lets us see the bigger picture.

How excessive is 11 hours, really? It's surely annoying now, but will it be in production? Do we have a rough idea of how often we are going to be building models? If it is going to be a daily occurrence, then 11 hours is insane. If it is quarterly (or less frequent), does letting it run overnight really matter? I know eventually we have to extrapolate out to include other wikis, too. But if we had a nice automated pipeline that would extract, build, evaluate, and deploy new models, then 10-20 hours x half the cluster of training time per week doesn't seem entirely ridiculous.

That said, if it were necessary, we could get by with the 3.6M or 10.7M training set sizes. (BTW, is it possible to get substantially better run times on the smaller sets if you give them the resources of the larger sets? You said the config varied a bit, but there must be some point where the benefit of the parallelism maxes out during optimization.)

It certainly makes sense to test options (training regimes, new features, etc.) on a smaller set to get a sense of the impact, and then use/test the best-seeming option(s) on the larger set for training a final model.

So, I'd say we could get away with using the 3.6M or 10.7M training sets for training production models if training time dictated it, and the 1.2M set would probably make sense for testing new options/configs. I'd still love to see us use all the data if it actually helps.

> How excessive is 11 hours, really? It's surely annoying now, but will it be in production? Do we have a rough idea of how often we are going to be building models? If it is going to be a daily occurrence, then 11 hours is insane. If it is quarterly (or less frequent), does letting it run overnight really matter? I know eventually we have to extrapolate out to include other wikis, too. But if we had a nice automated pipeline that would extract, build, evaluate, and deploy new models, then 10-20 hours x half the cluster of training time per week doesn't seem entirely ridiculous.

It certainly depends on how many wikis we are training models for, but indeed a weekly automated job that takes 11 hours is not the end of the world. For feature engineering, though, 11 hours is far too long a feedback cycle. Hopefully improvements from feature engineering at 1M samples will extrapolate to 40M samples.

> That said, if it were necessary, we could get by with the 3.6M or 10.7M training set sizes. (BTW, is it possible to get substantially better run times on the smaller sets if you give them the resources of the larger sets? You said the config varied a bit, but there must be some point where the benefit of the parallelism maxes out during optimization.)

For reference, these look like the settings I used for each run. Note that each executor has 4 CPU cores; the last column is total core-seconds of cluster time per training sample (see the sketch after the table).

| samples | executors per model | models in parallel | total cores | runtime | (runtime * cores) / samples |
|---|---|---|---|---|---|
| 44k | 2 | 5 | 40 | 1:11:17 | 3.88s |
| 132k | 3 | 5 | 60 | 0:56:40 | 1.54s |
| 396k | 4 | 5 | 80 | 1:59:42 | 1.45s |
| 1193k | 5 | 5 | 100 | 2:09:50 | 0.65s |
| 3574k | 7 | 5 | 140 | 3:25:34 | 0.48s |
| 10718k | 10 | 5 | 200 | 6:54:17 | 0.46s |
| 32153k | 23 | 5 | 460 | 10:38:08 | 0.55s |
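
As a sanity check of the last column, here is a quick sketch for the 3574k row, reading it as total core-seconds of cluster time per training sample (my interpretation, but the numbers line up):

```python
runtime_s = 3 * 3600 + 25 * 60 + 34   # 3:25:34 -> 12334 seconds
total_cores = 7 * 5 * 4               # executors per model * models in parallel * 4 cores per executor
samples = 3_574_000

print(round(runtime_s * total_cores / samples, 2))   # 0.48
```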

I spent half the day figuring out how to get hyperopt to run trials in parallel, and then the other half of the day solving a race condition it exposed in xgboost4j-spark. Fun! We can possibly run 4 iterations in parallel and get the 1.2M sample down to 30-40 minutes. It might also be that just assigning more executors to individual model training will help, but I haven't run the analysis on that. Perhaps I should do that first to get an idea of how training individual models scales with the number of cores provided; so far I've just been guessing.

> So, I'd say we could get away with using the 3.6M or 10.7M training sets for training production models if training time dictated it, and the 1.2M set would probably make sense for testing new options/configs. I'd still love to see us use all the data if it actually helps.

I suppose I'll plan for production models to be built against all the data, and feature engineering to run against smaller sample sizes. Hopefully improvements in features of the smaller models will translate into improvements on the model with more input data.

Not directly related to this test, but some feature importance graphs from the best 32M sample model:

xgboost_enwiki_32M_feature_importance (497×1 px, 24 KB)

  • weight - The number of splits utilizing this feature
  • gain - The average quality of a split using this feature. This might be affected by cover; basically, if the split affects few samples then the gain is small even if it does a good job of splitting that small number of samples, but I'm not completely sure.
  • cover - Related to the average # of samples split by this feature

This was calculated with: https://paws-public.wmflabs.org/paws-public/User:EBernhardson_(WMF)/XGBoost%20enwiki%2032M%20sample%20feature%20importance.ipynb
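
For anyone wanting to pull the same numbers outside that notebook, here is a minimal sketch using xgboost's Python API (the model file path is hypothetical; the notebook linked above is the authoritative version):

```python
import xgboost as xgb

bst = xgb.Booster(model_file='enwiki_32M.xgb')   # hypothetical path to the trained model

# get_score supports the same three importance types discussed above.
importance = {
    kind: bst.get_score(importance_type=kind)    # {feature_name: score}
    for kind in ('weight', 'gain', 'cover')
}

# Features sorted by gain, alongside their weight and cover.
for feat, gain in sorted(importance['gain'].items(), key=lambda kv: -kv[1]):
    print(f"{feat}\tweight={importance['weight'].get(feat, 0)}"
          f"\tgain={gain:.3f}\tcover={importance['cover'].get(feat, 0):.1f}")
```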

Running 6 CVs in parallel, using a peak of 150 executors (600 cores), the 1.2M training is down to 28 minutes. I was a little worried hyperopt wouldn't do as good a job because it had fewer iterations to select the next training point on (because it is 6 iterations ahead of the results), but the final cv-ndcg@10 is within .0004 of the previous run. ndcg@10 on the hold-out 10M test set is within .0003 and ndcg@3 is within .0002. This could be slightly worse hyperparameters being selected, or it could be within the bounds of error.

Again, thanks for all the data, Erik!

Sounds quite reasonable to be able to do feature engineering on ~1M and final training on the full set.

The weight/gain/cover chart is interesting. all_near_match is very interesting: very low weight, max gain and cover. My guess is that it is one of the first splits in every tree. incoming_links is the opposite, so it's probably making many, many little splits near the leaves of the tree. Neat!

> gain - The average quality of a split using this feature. This might be affected by cover; basically, if the split affects few samples then the gain is small even if it does a good job of splitting that small number of samples, but I'm not completely sure.

Thinking about this some more, I think what actually happens is that they are related, but not so directly. Basically, when a split happens near the bottom of the tree, most of the possible gain has already been captured by previous splits, so the possible gain from that split is fairly small. Cover is also small at the bottom of the tree because the samples have been split so many times.
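
One rough way to check that intuition on a trained model is to average Gain and Cover by split depth; both should shrink toward the bottom of the trees. A sketch, assuming a recent xgboost with Booster.trees_to_dataframe() and a trained booster `bst`:

```python
nodes = bst.trees_to_dataframe()                   # one row per node: Tree, ID, Feature, Gain, Cover, Yes, No, ...
splits = nodes[nodes['Feature'] != 'Leaf'].copy()  # for leaf rows, 'Gain' holds the leaf value instead

# Assign a depth to every split node by walking down from each tree's root ("<tree>-0").
depth = {}
for tree_id, tree in splits.groupby('Tree'):
    children = tree.set_index('ID')[['Yes', 'No']]
    frontier = [(f'{tree_id}-0', 0)]
    while frontier:
        node_id, d = frontier.pop()
        depth[node_id] = d
        if node_id in children.index:              # leaf IDs never appear here, so the walk stops there
            yes_id, no_id = children.loc[node_id]
            frontier += [(yes_id, d + 1), (no_id, d + 1)]

splits['Depth'] = splits['ID'].map(depth)
print(splits.groupby('Depth')[['Gain', 'Cover']].mean())
```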