
Evaluate training set sizes for LTR models
Closed, ResolvedPublic

Description

Train models with the default feature set and varied training set sizes to determine how much data we should use for generating production models.

Event Timeline

debt lowered the priority of this task from High to Medium. Jun 22 2017, 5:11 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

I started by creating a complete set of training data with no sampling; this is all the data we could possibly use from the last 90 days. With the new query normalization this comes out to ~420k normalized queries and 43M (query, hit_page_id) pairs. I split off 25% of these into a test set, then made multiple training sets, each using 33% of the previous training set.
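
For illustration, here is a minimal PySpark sketch of that splitting scheme, assuming the training data is a DataFrame keyed by a norm_query column (the names and the query-level split are my assumptions, not the actual mjolnir code):

```python
def split_by_query(df, test_fraction=0.25, seed=42):
    """Hold out a fraction of normalized queries, so every (query, hit_page_id)
    pair for a given query lands on the same side of the split."""
    queries = df.select('norm_query').distinct()
    train_q, test_q = queries.randomSplit([1.0 - test_fraction, test_fraction], seed=seed)
    return df.join(train_q, 'norm_query'), df.join(test_q, 'norm_query')


def nested_training_sets(train_df, levels=6, fraction=1.0 / 3, seed=42):
    """Return [full, ~1/3, ~1/9, ...] training sets, each sampling roughly a
    third of the queries from the previous one."""
    sets = [train_df]
    for i in range(levels):
        queries = (sets[-1].select('norm_query').distinct()
                   .sample(False, fraction, seed + i))
        sets.append(sets[-1].join(queries, 'norm_query'))
    return sets
```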

The baseline ndcg@10, for the results we historically displayed to users, was 0.8222.
Columns:

  • # of samples - The size of the training set
  • test-ndcg@10 - The ndcg@10 against the 10M sample hold-out set
  • diff previous - The improvement in test-ndcg@10 from the next smaller training set
  • improvement over baseline - The difference between historical ndcg and test-ndcg@10
  • improvement % - The improvement divided by (1 - baseline); basically, the fraction of the remaining gap between the baseline and a perfect score that the improvement closes (see the sketch after this list)
  • diff - The difference in improvement % from the next smaller training set
  • cv-ndcg@10 - The ndcg@10 reported by cross validation of the best parameters
  • cv-diff - The difference between the cv-ndcg@10 and test-ndcg@10. Positive means CV was over-optimistic
  • train time - How long it took to train the model. These numbers are not set in stone; they vary due to choices about how many CPUs to use when training. Also note that because of hyperparameter tuning and cross validation, this is actually the time to build 650 models.
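
To make the derived columns concrete, here is a small Python sketch of how they appear to be computed, using the 44k-sample row below (the table itself was built with a less-rounded baseline, so its percentages differ slightly):

```python
baseline = 0.8222        # historical ndcg@10 shown to users (rounded)
test_ndcg = 0.8265       # test-ndcg@10 for the 44k training set
cv_ndcg = 0.84579        # cv-ndcg@10 for the same run

improvement = test_ndcg - baseline              # improvement over baseline
improvement_pct = improvement / (1 - baseline)  # share of the remaining headroom to a perfect score
cv_diff = cv_ndcg - test_ndcg                   # positive means CV was over-optimistic

print(f"improvement={improvement:.4f} "
      f"improvement%={improvement_pct:.2%} "
      f"cv-diff={cv_diff:.4f}")
# improvement=0.0043 improvement%=2.42% cv-diff=0.0193
```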

Raw data:

| # of samples | test-ndcg@10 | diff previous | improvement | improvement % | diff | cv-ndcg@10 | cv-diff | train time |
|---|---|---|---|---|---|---|---|---|
| 44000 | 0.8265 | | 0.004255129328 | 2.39% | | 0.84579 | 0.01929 | 1:11:17 |
| 132000 | 0.83739734 | 0.01089734 | 0.01515246933 | 8.52% | 6.13% | 0.840388 | 0.00299066 | 0:56:40 |
| 396000 | 0.8440638 | 0.00666646 | 0.02181892933 | 12.27% | 3.75% | 0.847715 | 0.0036512 | 1:59:42 |
| 1193000 | 0.84886889 | 0.00480509 | 0.02662401933 | 14.98% | 2.70% | 0.8479 | -0.00096889 | 2:09:50 |
| 3574000 | 0.8514084 | 0.00253951 | 0.02916352933 | 16.41% | 1.43% | 0.849212 | -0.0021964 | 3:25:34 |
| 10718000 | 0.8535884 | 0.00218 | 0.03134352933 | 17.63% | 1.23% | 0.852435 | -0.0011534 | 6:54:17 |
| 32153000 | 0.8560576 | 0.0024692 | 0.03381272933 | 19.02% | 1.39% | 0.8549244 | -0.0011332 | 10:38:08 |

Graph:

  • This is using a log scale on the X axis
  • Image: mjolnir training set size comparison (480×640 px)

Conclusions:

  • Not sure. The training time on 32M samples is frankly pretty excessive; it took half the Hadoop cluster to achieve the almost-11-hour runtime. In total it used about 209 days of CPU time.

I was curious, so I ran an analysis of the final models for ndcg@3 as well.

baseline ndcg@3: 0.7757236139

| # of samples | ndcg@3 | diff | improvement | improvement % | diff |
|---|---|---|---|---|---|
| 44000 | 0.773285 | | -0.002438613888 | -1.09% | |
| 132000 | 0.789166 | 0.015881 | 0.01344238611 | 5.99% | 7.08% |
| 396000 | 0.797937 | 0.008771 | 0.02221338611 | 9.90% | 3.91% |
| 1193000 | 0.804357 | 0.00642 | 0.02863338611 | 12.77% | 2.86% |
| 3574000 | 0.807802 | 0.003445 | 0.03207838611 | 14.30% | 1.54% |
| 10718000 | 0.810789 | 0.002987 | 0.03506538611 | 15.63% | 1.33% |
| 32153000 | 0.813966 | 0.003177 | 0.03824238611 | 17.05% | 1.42% |

Thanks for all this data, @EBernhardson!

Search improvement is generally a game of inches, and we have millions of users, so every 1% improvement really does matter. We know that ndcg@3 might be a better metric for what users are going to care about, while ndcg@10 lets us see the bigger picture.

How excessive is 11 hours, really? It's surely annoying now, but will it be in production? Do we have a rough idea of how often we are going to be building models? If it is going to be a daily occurrence, then 11 hours is insane. If it is quarterly (or less frequent), does letting it run overnight really matter? I know eventually we have to extrapolate out to include other wikis, too. But if we had a nice automated pipeline that would extract, build, evaluate, and deploy new models, then 10-20 hours x half the cluster of training time per week doesn't seem entirely ridiculous.

That said, if it were necessary, we could get by with the 3.6M or 10.7M training set sizes. (BTW, is it possible to get substantially better run times on the smaller sets if you give them the resources of the larger sets? You said the config varied a bit, but there must be some point where the benefit of the parallelism maxes out during optimization.)

It certainly makes sense to test options (training regimes, new features, etc.) on a smaller set to get a sense of the impact, and then use/test the best-seeming option(s) on the larger set for training a final model.

So, I'd say we could get away with using the 3.6M or 10.7M training sets for training production models if training time dictated it, and the 1.2M set would probably make sense for testing new options/configs. I'd still love to see us use all the data if it actually helps.

> How excessive is 11 hours, really? It's surely annoying now, but will it be in production? Do we have a rough idea of how often we are going to be building models? If it is going to be a daily occurrence, then 11 hours is insane. If it is quarterly (or less frequent), does letting it run overnight really matter? I know eventually we have to extrapolate out to include other wikis, too. But if we had a nice automated pipeline that would extract, build, evaluate, and deploy new models, then 10-20 hours x half the cluster of training time per week doesn't seem entirely ridiculous.

It certainly depends on how many wikis we are training models for, but indeed a weekly automated job that takes 11 hours is not the end of the world. For feature engineering, though, 11 hours is far too long a feedback cycle. Hopefully improvements from feature engineering at 1M samples will extrapolate to 40M samples.

> That said, if it were necessary, we could get by with the 3.6M or 10.7M training set sizes. (BTW, is it possible to get substantially better run times on the smaller sets if you give them the resources of the larger sets? You said the config varied a bit, but there must be some point where the benefit of the parallelism maxes out during optimization.)

For reference, these look like the settings I used for each run. Note that each executor has 4 CPU cores; the last column is total core-seconds of cluster time per training sample (see the sketch after the table).

| samples | executors per model | models in parallel | total cores | runtime | (runtime * cores) / samples |
|---|---|---|---|---|---|
| 44k | 2 | 5 | 40 | 1:11:17 | 3.88s |
| 132k | 3 | 5 | 60 | 0:56:40 | 1.54s |
| 396k | 4 | 5 | 80 | 1:59:42 | 1.45s |
| 1193k | 5 | 5 | 100 | 2:09:50 | 0.65s |
| 3574k | 7 | 5 | 140 | 3:25:34 | 0.48s |
| 10718k | 10 | 5 | 200 | 6:54:17 | 0.46s |
| 32153k | 23 | 5 | 460 | 10:38:08 | 0.55s |
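
As a sanity check of the last column, here is a quick sketch for the 3574k row, reading it as total core-seconds of cluster time per training sample (my interpretation, but the numbers line up):

```python
runtime_s = 3 * 3600 + 25 * 60 + 34   # 3:25:34 -> 12334 seconds
total_cores = 7 * 5 * 4               # executors per model * models in parallel * 4 cores per executor
samples = 3_574_000

print(round(runtime_s * total_cores / samples, 2))   # 0.48
```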

I spent half the day figuring out how to get hyperopt to run trials in parallel, and then the other half of the day solving a race condition it exposed in xgboost4j-spark. Fun! We can possibly run 4 iterations in parallel and get the 1.2M sample down to 30-40 minutes. It might also be that just assigning more executors to individual model training will help, but I haven't run the analysis on that. Perhaps I should do that first to get an idea of how training individual models scales with the number of cores provided; so far I've just been guessing.

> So, I'd say we could get away with using the 3.6M or 10.7M training sets for training production models if training time dictated it, and the 1.2M set would probably make sense for testing new options/configs. I'd still love to see us use all the data if it actually helps.

I suppose I'll plan for production models to be built against all the data, and feature engineering to run against smaller sample sizes. Hopefully improvements in features of the smaller models will translate into improvements on the model with more input data.

Not directly related to this test, but some feature importance graphs from the best 32M sample model:

xgboost_enwiki_32M_feature_importance (497×1 px, 24 KB)

  • weight - The number of splits utilizing this feature
  • gain - The average quality of a split using this feature. This might be affected by cover; basically, if the split affects few samples then the gain is small even if it does a good job of splitting that small number of samples, but I'm not completely sure.
  • cover - Related to the average # of samples split by this feature

This was calculated with: https://paws-public.wmflabs.org/paws-public/User:EBernhardson_(WMF)/XGBoost%20enwiki%2032M%20sample%20feature%20importance.ipynb
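
For anyone wanting to pull the same numbers outside that notebook, here is a minimal sketch using xgboost's Python API (the model file path is hypothetical; the notebook linked above is the authoritative version):

```python
import xgboost as xgb

bst = xgb.Booster(model_file='enwiki_32M.xgb')   # hypothetical path to the trained model

# get_score supports the same three importance types discussed above.
importance = {
    kind: bst.get_score(importance_type=kind)    # {feature_name: score}
    for kind in ('weight', 'gain', 'cover')
}

# Features sorted by gain, alongside their weight and cover.
for feat, gain in sorted(importance['gain'].items(), key=lambda kv: -kv[1]):
    print(f"{feat}\tweight={importance['weight'].get(feat, 0)}"
          f"\tgain={gain:.3f}\tcover={importance['cover'].get(feat, 0):.1f}")
```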

Running 6 CVs in parallel, using a peak of 150 executors (600 cores), the 1.2M training is down to 28 minutes. I was a little worried hyperopt wouldn't do as good a job because it had fewer iterations to select the next training point on (because it is 6 iterations ahead of the results), but the final cv-ndcg@10 is within .0004 of the previous run. ndcg@10 on the hold-out 10M test set is within .0003 and ndcg@3 is within .0002. This could be slightly worse hyperparameters being selected, or it could be within the bounds of error.

Again, thanks for all the data, Erik!

Sounds quite reasonable to be able to do feature engineering on ~1M and final training on the full set.

The weight/gain/cover chart is interesting. all_near_match is very interesting: very low weight, max gain and cover. My guess is that it is one of the first splits in every tree. incoming_links is the opposite, so it's probably making many, many little splits near the leaves of the tree. Neat!

> gain - The average quality of a split using this feature. This might be affected by cover; basically, if the split affects few samples then the gain is small even if it does a good job of splitting that small number of samples, but I'm not completely sure.

Thinking about this some more, I think what actually happens is that they are related, but not so directly. Basically, when a split happens near the bottom of the tree, most of the possible gain has already been captured by previous splits, so the possible gain from that split is fairly small. Cover is also small at the bottom of the tree because the samples have been split so many times.
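
One rough way to check that intuition on a trained model is to average Gain and Cover by split depth; both should shrink toward the bottom of the trees. A sketch, assuming a recent xgboost with Booster.trees_to_dataframe() and a trained booster `bst`:

```python
nodes = bst.trees_to_dataframe()                   # one row per node: Tree, ID, Feature, Gain, Cover, Yes, No, ...
splits = nodes[nodes['Feature'] != 'Leaf'].copy()  # for leaf rows, 'Gain' holds the leaf value instead

# Assign a depth to every split node by walking down from each tree's root ("<tree>-0").
depth = {}
for tree_id, tree in splits.groupby('Tree'):
    children = tree.set_index('ID')[['Yes', 'No']]
    frontier = [(f'{tree_id}-0', 0)]
    while frontier:
        node_id, d = frontier.pop()
        depth[node_id] = d
        if node_id in children.index:              # leaf IDs never appear here, so the walk stops there
            yes_id, no_id = children.loc[node_id]
            frontier += [(yes_id, d + 1), (no_id, d + 1)]

splits['Depth'] = splits['ID'].map(depth)
print(splits.groupby('Depth')[['Gain', 'Cover']].mean())
```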