
Experiment with varying MLR training hyperparameter space
Closed, Resolved · Public

Description

Maybe this should be broken up more, but here are a few ideas to investigate from our weekly relevance meeting. The goal is to get a better understanding of our hyperparameter space and what it should be. One additional difficulty is that this might need to be tested on both xgboost and lightgbm, as they behave differently.

  • Experiment with variations in the number of trees for small wikis: does the current setting of 500 help? Is it better to have more trees with fewer leaves, or fewer trees with more leaves? This also needs to be tested on different dataset sizes. Training params: total_leafs and number of trees, with num_leafs = total_leafs / num_trees (see the sketch after this list)
  • Train the same data with the same hyperparameter space multiple times (3? 5?) to get an idea of expected variance
  • Train with more hyperopt iterations (300? 500?), to see if continued search is beneficial.
  • Expand the space searched for individual hyperparameters until it is clear that quality degrades at the edges. Some parameters seem to be "bumping up" against their limits.
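
To make the trees/leaves trade-off in the first bullet concrete, here's a minimal sketch of how the search could be parameterized by a fixed total leaf budget; the parameter names and ranges are illustrative, not mjolnir's actual configuration keys:

```python
from hyperopt import hp

# Illustrative sketch only; these are not mjolnir's actual config keys.
# Fix a total leaf budget and let hyperopt choose how to split it
# between many small trees and few large ones.
TOTAL_LEAFS = 8000

space = {
    'num_trees': hp.quniform('num_trees', 100, 1000, 50),
}

def leaves_per_tree(num_trees, total_leafs=TOTAL_LEAFS):
    # num_leafs = total_leafs / num_trees, as in the first bullet
    return max(2, int(total_leafs / num_trees))
```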

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 31 2018, 6:41 PM
EBernhardson renamed this task from "Experiment with varying the" to "Experiment with varying MLR training hyperparameter space". · Jan 31 2018, 6:43 PM
EBernhardson updated the task description.

First up: train the same data with the same hyperparameter space multiple times (3) to get an idea of expected variance.

Five runs would have been nice, but even three took a significant amount of resources (which unfortunately collided with the end-of-month jobs).

min diff: 0.0000
max diff: 0.0016
mean diff: 0.0005

| wiki   | obs (k) | trials | min     | max     | diff   | std    |
|--------|---------|--------|---------|---------|--------|--------|
| enwiki | 34357   | 3      | -0.8391 | -0.8391 | 0.0000 | 0.0000 |
| frwiki | 9199    | 2      | -0.8554 | -0.8553 | 0.0001 | 0.0000 |
| dewiki | 20146   | 3      | -0.8568 | -0.8567 | 0.0001 | 0.0000 |
| ruwiki | 9413    | 2      | -0.8381 | -0.8380 | 0.0001 | 0.0000 |
| itwiki | 4930    | 2      | -0.8676 | -0.8675 | 0.0001 | 0.0000 |
| nlwiki | 1348    | 3      | -0.8707 | -0.8705 | 0.0001 | 0.0001 |
| zhwiki | 2381    | 3      | -0.8418 | -0.8416 | 0.0002 | 0.0001 |
| arwiki | 3579    | 3      | -0.7299 | -0.7296 | 0.0002 | 0.0001 |
| ptwiki | 8652    | 2      | -0.7791 | -0.7787 | 0.0004 | 0.0002 |
| jawiki | 3083    | 3      | -0.8791 | -0.8787 | 0.0004 | 0.0002 |
| kowiki | 562     | 2      | -0.8493 | -0.8488 | 0.0004 | 0.0002 |
| fiwiki | 454     | 3      | -0.8492 | -0.8488 | 0.0004 | 0.0002 |
| plwiki | 1755    | 3      | -0.8805 | -0.8799 | 0.0006 | 0.0002 |
| hewiki | 184     | 2      | -0.8406 | -0.8399 | 0.0006 | 0.0003 |
| idwiki | 598     | 3      | -0.8176 | -0.8169 | 0.0007 | 0.0003 |
| fawiki | 654     | 3      | -0.7344 | -0.7335 | 0.0009 | 0.0004 |
| svwiki | 354     | 3      | -0.8808 | -0.8797 | 0.0011 | 0.0004 |
| nowiki | 128     | 3      | -0.8869 | -0.8856 | 0.0013 | 0.0006 |
| viwiki | 342     | 3      | -0.8110 | -0.8094 | 0.0016 | 0.0007 |

This isn't too far off from a list of wikis ordered by number of available observations. It seems like on the largest wikis an improvement of 0.0005 or more might be meaningful, whereas on smaller wikis we might want to see 0.001 or 0.0015. These numbers are actually a good bit smaller than I was expecting.
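
For reference, the per-wiki stats above boil down to something like the following; the two example score lists are taken from the kowiki and hewiki rows of the table, and whether std is the sample or population standard deviation is my assumption:

```python
import statistics

# Best cv-test-ndcg@10 (negated, as hyperopt minimizes) from each
# repeated run; the two lists below are the kowiki and hewiki rows.
scores = {
    'kowiki': [-0.8493, -0.8488],
    'hewiki': [-0.8406, -0.8399],
}

for wiki, trials in scores.items():
    diff = max(trials) - min(trials)
    # assuming population std; the table doesn't say which was used
    std = statistics.pstdev(trials)
    print(f"{wiki}\t{len(trials)}\t{min(trials):.4f}\t{max(trials):.4f}"
          f"\t{diff:.4f}\t{std:.4f}")
```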

EBernhardson added a comment. (Edited) · Feb 7 2018, 3:19 AM

Next up: Experiment with variations in number of trees for small wikis, does current setting of 500 help?

Ran 400 iterations of hyperopt against kowiki (~562k observations, on the small side) with a total number of leafs from 8000 to 64000, along with the normal hyperparameters, setting the number of trees (boosting iterations) based on leafs >> max_depth. Best cv-test-ndcg@10 was 0.8497; the previous best without tuning leaf counts was 0.8493. The real kicker: this was achieved with 8014 leafs and a max_depth of 4, which works out to 500 trees, exactly the value that was arbitrarily chosen before this test. That bumps right up against the lower limit of the search range, so it might be worth searching a little further, but I expect any improvement would still be minimal.
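
Spelling out the leafs >> max_depth bookkeeping: a fully grown tree of depth d has 2^d leaves, so the tree count implied by a total leaf budget is an integer right-shift.

```python
def implied_num_trees(total_leafs: int, max_depth: int) -> int:
    # a fully grown tree of depth d has 2**d leaves
    return total_leafs >> max_depth  # == total_leafs // 2**max_depth

# The winning kowiki point: 8014 leafs at max_depth 4 -> 500 trees,
# matching the previously hard-coded value.
assert implied_num_trees(8014, 4) == 500
```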

This does lead me to a question, though: should we add ndcg@3 and ndcg@1 metrics to the standard evaluations? It's possible there is improvement that is limited to only the top results.
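
If we did, it should be a small config change; as a rough sketch, xgboost accepts multiple ndcg cutoffs as eval metrics (lightgbm has the equivalent ndcg_eval_at parameter). Dataset construction, with query groups set for ranking, is elided here:

```python
# Sketch: evaluation setup only; dtrain/dtest construction is elided.
params = {
    'objective': 'rank:ndcg',
    'eval_metric': ['ndcg@10', 'ndcg@3', 'ndcg@1'],
}
# bst = xgboost.train(params, dtrain, evals=[(dtest, 'cv-test')])
```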

Overall this doesn't seem to need further investigation. There may be a minor optimization available, but it's probably not worth spending much effort or sacrificing training speed for.

EBernhardson added a comment. (Edited) · Feb 7 2018, 3:23 AM

Next up: Train with more hyperopt iterations (300? 500?), to see if continued search is beneficial.

Didn't directly test this, but dug through previous records of training runs. We typically do 150 iterations of hyperopt, and I'm seeing on average 15 iterations that report cv-test-ndcg@10 values within 0.0010 of the best value. This suggests to me that the cv-test-ndcg@10 found is bumping into the limits of the current training data, and we could even consider dropping the iteration count to 80 or 100. Additionally, I ran the above test (tuning by leafs) for 400 iterations, and while the value it found was slightly higher, it wasn't high enough to be a significant result.
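
The near-best count described above is easy to pull from a tuning run; a sketch, assuming the per-iteration losses (negated cv-test-ndcg@10) are available, e.g. via hyperopt's Trials.losses():

```python
def count_near_best(losses, tol=0.0010):
    # losses are negated cv-test-ndcg@10, so the best value is the minimum
    best = min(losses)
    return sum(1 for loss in losses if loss - best <= tol)

# e.g. with trials = hyperopt.Trials() passed to fmin:
# count_near_best([l for l in trials.losses() if l is not None])
```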

Adding more features to the feature set may change this, and we should revisit once we add new features and have a few runs' worth of training history to evaluate.

My suggestion here is that perhaps we *lower* the number of iterations, as we seem to be doing more than necessary. It would also be useful to add a utility script to mjolnir that walks through a directory of training-result subdirectories and reports stats across runs (rough sketch below).
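
A rough sketch of what that utility might look like; the tune.json filename and its wiki/best_score fields are hypothetical placeholders for whatever the training runs actually write out:

```python
import json
import statistics
import sys
from pathlib import Path

def report(base_dir):
    # Collect the best score from each run, grouped by wiki. The
    # 'tune.json' name and its 'wiki'/'best_score' fields are made up.
    by_wiki = {}
    for path in Path(base_dir).glob('*/tune.json'):
        run = json.loads(path.read_text())
        by_wiki.setdefault(run['wiki'], []).append(run['best_score'])
    for wiki, scores in sorted(by_wiki.items()):
        std = statistics.pstdev(scores) if len(scores) > 1 else 0.0
        print(f"{wiki}\t{len(scores)}\t{min(scores):.4f}"
              f"\t{max(scores):.4f}\t{std:.4f}")

if __name__ == '__main__':
    report(sys.argv[1])
```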

TJones added a subscriber: TJones.
debt closed this task as Resolved. · Mar 1 2018, 6:47 PM