Page MenuHomePhabricator

Evaluate ensemble pruning algorithms
Closed, ResolvedPublic

Description

There are several ways to prune an ensemble of decision trees, but one very simple algorithm worth trying (based on a paper, but i can't seem to find it...)

  • Take ensemble of N trees
  • Calculate ndcg@10 for all ensembles of N-1 in size
  • Choose the N - 1 ensemble with the highest score, remove that tree from the ensemble
  • Repeat

Event Timeline

I implemented a hackish version of this and ran it against a 100 tree model that was trained against 32M observations. Removing trees changes the score as follows.
Not NDCG@10 started at 0.85267 before any pruning occured. Removing trees towards the beginning actually improves the quality of the model, to a max of 0.85452.
after removing 18 trees from the ensemble. We can actually remove 57 trees to get a score of 0.85262, which is basically where we started. I still need to verify this doesn't have any bugs, and it takes a pretty long time to run, but it has the possibility to improve both the prediction speed and quality.

original0.8526755752893869
tree_20.8534814588322
tree_00.8538909954020505
tree_110.8541095838822635
tree_210.8542634118538496
tree_60.8543745280649098
tree_290.8544050371020617
tree_450.8544274826847199
tree_100.8544255519937989
tree_370.8544375869530494
tree_180.8544408309599962
tree_340.8544474561717582
tree_150.8544725794302217
tree_360.8544827545143275
tree_420.8544928642262134
tree_890.8544738462653381
tree_240.8544644147082961
tree_90.854504026449298
tree_530.8545170510360092
tree_930.8545117159368455
tree_270.8545035215863562
tree_950.8544685327050491
tree_320.8544535422863354
tree_640.8544287328329437
tree_550.854395964408186
tree_960.854369822301337
tree_770.8543500151487167
tree_910.854305863199338
tree_140.8542666708971052
tree_410.8542712990849507
tree_700.8542299714553042
tree_70.8542042444488114
tree_580.8541939735330666
tree_490.8541717035392439
tree_130.8541356953790398
tree_510.8541207964722727
tree_970.8540598286533062
tree_990.854015412606587
tree_710.8539626581870964
tree_300.8539199821911418
tree_810.8538999867708735
tree_800.8538461584212078
tree_830.853780443966831
tree_750.853716799079589
tree_120.8536494807128545
tree_460.8535996969085202
tree_440.8535830877257539
tree_650.853513803103564
tree_760.8534410623727549
tree_980.8533755446461535
tree_250.8533070197802961
tree_940.853235075005165
tree_600.8531295188375054
tree_720.8530457529937424
tree_170.8529358390428995
tree_310.8528620763247267
tree_860.8527532003555717
tree_780.8526213412058438
tree_30.8525442344947167
tree_840.852476666779801
tree_920.8523797543642866
tree_900.8522665609566459
tree_590.8521575076764754
tree_740.8520365278367638
tree_480.8519237583555358
tree_880.8517729362227421
tree_790.8515837661945378
tree_560.8513936028459914
tree_570.8511900407064946
tree_230.8509923770767224
tree_200.8508161664007255
tree_10.8508919775646772
tree_500.8508476731761839
tree_850.8506608807573397
tree_820.850436665692155
tree_260.8502532154684804
tree_630.850058777380352
tree_690.849829952828476
tree_680.8495063133023955
tree_870.8491783598531001
tree_190.8488870821504623
tree_610.8485117174089658
tree_660.8480505048222637
tree_40.8477295548028438
tree_380.8477711347061445
tree_390.8477079286163248
tree_730.847305095349432
tree_620.8467135711435149
tree_470.845966782660598
tree_670.8450037023029577
tree_280.8440031773679915
tree_540.842850920442764
tree_520.84124295635184
tree_80.8395389344769163
tree_400.8368481908112617
tree_430.8333844979916419
tree_350.829054869132139
tree_220.8221793380482011
tree_330.811557090832345
tree_160.7735804255678047
tree_50.34272244027773546

Change 376099 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/MjoLniR@master] [WIP] Prune trees out of ensemble

https://gerrit.wikimedia.org/r/376099

debt subscribed.

We'll need to decide if this is just interesting or actually useful. We'll need to look at this further, moving to backlog board.

Move into later column based on my comment above.

This was investigated. The final decision was that while interesting, it didn't provide enough benefit over the cost of both maintenance and model build time increases.

Change 376099 abandoned by EBernhardson:
[WIP] Prune trees out of ensemble

Reason:
interesting, but was too expensive to really use

https://gerrit.wikimedia.org/r/376099