
Evaluate libraries for ranking with LambdaRank
Closed, Resolved (Public)

Description

There are three main libraries available for generating LambdaRank models. Evaluate them to decide which we should use:

  • RankLib
  • XGBoost
  • LightGBM

Evaluation criteria should include:

  • Training efficiency, or how much computational power it takes to train a model
    • Relatedly, whether the library can be deployed to our hadoop cluster, either by using multiple executors to train one model, or by training multiple models in parallel when searching for ideal hyperparameters.
  • Ease of performing hyperparameter optimization, and whether the set of available parameters is sufficient
  • Ability to transform resulting models into a format suitable for loading into the elasticsearch LTR plugin
  • Similarity of results from transformed models inserted into elasticsearch to those generated on the test data set (they should be exactly the same, assuming no updates to the search index)
  • How easy/hard it is to set up and run these models in our analytics network

Event Timeline

Some initial thoughts:

  • All three libraries have slightly different input formats, so we can't generate one output and feed it into all three (see the sketch after this list).
  • xgboost has hadoop integration, where it creates executors and parallelizes the training of models
  • the xgboost model result is in a binary format; converting this to something we can use in elasticsearch may be a pain but is certainly doable
  • xgboost has a python integration library
  • lightgbm has its own native parallelization based on sockets; this will probably be difficult to deploy in our analytics network
  • the lightgbm model result is in the form of an .ini file. Not sure yet what all the parameters mean, but it shouldn't be crazy hard to transform into another format.
  • RankLib has no parallelization, but as a plain java application we can spin it up inside hadoop executors and run multiple copies for hyperparameter optimization
  • the RankLib output format is XML and relatively easy to understand. The o19s plugin takes this format natively; our plugin uses a json format quite similar to the XML structure.
  • RankLib does not support providing weights to data points, while LightGBM and XGBoost do. This may be important for including human relevance labels from Discernatron with enough weight that they influence the final model.
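
To make the input-format differences concrete, here is a rough sketch, assuming the standard xgboost/lightgbm Python APIs and RankLib's LETOR-style text files; the data, labels and group sizes are made up for illustration:

```
# All three libraries need the per-query grouping, but express it differently.
import numpy as np
import xgboost as xgb
import lightgbm as lgb

X = np.random.rand(6, 3)            # 6 (query, doc) rows, 3 features
y = np.array([3, 1, 0, 2, 0, 1])    # relevance labels
groups = [4, 2]                     # first 4 rows are query 1, last 2 are query 2

# xgboost: group sizes are attached to the DMatrix
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)

# lightgbm: group sizes are a constructor argument on the Dataset
ltrain = lgb.Dataset(X, label=y, group=groups)

# RankLib: LETOR/SVMLight-style text file, one line per (query, doc) pair
with open('train.txt', 'w') as f:
    qids = [1] * 4 + [2] * 2
    for label, qid, feats in zip(y, qids, X):
        cols = ' '.join('%d:%f' % (i + 1, v) for i, v in enumerate(feats))
        f.write('%d qid:%d %s\n' % (label, qid, cols))
```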

While exploring our data, I see that training a ranker only on queries that have been repeated at least 35 times in a 60 day period limits us to including ~20% of sessions in training. Dropping the threshold to 10 repeats still only brings that up to ~40% of sessions. I spent some time exploring and found another plausible algorithm that can learn from queries that have only seen a single click: Propensity SVMRank.

This is from a very recent paper from Thorsten Joachims, who wrote several seminal papers in MLR. It is currently used for learning the weights used to power the search at arxiv.org. Being an SVM it's perhaps not as flexible as LambdaRank-based solutions, but it has the benefit of being able to learn from clicks without requiring multiple clicks per query. Joachims theorizes that the same principles could be applied to pairwise and listwise ranking algorithms, but the effort to do so is left up to the reader and I don't think we have the expertise to do that.

A high level description from a presentation: "If you do propensity weighting you are actually making the variance of your learning algorithm worse in return for having low bias. If you have more and more data your variance goes down. In a big data regime really bias is the thing you really want to avoid, and that's what this propensity weighting does."

If I understand the paper/talk correctly, the only special thing we would need is to run an AB test to measure propensities per rank. This test would require reliably swapping result 1 with result k (I think k should stay consistent per-query, but vary across the set of possible queries), and then measuring the click-through rate of the same result at the top position and at position k. The propensity at k is then (clicks at k / clicks at 1).
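
A hypothetical sketch of that calculation, and of how the resulting propensities would be used as inverse-propensity weights during training; the counts and names are mine, not from the paper:

```
def estimate_propensities(clicks_at_k, clicks_at_1):
    """propensity(k) ~= clicks on a result shown at rank k, divided by clicks
    on the same result when it was shown at rank 1, from the swap test."""
    return {k: clicks_at_k[k] / clicks_at_1[k] for k in clicks_at_k}

# made-up counts from a hypothetical swap experiment
clicks_at_k = {2: 800, 5: 300, 10: 120}
clicks_at_1 = {2: 1000, 5: 1000, 10: 1000}
propensity = estimate_propensities(clicks_at_k, clicks_at_1)
# => {2: 0.8, 5: 0.3, 10: 0.12}

# A click observed at rank k is then weighted by 1 / propensity(k) in training,
# so clicks at rarely-examined ranks count for more.
def click_weight(rank):
    return 1.0 if rank == 1 else 1.0 / propensity[rank]
```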

There are a couple of possibilities, all probably requiring some level of evaluation:

  • Serve all queries with a Propensity SVMRank-based ranker
  • Serve all queries with a LambdaRank-based ranker
  • Build a bloom filter that encodes all queries issued >= X times in our data collection period, and choose between SVMRank and LambdaRank depending on how popular the user's query is (see the sketch after this list). This bloom filter could probably be held in redis.
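
A minimal, self-contained sketch of that routing idea; a real deployment would use an existing bloom filter implementation (possibly redis-backed, as noted above), and all names here are hypothetical:

```
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, n_hashes=5):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # derive n_hashes bit positions from salted md5 digests
        for i in range(self.n_hashes):
            h = hashlib.md5(('%d:%s' % (i, key)).encode('utf8')).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# Encode queries seen >= X times in the collection period, then route.
popular = BloomFilter()
for q in ['some popular query', 'another popular query']:   # stand-in for the real list
    popular.add(q)

def choose_ranker(query):
    # false positives only mean an unpopular query gets the LambdaRank model
    return 'lambdarank' if query in popular else 'svmrank'
```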

@EBernhardson—Interesting stuff. It would require looking more closely at the math for me to have a firmer notion of whether this is feasible for us, but as always I applaud you for looking out there to see what interesting research is going on.

Some high-level information about speed of training; model evaluation is going to take a little more work.

source: 20k normalized queries each from enwiki, dewiki, frwiki and ruwiki (80k total)
n rows in set: 9,543,038
n groups (queries): 458,756
n features: 11
matrix size: 53,867,089

RankLib LambdaMART 5-fold CV (training 5 times), ~300 trees, 10 leaves: ~120 minutes total, ~24 minutes each, ~4s per tree
xgboost w/yarn and 10 workers, 4 cores each, 400 trees, rank:ndcg, tree_method=hist, no test/train split (yet): Broken, trees have 0 nodes
xgboost w/yarn and 10 workers, 4 cores each, 400 trees, rank:ndcg, tree_method=approx, no test/train split (yet): ~10.5minutes, 1.6s per tree

xgboost w/yarn and 10 workers, 4 cores each, 400 trees, rank:pairwise, tree_method=hist, no test/train split (yet): ~7.5 minutes, 1.1s per tree
xgboost w/yarn and 10 workers, 4 cores each, 400 trees, rank:pairwise, tree_method=approx, no test/train split (yet): ~10 minutes, 1.5s per tree

xgboost local (~10 cores utilized), 400 trees, rank:ndcg tree_method=hist, depth=4, no test/train split (yet): ~17 minutes, 2.5s per tree
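
For reference, a rough sketch of what the runs above correspond to with xgboost's local Python API (toy data standing in for the real set; the yarn submission and the actual training script are not shown here):

```
import numpy as np
import xgboost as xgb

# toy stand-in data; the real runs use ~9.5M rows across ~460k query groups
X = np.random.rand(100, 11)
y = np.random.randint(0, 4, size=100)
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([10] * 10)          # 10 query groups of 10 rows each

params = {
    'objective': 'rank:ndcg',        # or 'rank:pairwise' for those runs
    'tree_method': 'approx',         # 'hist' with rank:ndcg was broken, per above
    'max_depth': 4,
    'eval_metric': 'ndcg',
}
booster = xgb.train(params, dtrain, num_boost_round=400)
```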

Local xgboost is slightly faster than RankLib per model, but not quite 2x, so the difference really isn't that important compared to model performance (still to be evaluated; it requires hyperparameter tuning, which will also vary the time taken as tree sizes and the learning rate change). The distributed version of xgboost is a bit more than 2x faster than local, but the new histogram-based training doesn't work with ndcg; the traditional approximation method is still ~2.5x faster than RankLib.

Based on the relatively minor difference in training speed of single models, either could be reasonable. It will depend, though, on how much more time each takes once we do hyperparameter optimization.

Other:

  • I was having problems with distributed xgboost on the 0.60 release; the results above use the master branch (which includes tree_method=hist, based off lightgbm)

After poking around at the various options, I settled on xgboost running over yarn. The integration with hadoop/spark makes it very convenient to distribute work to whatever machines have compute resources available, and to take data in a format we already have (spark dataframes). The scores from training the same data with the different implementations were close enough to not be a big deal.
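
On the model-format criterion above: xgboost's binary model can at least be dumped to a per-tree JSON text form, which would be a starting point for transforming into whatever the elasticsearch LTR plugin expects. A rough sketch using the standard Python API (toy model; the plugin's exact format is not shown here):

```
import json
import numpy as np
import xgboost as xgb

# tiny toy model just to have something to dump
X, y = np.random.rand(20, 11), np.random.randint(0, 4, size=20)
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group([10, 10])
booster = xgb.train({'objective': 'rank:pairwise', 'max_depth': 4}, dtrain, 10)

booster.save_model('model.bin')                      # native binary format
tree_dumps = booster.get_dump(dump_format='json')    # one JSON string per tree
trees = [json.loads(t) for t in tree_dumps]
with open('model.json', 'w') as f:
    json.dump(trees, f)   # raw material for building the plugin's expected format
```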


Cool!