High-level task organizing the necessary adjustments to the elasticsearch learning to rank plugin, and the additional custom query types we want to make available in elasticsearch for learning new models.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Invalid | None | | T174064 [FY 2017-18 Objective] Implement advanced search methodologies |
| Resolved | | EBernhardson | T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results |
| Resolved | | dcausse | T162062 Specialize elasticsearch learning to rank plugin for our use case |
Event Timeline
A somewhat open question here is which direction we go for the elasticsearch plugin; there are two options:
https://github.com/o19s/elasticsearch-learning-to-rank
- Simple and straightforward, maybe too simple?
- Expects more of the code talking to it.
- Nothing built in for feature logging, instead preferring to use the _msearch api with many queries provided (see the sketch after this list).
- Feature queries must be provided with each search query. This seems undesirable because it allows for there to be discrepancies between queries used to train a model, and queries used for production ranking.
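To make that concrete, here is a rough Python sketch of what feature logging would look like on the calling side with the stock _msearch API: one sub-search per feature query, restricted to the candidate documents, with the per-document `_score` read back as the feature value. All index names, fields, and feature queries below are invented for illustration; only `_msearch` itself is the real API.

```python
# A minimal sketch (names hypothetical) of caller-side feature logging against
# the o19s plugin: the calling code owns the feature queries and replays them
# through _msearch, one sub-search per feature.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# Feature queries live in the calling code, not in the plugin or the model.
FEATURE_QUERIES = {
    'title_match': lambda q: {'match': {'title': q}},
    'body_match': lambda q: {'match': {'text': q}},
}

def log_features(index, candidate_ids, user_query):
    """Return {feature_name: {doc_id: score}} for the candidate documents."""
    body = []
    for build_query in FEATURE_QUERIES.values():
        body.append({'index': index})
        body.append({
            'size': len(candidate_ids),
            'query': {
                'bool': {
                    # Restrict scoring to the candidate set; the feature query
                    # in the must clause determines the score.
                    'filter': {'ids': {'values': candidate_ids}},
                    'must': build_query(user_query),
                }
            },
        })
    responses = es.msearch(body=body)['responses']
    features = {}
    for name, response in zip(FEATURE_QUERIES, responses):
        # Documents that don't match a feature query simply don't appear,
        # i.e. that feature value is implicitly 0 for them.
        features[name] = {
            hit['_id']: hit['_score'] for hit in response['hits']['hits']
        }
    return features
```

This is exactly the discrepancy risk noted above: nothing forces the FEATURE_QUERIES used for training logs to match the queries used at ranking time.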
https://github.com/wikimedia/search-ltr
- Port of the Solr learning to rank plugin
- Has fairly complicated data plumbing, and is thousands of lines of code
- Relatedly, has bugs that haven't been figured out.
- Stores all queries related to a model within the model itself, making deployments much easier
- Has features built in for logging results for training purposes. Sub-optimal though as the only destination is log4j.
- Has a place for feature normalization. With LambdaRank this isn't conceptually necessary, but some implementations may work better with normalized data. Needs evaluation (a rough sketch of what normalization involves follows this list).
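For illustration, the kind of normalization the Solr port has a slot for is along these lines (a small standalone sketch, not the plugin's actual API): bounds are learned from the training data and must be applied identically at ranking time, which is why having a place for them in the stored model matters.

```python
# Min-max normalization of raw feature values into [0, 1].
def fit_min_max(values):
    """Return (lo, hi) bounds for a list of raw feature values."""
    return min(values), max(values)

def normalize(value, lo, hi):
    """Scale a raw feature value into [0, 1]; degenerate ranges map to 0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

# Example: bm25-style scores collapsed into [0, 1] before training.
raw_title_scores = [0.0, 3.2, 7.5, 12.1]
lo, hi = fit_min_max(raw_title_scores)
normalized = [normalize(v, lo, hi) for v in raw_title_scores]
# normalized == [0.0, 0.264..., 0.619..., 1.0]
```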
Missing in both
- Ideally we would like to find a way to derive new features from combinations of other features without re-calculating them (see the sketch below). While decision trees have the ability to use multiple features to come up with a final ranking, sometimes (needs evaluation) they do better when features are pre-combined, such as perhaps (bm25 title match) * (contains featured_article template). This may be premature and unnecessary for initial deployment.
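A hypothetical sketch of that pre-combined feature idea, working from already-logged feature values rather than re-running queries (feature names invented for illustration):

```python
def add_derived_features(feature_row):
    """feature_row: dict of feature name -> value for one (query, doc) pair."""
    derived = dict(feature_row)
    # e.g. boost title matches on featured articles without issuing a new query
    derived['title_match_x_featured'] = (
        feature_row.get('title_match', 0.0)
        * feature_row.get('has_featured_template', 0.0)
    )
    return derived

row = {'title_match': 7.5, 'has_featured_template': 1.0}
print(add_derived_features(row))
# {'title_match': 7.5, 'has_featured_template': 1.0, 'title_match_x_featured': 7.5}
```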
Currently leaning towards forking the o19s plugin and upstreaming where appropriate. The simplicity is nice and makes things much easier to understand, but the lack of feature logging and of storing queries with the models seems undesirable. We can upstream code as we go, but if upstream isn't interested or wants to go a different way, a fork prevents us from getting blocked.
Working on top of o19s for now, trying to find the best place to parse stored feature queries.
@EBernhardson, not sure how we'll review changes on the o19s plugin... e.g. I've uploaded https://github.com/o19s/elasticsearch-learning-to-rank/pull/27 (not directly related to this task)
WIP patch, untested, still a lot to do: https://github.com/nomoa/elasticsearch-learning-to-rank/tree/feature_store
FTR: features for v1 are being discussed here: https://docs.google.com/document/d/1_DWPmLi9oDem3QWxQAgKbqZ5F9oo0XXyHicsXmbWFDQ/edit#heading=h.4ph8tp8jfkwp