High-level task organizing the necessary adjustments to the elasticsearch learning to rank plugin, and the additional custom query types we want to make available in elasticsearch for learning new models.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Invalid | None | | T174064 [FY 2017-18 Objective] Implement advanced search methodologies |
| Resolved | | EBernhardson | T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results |
| Resolved | | dcausse | T162062 Specialize elasticsearch learning to rank plugin for our use case |
Event Timeline
A somewhat open question here is which direction we go for the elasticsearch plugin; there are two options:
https://github.com/o19s/elasticsearch-learning-to-rank
- Simple and straightforward, maybe too simple?
- Expects more of the code talking to it.
- Nothing built in for feature logging, instead preferring to use the _msearch api with many queries provided (see the sketch after this list).
- Feature queries must be provided with each search query. This seems undesirable because it allows for there to be discrepancies between queries used to train a model, and queries used for production ranking.
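To make that concrete, here is a rough Python sketch of what feature logging would look like on the calling side with the stock _msearch API: one sub-search per feature query, restricted to the candidate documents, with the per-document `_score` read back as the feature value. All index names, fields, and feature queries below are invented for illustration; only `_msearch` itself is the real API.

```python
# A minimal sketch (names hypothetical) of caller-side feature logging against
# the o19s plugin: the calling code owns the feature queries and replays them
# through _msearch, one sub-search per feature.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# Feature queries live in the calling code, not in the plugin or the model.
FEATURE_QUERIES = {
    'title_match': lambda q: {'match': {'title': q}},
    'body_match': lambda q: {'match': {'text': q}},
}

def log_features(index, candidate_ids, user_query):
    """Return {feature_name: {doc_id: score}} for the candidate documents."""
    body = []
    for build_query in FEATURE_QUERIES.values():
        body.append({'index': index})
        body.append({
            'size': len(candidate_ids),
            'query': {
                'bool': {
                    # Restrict scoring to the candidate set; the feature query
                    # in the must clause determines the score.
                    'filter': {'ids': {'values': candidate_ids}},
                    'must': build_query(user_query),
                }
            },
        })
    responses = es.msearch(body=body)['responses']
    features = {}
    for name, response in zip(FEATURE_QUERIES, responses):
        # Documents that don't match a feature query simply don't appear,
        # i.e. that feature value is implicitly 0 for them.
        features[name] = {
            hit['_id']: hit['_score'] for hit in response['hits']['hits']
        }
    return features
```

This is exactly the discrepancy risk noted above: nothing forces the FEATURE_QUERIES used for training logs to match the queries used at ranking time.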
https://github.com/wikimedia/search-ltr
- Port of the Solr learning to rank plugin
- Has fairly complicated data plumbing, and is thousands of lines of code
- Relatedly, has bugs that haven't been figured out.
- Stores all queries related to a model within the model itself, making deployments much easier
- Has features built in for logging results for training purposes. Sub-optimal though as the only destination is log4j.
- Has a place for feature normalization. With LambdaRank this isn't conceptually necessary, but some implementations may work better with normalized data. Needs evaluation (a rough sketch of what normalization involves follows this list).
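For illustration, the kind of normalization the Solr port has a slot for is along these lines (a small standalone sketch, not the plugin's actual API): bounds are learned from the training data and must be applied identically at ranking time, which is why having a place for them in the stored model matters.

```python
# Min-max normalization of raw feature values into [0, 1].
def fit_min_max(values):
    """Return (lo, hi) bounds for a list of raw feature values."""
    return min(values), max(values)

def normalize(value, lo, hi):
    """Scale a raw feature value into [0, 1]; degenerate ranges map to 0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)

# Example: bm25-style scores collapsed into [0, 1] before training.
raw_title_scores = [0.0, 3.2, 7.5, 12.1]
lo, hi = fit_min_max(raw_title_scores)
normalized = [normalize(v, lo, hi) for v in raw_title_scores]
# normalized == [0.0, 0.264..., 0.619..., 1.0]
```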
Missing in both
- Ideally we would like to find a way to derive new features from combinations of other features without re-calculating them (see the sketch below). While decision trees have the ability to use multiple features to come up with a final ranking, sometimes (needs evaluation) they do better when features are pre-combined, such as perhaps (bm25 title match) * (contains featured_article template). This may be premature and unnecessary for initial deployment.
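A hypothetical sketch of that pre-combined feature idea, working from already-logged feature values rather than re-running queries (feature names invented for illustration):

```python
def add_derived_features(feature_row):
    """feature_row: dict of feature name -> value for one (query, doc) pair."""
    derived = dict(feature_row)
    # e.g. boost title matches on featured articles without issuing a new query
    derived['title_match_x_featured'] = (
        feature_row.get('title_match', 0.0)
        * feature_row.get('has_featured_template', 0.0)
    )
    return derived

row = {'title_match': 7.5, 'has_featured_template': 1.0}
print(add_derived_features(row))
# {'title_match': 7.5, 'has_featured_template': 1.0, 'title_match_x_featured': 7.5}
```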
Currently leaning towards forking the o19s plugin and upstreaming where appropriate. The simplicity is nice and makes things much easier to understand, but the lack of feature logging and of storing queries with the models seems undesirable. We can upstream code as we go, but if upstream isn't interested or wants to go a different way, a fork prevents us from getting blocked.
Working on top of o19s for now, trying to find the best place to parse stored feature queries.
@EBernhardson, not sure how we'll review changes on the o19s plugin... e.g. I've uploaded https://github.com/o19s/elasticsearch-learning-to-rank/pull/27 (not directly related to this task)
WIP patch, untested, still a lot to do: https://github.com/nomoa/elasticsearch-learning-to-rank/tree/feature_store
FTR: features for v1 are being discussed here: https://docs.google.com/document/d/1_DWPmLi9oDem3QWxQAgKbqZ5F9oo0XXyHicsXmbWFDQ/edit#heading=h.4ph8tp8jfkwp