Extract TF and IDF based features in the elasticsearch learning to rank plugin
Closed, ResolvedPublic
Authored By

	EBernhardson
	Jun 8 2017, 4:29 PM

Description

For example with the query 'what is love' I would like to be able to extract the following features:

Sum of term frequencies
Mean term frequency
Max individual term frequency
Min individual term frequency
Standard deviation of term frequencies

And the same for IDF. These are somewhat standard features commonly found in LTR literature. For context we will likely be extracting these for multiple fields, probably for each article against the title, opening text, text, categories, headings, redirects, and auxiliary text. We also have two different analysis chains for each field, so in the end it's something like 5*2*7*2=140 features.
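As a rough illustration of the five aggregates listed above, here is a minimal Python sketch. It assumes the per-term values have already been read from the index. The function name `aggregate`, the sample frequencies, and the choice of population (rather than sample) standard deviation are all assumptions made for the example, not something the task specifies.

```python
import statistics


def aggregate(values):
    """Return the five summary features for a list of per-term values.

    An empty list (no query terms) yields zeros; that convention is an
    assumption of this sketch.
    """
    if not values:
        return {"sum": 0.0, "mean": 0.0, "max": 0.0, "min": 0.0, "stddev": 0.0}
    return {
        "sum": float(sum(values)),
        "mean": statistics.fmean(values),
        "max": float(max(values)),
        "min": float(min(values)),
        # Population standard deviation; the task does not say which variant.
        "stddev": statistics.pstdev(values),
    }


# Hypothetical term frequencies of "what", "is", "love" in one document field.
tf = [2, 5, 1]
features = aggregate(tf)
```

The same function would apply unchanged to a list of per-term IDF values.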

We may find that some, or heck even all, are not useful features so it would be good to have a passable implementation first that can be used to train a model, evaluate impact, and then if necessary extend it into something more performant for production usage. For performance I'm mostly thinking about total latency.

We could extract each feature individually, or it seems plausible the set of 5 or even all 10 (TF and IDF) could be extracted in a single pass. A single pass has the benefit of only running through the analysis stage once instead of 10 times, and similarly extracting the basic TF and IDF data from Lucene a single time. Having not measured I can't say how important that is vs the relatively simple implementation of doing a query for each individual feature.
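The single-pass idea could look roughly like the sketch below: one loop over the query terms feeds two accumulators, and all 10 values are read out at the end. This is only an illustration in Python, not the plugin's code. The names `RunningStats` and `extract_all`, the feature-name suffixes, and the input of precomputed `(tf, idf)` pairs are hypothetical.

```python
import math


class RunningStats:
    """Accumulates sum, sum of squares, max, and min in one pass."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.max = -math.inf
        self.min = math.inf

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x
        self.max = max(self.max, x)
        self.min = min(self.min, x)

    def features(self, suffix):
        names = ("sum", "mean", "max", "min", "stddev")
        if self.n == 0:
            return {f"{name}_{suffix}": 0.0 for name in names}
        mean = self.total / self.n
        # Population variance; clamp at 0 to absorb floating-point error.
        variance = max(self.total_sq / self.n - mean * mean, 0.0)
        values = (self.total, mean, self.max, self.min, math.sqrt(variance))
        return {f"{name}_{suffix}": float(v) for name, v in zip(names, values)}


def extract_all(per_term):
    """per_term: iterable of (tf, idf) pairs, one pair per analyzed query term."""
    tf_stats, idf_stats = RunningStats(), RunningStats()
    for tf, idf in per_term:
        tf_stats.add(tf)
        idf_stats.add(idf)
    return {**tf_stats.features("raw_tf"), **idf_stats.features("idf")}


# Hypothetical (tf, idf) values for the three terms of "what is love".
result = extract_all([(2, 1.0), (5, 0.5), (1, 3.0)])
```

Whether this beats ten separate queries on latency would still need to be measured, as noted above.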

Implementation Notes:

In the end I think we'd like to have a new Elasticsearch query like:
"match_explorer": {
    "query": "what is love",
    "field": "title",
    "analyzer": "my_custom_analyzer",
    "output": "sum_raw_tf"
}

The "analyzer" parameter is optional; by default the search_analyzer of the field should be used.
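To make the "one query per feature" variant concrete, the sketch below generates one proposed `match_explorer` body per combination and confirms the 5*2*7*2=140 count from the description. It is a hedged illustration: the field names, analyzer names, and every `output` value other than `sum_raw_tf` are hypothetical placeholders, and the two analysis chains are modelled here simply as two analyzer names.

```python
from itertools import product

STATS = ["sum", "mean", "max", "min", "stddev"]
VALUES = ["raw_tf", "idf"]  # "idf" is a hypothetical output suffix
# Hypothetical index field names for the seven fields in the description.
FIELDS = ["title", "opening_text", "text", "category",
          "heading", "redirect", "auxiliary_text"]
# Hypothetical names for the two analysis chains.
ANALYZERS = ["plain_search", "text_search"]


def build_queries(query_text):
    """Return a dict mapping a feature name to a match_explorer query body."""
    queries = {}
    for field, analyzer, stat, value in product(FIELDS, ANALYZERS, STATS, VALUES):
        name = f"{field}_{analyzer}_{stat}_{value}"
        queries[name] = {
            "match_explorer": {
                "query": query_text,
                "field": field,
                "analyzer": analyzer,
                "output": f"{stat}_{value}",
            }
        }
    return queries


queries = build_queries("what is love")
```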

For a start I'd look only at the raw values you'll find in PostingsEnum (raw_tf) and in TermStatistics/CollectionStatistics (raw_df, raw_ttf), but later it'd be interesting to extract the normalized values used by a Similarity function like BM25. Sadly the similarity classes do not seem to expose such values in a standard way, so we may end up writing a custom "score output" for BM25 or any other similarity function we are interested in.
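As an example of what such a normalized value looks like, the sketch below derives an IDF from a raw document frequency using the formula that Lucene's BM25Similarity applies, ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). The function name `bm25_idf` and the sample numbers are assumptions for illustration; the task does not fix which IDF variant the features should use.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25-style IDF computed from raw statistics.

    doc_freq:  number of documents containing the term (raw_df)
    doc_count: number of documents that have a value for the field
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics: a rare term and a common term in 1000 documents.
rare = bm25_idf(10, 1000)
common = bm25_idf(500, 1000)
```

A rare term gets a larger value than a common one, and the result stays positive even when every document contains the term.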

I'd add a simple "num_terms" output to return the number of terms in the query.

Event Timeline

EBernhardson created this task. Jun 8 2017, 4:29 PM

Restricted Application added a subscriber: Aklapper. Jun 8 2017, 4:29 PM

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Jun 8 2017, 4:29 PM

debt triaged this task as Medium priority. Jun 13 2017, 5:23 PM

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Jun 20 2017, 6:55 PM

Patch at https://github.com/worleydl/elasticsearch-learning-to-rank/tree/feature/explorer

This has been merged to the 1_0 branch: https://github.com/o19s/elasticsearch-learning-to-rank/commits/1_0

debt closed this task as Resolved. Jul 7 2017, 9:06 PM
