
Extract TF and IDF based features in the elasticsearch learning to rank plugin
Closed, Resolved · Public


For example with the query 'what is love' I would like to be able to extract the following features:

Sum of term frequencies
Mean term frequency
Max individual term frequency
Min individual term frequency
Standard deviation of term frequencies

And the same for IDF. These are fairly standard features, commonly found in the LTR literature. For context, we will likely be extracting these for multiple fields, probably for each article against the title, opening text, text, categories, headings, redirects, and auxiliary text. We also have two different analysis chains for each field, so in the end it's something like 5 stats * 2 (TF and IDF) * 7 fields * 2 analyzers = 140 features.
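As a sketch of what these aggregates look like (plain Python over made-up per-term frequencies; the output names mirror the `sum_raw_tf` style suggested below, but nothing here is plugin code):

```python
import statistics

def tf_stats(term_freqs):
    """Aggregate per-term raw frequencies into the five candidate features."""
    return {
        "sum_raw_tf": sum(term_freqs),
        "mean_raw_tf": statistics.mean(term_freqs),
        "max_raw_tf": max(term_freqs),
        "min_raw_tf": min(term_freqs),
        # population stddev; whether sample stddev is wanted is an open choice
        "stddev_raw_tf": statistics.pstdev(term_freqs),
    }

# e.g. hypothetical raw frequencies of "what", "is", "love" in one title field
print(tf_stats([3, 7, 2]))
```

The IDF variants would be the same five aggregates computed over per-term IDF values instead.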

We may find that some, or heck even all, are not useful features, so it would be good to have a passable implementation first that can be used to train a model and evaluate impact, and then if necessary extend it into something more performant for production use. For performance I'm mostly thinking about total latency.

We could extract each feature individually, or it seems plausible the set of 5, or even all 10 (TF and IDF), could be extracted in a single pass. A single pass has the benefit of only running through the analysis stage once instead of 10 times, and similarly of extracting the basic TF and IDF data from Lucene a single time. Having not measured, I can't say how important that is versus the relatively simple implementation of issuing one query per feature.
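To illustrate the single-pass idea: if one walk over the postings yields a (tf, df) pair per term, all ten features fall out of that one map. A hypothetical Python sketch (names and the idf transform are mine, not the plugin's):

```python
import math
import statistics

def all_features(term_stats, num_docs):
    """term_stats: {term: (raw_tf, raw_df)}, collected once per (field, analyzer).

    Derives all 10 aggregates (5 over TF, 5 over IDF) from that single map,
    instead of re-running analysis and postings lookups per feature.
    """
    tfs = [tf for tf, _ in term_stats.values()]
    # classic log idf for illustration; the plugin could equally expose raw_df
    idfs = [math.log(num_docs / df) for _, df in term_stats.values()]
    feats = {}
    for name, vals in (("raw_tf", tfs), ("idf", idfs)):
        feats[f"sum_{name}"] = sum(vals)
        feats[f"mean_{name}"] = statistics.mean(vals)
        feats[f"max_{name}"] = max(vals)
        feats[f"min_{name}"] = min(vals)
        feats[f"stddev_{name}"] = statistics.pstdev(vals)
    return feats

# made-up stats for the query "what is love" against a 1000-doc index
print(all_features({"what": (3, 900), "is": (7, 950), "love": (2, 40)}, 1000))
```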

Implementation Notes:

In the end I think we'd like to have a new Elasticsearch query like:

    "match_explorer": {
        "query": "what is love",
        "field": "title",
        "analyzer": "my_custom_analyzer",
        "output": "sum_raw_tf"
    }

("analyzer" is optional; the search_analyzer of the field should be used by default.)
For a start I'd look only at raw values you'll find in PostingsEnum (raw_tf) and CollectionStatistics (raw_df, raw_ttf), but later it'd be interesting to extract the normalized values used by a Similarity function like BM25. Sadly, similarity classes do not seem to expose such values in a standard way, so we may end up writing a custom "score output" for BM25 or any other similarity function we are interested in.
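For reference, the kind of normalized values a BM25 "score output" would need, sketched in Python. The formulas are the standard Lucene BM25 shape (idf and saturated, length-normalized tf); the function and parameter names are mine, not anything the plugin exposes:

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Lucene-style BM25 idf: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(raw_tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25's saturated, length-normalized term frequency.

    Grows with raw_tf but is bounded above by k1 + 1, and is penalized
    for documents longer than average (controlled by b).
    """
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return raw_tf * (k1 + 1) / (raw_tf + norm)
```

These per-term values would then feed the same sum/mean/max/min/stddev aggregates as the raw ones.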

I'd add a simple "num_terms" output to return the number of terms in the query.