Our current model combines several signals into individual features that could be separated out (see the sketch after this list):
- standard and plain analyzer BM25 matches are joined together; these could be separate features
- we have a couple of dismax queries, for example taking the 'best' of opening_text or text
- We have the ability to expose TF and IDF as independent per-field features
- Length of fields
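As a rough illustration of the split, the sketch below contrasts the current combined features with separated per-analyzer / per-field ones. The field names and score values are made up for the example, and the dismax is simplified to a plain max (ignoring any tie breaker):

```python
# Hypothetical per-field retrieval scores for one (query, doc) pair.
field_scores = {
    "title.standard_bm25": 12.3,
    "title.plain_bm25": 10.1,
    "opening_text_bm25": 8.7,
    "text_bm25": 6.2,
}

# Current style: combined features, e.g. joined analyzers and a dismax
# over opening_text/text.
combined = {
    "title_bm25": field_scores["title.standard_bm25"] + field_scores["title.plain_bm25"],
    "opening_or_text_dismax": max(field_scores["opening_text_bm25"],
                                  field_scores["text_bm25"]),
}

# Proposed style: expose each underlying score as its own feature and let
# the model learn how to weight them.
separated = dict(field_scores)

print(combined)
print(separated)
```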
We could also add some other features we have discussed (sketched after the list):
- Prefix matching on the title field, since there is already an index for it
- Number of terms in the query
- Number of query terms matching the field
- Percentage of query terms matching the field
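A minimal sketch of how those query/field features could be computed. Whitespace tokenization and lowercasing here are a stand-in for the actual analysis chain, and the function name and feature keys are illustrative only:

```python
def query_field_features(query, title, field_text):
    """Illustrative feature extraction for one (query, doc) pair."""
    q_terms = query.lower().split()
    field_terms = set(field_text.lower().split())

    num_query_terms = len(q_terms)
    num_matching = sum(1 for t in q_terms if t in field_terms)
    pct_matching = num_matching / num_query_terms if num_query_terms else 0.0
    title_prefix_match = 1.0 if title.lower().startswith(query.lower()) else 0.0

    return {
        "title_prefix_match": title_prefix_match,
        "num_query_terms": num_query_terms,
        "num_matching_terms": num_matching,
        "pct_matching_terms": pct_matching,
    }

print(query_field_features("albert einstein", "Albert Einstein",
                           "Albert Einstein was a theoretical physicist ..."))
```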
We should be able to build a model with these features and compare its NDCG@10 and NDCG@3 against the simple model we are currently using, along with running a RelForge analysis to see how large the difference between the two configurations is. If it looks interesting we should run an A/B test with it.
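For the offline NDCG comparison, something along these lines would produce the numbers to compare. The labels and scores below are random toy data; real labels would come from whatever relevance judgements we train on, and real scores from the two models being compared:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Each row is one query with ten candidate documents.
labels = np.random.RandomState(0).randint(0, 4, size=(50, 10))
scores_simple = np.random.RandomState(1).rand(50, 10)   # current simple model
scores_new = np.random.RandomState(2).rand(50, 10)      # expanded-feature model

for k in (3, 10):
    print(f"NDCG@{k}: simple={ndcg_score(labels, scores_simple, k=k):.4f} "
          f"new={ndcg_score(labels, scores_new, k=k):.4f}")
```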
We might also want to do some feature selection to prune the set back, but that can be computationally expensive. Basically we need to train models with a "leave one out" strategy and look at which features have little, no, or possibly negative impact (sketched below).
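A sketch of the "leave one out" loop. scikit-learn's GradientBoostingRegressor is used here purely as a pointwise stand-in for whatever ranker we actually train, and the helper names, the `query_ids` layout, and the NDCG@10 cutoff are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import ndcg_score

def mean_ndcg(model, X, labels, query_ids, k=10):
    """Average NDCG@k over queries, scoring each query's docs with the model."""
    scores = []
    for qid in np.unique(query_ids):
        mask = query_ids == qid
        if mask.sum() < 2:
            continue  # NDCG is undefined for a single document
        y_true = labels[mask].reshape(1, -1)
        y_score = model.predict(X[mask]).reshape(1, -1)
        scores.append(ndcg_score(y_true, y_score, k=k))
    return float(np.mean(scores))

def leave_one_out(X, labels, query_ids, feature_names, k=10):
    """Drop each feature in turn, retrain, and report the NDCG@k delta."""
    baseline_model = GradientBoostingRegressor().fit(X, labels)
    baseline = mean_ndcg(baseline_model, X, labels, query_ids, k)
    deltas = {}
    for i, name in enumerate(feature_names):
        X_drop = np.delete(X, i, axis=1)
        model = GradientBoostingRegressor().fit(X_drop, labels)
        deltas[name] = mean_ndcg(model, X_drop, labels, query_ids, k) - baseline
    return baseline, deltas
```

Features whose removal leaves the delta near zero (or positive) are the candidates for pruning.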
If the larger number of features is promising we could also try applying PCA for dimensionality reduction. It is uncertain at this time whether it will improve things, but we should be able to train a model on the reduced features and compare NDCG scores to determine whether it seems useful.
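A sketch of what the PCA step might look like, assuming we standardize the features first and keep enough components to explain ~95% of the variance (the matrix and the threshold here are arbitrary placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: rows are (query, doc) pairs, columns are features.
X = np.random.RandomState(0).rand(1000, 40)

# Standardize so fields with large raw scores do not dominate, then reduce.
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```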