Build and AB test an ML Model with all the features exploded into individual pieces
Closed, Declined (Public)

Description

Our current model combines several signals into single features that could be separated out:

  • BM25 matches from the standard and plain analyzers are joined together; these could be separate features
  • We have a couple of dismax queries, for example taking the 'best' of opening_text or text
  • We have the ability to expose TF and IDF as independent features per-field
  • Length of fields

We could also add in some other features we have discussed:

  • Prefix matching the title field, since there is already an index for it
  • Number of terms in query
  • Number of matching terms in field
  • Percentage of matching terms in field
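The query-level features in the list above could be sketched roughly as follows. This is only an illustration: the function name `term_match_features` and the feature names are hypothetical, whitespace splitting with lowercasing stands in for a real analyzer, "percentage of matching terms" is assumed to mean the share of query terms found in the field, and the prefix check is a plain string comparison rather than a lookup against the existing prefix index.

```python
def term_match_features(query, field_text):
    """Compute simple query/field overlap features (hypothetical helper).

    Assumption: whitespace tokenization plus lowercasing approximates
    the analysis chain; a real implementation would use the index analyzer.
    """
    query_terms = query.lower().split()
    field_terms = set(field_text.lower().split())

    # Count query terms (including repeats) that appear in the field.
    matching = sum(1 for term in query_terms if term in field_terms)
    total = len(query_terms)

    return {
        "query_term_count": total,
        "matching_term_count": matching,
        # Assumed definition: fraction of query terms found in the field.
        "matching_term_pct": matching / total if total else 0.0,
        # Assumed definition: the whole query is a prefix of the field.
        "title_prefix_match": field_text.lower().startswith(query.lower()),
    }


features = term_match_features("albert einstein physics", "Albert Einstein")
```

Here two of the three query terms appear in the title, so the match fraction is 2/3, and the prefix feature is false because the query is longer than the title.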

We should be able to build a model and compare its NDCG@10 and NDCG@3 against the simple model we are currently using, along with running a RelForge analysis to see how large the difference is between the two configurations. If it looks interesting we should run an AB test with it.
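For reference, NDCG@k for a single query can be computed along these lines. This is a minimal sketch using the common exponential-gain form (2^rel - 1) with a log2 position discount; the training tooling may use a different gain variant, and the relevance labels below are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k results, in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Illustrative labels for one query, in the order each model ranked them.
simple_model = [1, 3, 0, 2]
exploded_model = [3, 2, 1, 0]
delta = ndcg_at_k(exploded_model, 3) - ndcg_at_k(simple_model, 3)
```

Averaging the per-query values over the evaluation set gives the number to compare between the two models.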

We also might want to do some feature selection to prune it back, but that can be computationally expensive. Basically we need to train models with a "leave one out" strategy, and look at which features have little or no (or possibly negative) impact.
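The "leave one out" idea could look something like the sketch below. The names `leave_one_out_impact` and `train_and_score` are hypothetical: `train_and_score` is assumed to be a caller-supplied function that trains a model on the given feature names and returns a validation metric such as mean NDCG@10.

```python
def leave_one_out_impact(features, train_and_score):
    """Estimate each feature's contribution by retraining without it.

    Returns a dict mapping feature name to (baseline score - score
    without that feature). Values near zero or below suggest the
    feature is a candidate for pruning.
    """
    baseline = train_and_score(list(features))
    impact = {}
    for name in features:
        reduced = [f for f in features if f != name]
        impact[name] = baseline - train_and_score(reduced)
    return impact


# Toy stand-in for real training: the score is a sum of fixed, made-up
# per-feature contributions, purely to show the shape of the output.
toy_weights = {"title_bm25": 0.3, "text_length": 0.0, "noisy_feature": -0.1}


def toy_train_and_score(feature_names):
    return 0.5 + sum(toy_weights[name] for name in feature_names)


impact = leave_one_out_impact(list(toy_weights), toy_train_and_score)
```

This requires one training run per feature plus one for the baseline, which is where the computational cost comes from.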

If the larger number of features is promising we could also try applying PCA for dimensionality reduction. It is uncertain at this time whether it will improve things, but generally we should be able to train a model and look at NDCG scores to determine if it seems useful.

Event Timeline

This has been completed in another ticket. The current model starts with ~250 features, including many pieces of the similarity calculations. This is then pruned down to 50 using a feature selection algorithm.