
MLR: Mine and use negative samples
Closed, ResolvedPublic

Description

After enabling a relaxed profile for the search retrieval query we noticed that some results are particularly bad.
We believe one reason is that MLR never saw such results during training, so some signals might not be as strong as before; for instance:

  • a single match in the title used to be a strong signal; it no longer is when important query terms are not matched anywhere else
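One way to restore that binary strong signal would be a "full match only" title feature. A minimal sketch (a hypothetical helper for illustration, not an existing mjolnir feature):

```python
def full_title_match(query_terms, title_terms):
    """Return 1.0 only when every query term occurs in the title.

    Unlike a per-term title score, this stays at 0.0 for partial
    matches, so it remains a strong signal under relaxed retrieval.
    """
    return 1.0 if set(query_terms) <= set(title_terms) else 0.0
```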

We could try to force the model to account for those new results by mining negative samples:

  • random negatives: likely to be actually bad but possibly too easy for the model to discard
  • hard negatives: mine extra results using an IR technique (e.g. a BM25 query on the all field)
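The two mining strategies could be sketched roughly like this, assuming we already have the clicked doc ids and a pool of BM25-scored candidates from the relaxed query (the retrieval call itself is omitted; names and row shapes are assumptions for the sketch):

```python
import random

def mine_negatives(clicked_ids, scored_candidates, n_random=5, n_hard=5, seed=0):
    """Split unclicked candidates into hard and random negatives.

    scored_candidates: list of (doc_id, bm25_score) pairs, e.g. from a
    relaxed BM25 query over the all field.
    """
    rng = random.Random(seed)
    unclicked = [(d, s) for d, s in scored_candidates if d not in clicked_ids]
    # Hard negatives: highest-scoring unclicked hits -- plausible-looking
    # results the model must learn to rank below real clicks.
    hard = [d for d, _ in sorted(unclicked, key=lambda x: -x[1])[:n_hard]]
    # Random negatives: uniform sample from the remainder -- likely bad,
    # but possibly too easy for the model to learn much from.
    rest = [d for d, _ in unclicked if d not in hard]
    random_neg = rng.sample(rest, min(n_random, len(rest)))
    return hard, random_neg
```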

Open questions:

  • how many should we pull? (5 negatives per clicked result?)
  • where should they be placed initially (randomly assign a position? interleaved so that they get closer to the top?)
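Both placement options above could be prototyped with something like the following (a sketch only; the strategy names and `stride` parameter are assumptions, not anything mjolnir defines):

```python
import random

def place_negatives(results, negatives, strategy="random", stride=3, seed=0):
    """Merge negatives into a ranked result list.

    "random": assign each negative a uniformly random position.
    "interleave": insert a negative after every `stride` original
    results, starting near the top.
    """
    rng = random.Random(seed)
    merged = list(results)
    if strategy == "random":
        for neg in negatives:
            merged.insert(rng.randrange(len(merged) + 1), neg)
    else:  # interleave
        pos = stride
        for neg in negatives:
            merged.insert(min(pos, len(merged)), neg)
            # +1 accounts for the element we just inserted, keeping
            # `stride` original results between consecutive negatives.
            pos += stride + 1
    return merged
```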

AC:

  • mjolnir is able to mine negative samples
  • a new model is trained using this technique and uploaded to production for testing

Event Timeline


Rough outline of a plan; I expect this will first be worked up in a notebook and evaluated. We should be able to upload the models directly from the notebook to see them operate in prod. I'm not sure we have a way to call out models by name in a debug manner; we might have to define a prod rescore profile that can access the model variant.

  • probably work on a single wiki at first (enwiki? but the size might be annoying)
  • select the set of queries that were used in last week's training run
  • re-run those queries through the prod clusters with a relaxed match
  • drop any result we already have labels for from the DBN
  • unseen hits should get a label matching other "unseen" results (IIRC unseen gets a=0.5, s=0.5, and relevance = a*s, giving relevance=0.25; the DBN results then get a tiny downward adjustment which keeps them in result order). Not sure if we need to recreate that here, but also IIRC we do something like (int)(relevance * 10), meaning they should mostly end up with labels of 2 regardless.
  • collect feature vectors for the unseen results
    • double check that feature queries work appropriately with partial query matches (they should)
    • consider if we need feature queries that represent "full match only". For example maybe a title match feature that gives a 0 unless all words exist in the title.
  • union the mined hits and labels with the normal feature collection outputs from last week's training run
  • re-run hyperparam + training against the unified dataset
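The labeling and union steps above are simple enough to sketch directly. This assumes the label scheme described in the plan (a=0.5, s=0.5, labels bucketed as int(relevance * 10)); the row shape with `query`, `doc_id`, and `label` keys is an assumption for illustration:

```python
def unseen_label():
    # DBN gives unseen hits attractiveness a=0.5 and satisfaction s=0.5;
    # relevance = a * s = 0.25, and labels are bucketed as int(rel * 10).
    a, s = 0.5, 0.5
    return int(a * s * 10)

def union_with_mined(training_rows, mined_rows):
    """Union mined hits into the existing feature-collection output.

    Rows already labeled by the DBN win; mined duplicates are dropped
    (the "drop any result we already have labels for" step), and the
    remaining mined hits get the unseen label.
    """
    seen = {(r["query"], r["doc_id"]) for r in training_rows}
    unified = list(training_rows)
    for r in mined_rows:
        if (r["query"], r["doc_id"]) not in seen:
            unified.append(dict(r, label=unseen_label()))
    return unified
```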

Once we see all that working we can probably ponder how to fit it into mjolnir.

We experimented with this, and a model is available in production (example query), but the results just aren't good enough. Calling this complete without implementing it in mjolnir.