Experiment with different grouping of queries that get fed into the DBN
Closed, Declined · Public

Description

To improve the quality of training data we could try a few things:

  • For the most part we don't care about the ordering of terms. Currently we pre-group queries that have exact matches on the stemmed query string, but we could also try sorting the terms within the stemmed query (see the sketch after this list).
  • [NO TASK] Experiment with different thresholds for the minimum group size fed into the DBN. Currently we filter out groups with fewer than 10 sessions, but we could experiment with both larger and smaller thresholds to see whether the training data improves.
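
A minimal sketch of what both ideas could look like, in plain Python; `stem` is a placeholder for whatever per-term stemmer the pipeline actually uses, and `filter_small_groups` is a hypothetical helper rather than real pipeline code:

```python
from collections import defaultdict

def stem(term: str) -> str:
    # Placeholder: the real pipeline applies an actual per-term stemmer.
    return term.lower()

def grouping_key(query: str, sort_terms: bool = True) -> str:
    # Stem each term, then optionally sort so that term order is ignored.
    terms = [stem(t) for t in query.split()]
    if sort_terms:
        terms.sort()
    return " ".join(terms)

def filter_small_groups(sessions, min_sessions=10):
    # Drop query groups observed in fewer than min_sessions distinct
    # sessions; `sessions` is an iterable of (session_id, query) pairs.
    sessions = list(sessions)
    groups = defaultdict(set)
    for session_id, query in sessions:
        groups[grouping_key(query)].add(session_id)
    keep = {key for key, ids in groups.items() if len(ids) >= min_sessions}
    return [(sid, q) for sid, q in sessions if grouping_key(q) in keep]

# With sort_terms=True, "barack obama" and "Obama barack" share one group:
assert grouping_key("barack obama") == grouping_key("Obama barack")
```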

Unfortunately, evaluating these changes to the training data is difficult. The best approach might be to simply train models and run A/B tests with them, as long as the results don't look particularly bad.

Event Timeline

debt triaged this task as Medium priority. Oct 5 2017, 5:10 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt subscribed.

This will ramp up after T177302 is completed

Moving to the sprint board as T177302 is done and we've already done a portion of this in T176493.

dcausse renamed this task from Experiement with different grouping of queries that get fed into the DBN to Experiment with different grouping of queries that get fed into the DBN. Nov 20 2017, 9:57 AM
dcausse claimed this task.
dcausse removed dcausse as the assignee of this task. Edited Nov 20 2017, 12:53 PM
dcausse subscribed.

Moving back to the backlog, as this task actually covers 2 experiments and I thought it was new:

Should we experiment with a smaller group size, e.g. 5?
Should we experiment with the second suggestion, reordering query terms?

> Moving back to the backlog, as this task actually covers 2 experiments and I thought it was new:
>
> Should we experiment with a smaller group size, e.g. 5?

This depends on the results of the previous A/B test. Since both 20 and 35 were noticeably worse than our arbitrarily chosen default of 10, I think it's worth testing groups with a smaller minimum size.

> Should we experiment with the second suggestion, reordering query terms?

I think this is worthwhile. I do wonder if it will invalidate some of the things we learn about query group sizing though. I suppose it really depends on what hidden variable is changing the user behaviour:

  • Is less training data from larger group sizes resulting in less optimization? I don't think this is the case, as the model's offline NDCG@10 shows a larger increase over the baseline with the larger query groupings (see the NDCG@10 sketch after this list).
  • Are the larger minimum DBN group sizes throwing out important long-tail (or middle-tail? there is still a huge long tail beyond 10 sessions per group) training data? My intuition is that this is what's happening, but I'm not sure how to validate it.
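
For reference, a small sketch of the NDCG@10 metric mentioned in the first bullet, using the common linear-gain formulation (the pipeline may well use the exponential 2^rel - 1 gain instead); the relevance labels in the example are made up:

```python
import math

def dcg_at_k(rels, k=10):
    # Discounted cumulative gain: graded relevance, discounted by rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # Normalize by the DCG of the same labels in ideal (sorted) order.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# rels: e.g. DBN-estimated relevance labels for one query's ranked results.
print(ndcg_at_k([3, 2, 3, 0, 1, 2]))  # ≈ 0.96
```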

Assuming the second is true, it seems sorting the terms may allow us to get even more of the long-tail information into the DBN. On the other hand, it may have the unintended effect of putting together queries that aren't as related as we hope. I'm optimistic the second stage of grouping will negate any poor handling here.

I also think this can at least partly be tested offline:

  • Measure the number of first- and second-stage groups (or only second-stage? the first might be interesting but not particularly useful) for both methods of normalizing. It might also be interesting to pull basic count statistics (min/max/std, from Spark's df.describe()) on sessions per group; see the sketch after this list.
  • We could perhaps look at a small sampling of the groups to decide whether they look any better or worse, or whether there are obvious cases where the wrong things are being grouped together.
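
A sketch of that first measurement in PySpark; the parquet path and the norm_query / session_id column names are hypothetical stand-ins for whatever the click-log data actually uses:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per (session_id, norm_query) observation,
# where norm_query is the grouping key (stemmed, optionally term-sorted).
clicks = spark.read.parquet("hdfs:///path/to/click_logs")

sessions_per_group = (
    clicks
    .groupBy("norm_query")
    .agg(F.countDistinct("session_id").alias("num_sessions")))

# How many groups would survive each candidate minimum-size threshold:
for threshold in (5, 10, 20, 35):
    n = sessions_per_group.filter(F.col("num_sessions") >= threshold).count()
    print(threshold, n)

# Basic count statistics (count/mean/stddev/min/max) on sessions per group:
sessions_per_group.describe("num_sessions").show()
```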

Maybe more? Not sure.

As discussed in today's sprint planning, we'll need to run this test again with a smaller sample size.

To clarify, the "smaller" refers to the sizes of the groups. Basically, 10 (the default) performed noticeably better than both grouping sizes we tested (20 and 35). We will run again with smaller sizes (TBD; maybe 5, 8, and 15? 12? I'm not sure).

Gehel subscribed.

Closing for now. This might be reactivated as part of a larger initiative to improve MLR, but it does not make much sense on its own.