To improve the quality of training data we could try a few things:
- For the most part we don't care about the ordering of terms. Currently we pre-group queries whose stemmed query strings match exactly, but we could also try sorting the terms within the stemmed query, so that queries differing only in term order land in the same group.
- [NO TASK] Experiment with different thresholds for the minimum group size fed into the DBN. Currently we filter out groups with fewer than 10 sessions, but we could try both larger and smaller thresholds to see whether that improves the training data.
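
A rough sketch of how the two ideas above could be combined, to make the knobs concrete. Everything here is an assumption for illustration: the `(session_id, stemmed_query)` input shape, the function names, and splitting the stemmed query on whitespace are hypothetical and not taken from the actual pipeline.

```python
from collections import defaultdict


def order_insensitive_key(stemmed_query):
    """Grouping key that ignores term order (assumes whitespace-separated terms)."""
    return " ".join(sorted(stemmed_query.split()))


def group_sessions(sessions, min_sessions=10, sort_terms=True):
    """Group sessions by query and drop groups below the size threshold.

    sessions: iterable of (session_id, stemmed_query) pairs (hypothetical shape).
    min_sessions: groups with fewer distinct sessions than this are filtered out.
    sort_terms: if False, fall back to exact matching on the stemmed string.
    """
    groups = defaultdict(set)
    for session_id, stemmed_query in sessions:
        key = order_insensitive_key(stemmed_query) if sort_terms else stemmed_query
        groups[key].add(session_id)
    # Keep only groups large enough to be fed into the DBN.
    return {key: ids for key, ids in groups.items() if len(ids) >= min_sessions}
```

Running this with `sort_terms` on and off, across a range of `min_sessions` values, would show how many groups and sessions survive under each setting. That only measures coverage of the training data, not its quality.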
Unfortunately, evaluating these changes to the training data is difficult. The best option might be to simply train models on the changed data and run AB tests with them, as long as the results don't look particularly bad.