Take the click data generated in T162054 and generate labels using a Dynamic Bayesian Network (DBN). This approach has previously been tested using https://github.com/varepsilon/clickmodels together with Spark. DBN training is independent per query, so the job can be parallelized fairly naively as long as all data for a given wiki + query pair lands on a single partition.
The process should return a dataframe containing (wikiid, norm_query, hit_page_id, relevance) rows.
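As a rough sketch of the expected shape, assuming a flattened (wikiid, norm_query, hit_page_id, clicked) input; all names here are hypothetical, and the clickthrough-rate estimate is only a stand-in for real DBN inference (which would call into the clickmodels library per group):

```python
from collections import defaultdict

def train_per_group(sessions):
    """Group click observations by (wikiid, norm_query) and estimate a
    relevance score per hit_page_id within each group.

    `sessions` is an iterable of (wikiid, norm_query, hit_page_id, clicked)
    tuples -- a hypothetical flattened form of the click data.
    NOTE: the clickthrough-rate estimate below is a placeholder for actual
    DBN inference; it only illustrates the per-group parallelization
    structure and the (wikiid, norm_query, hit_page_id, relevance) output.
    """
    groups = defaultdict(list)
    for wikiid, norm_query, page_id, clicked in sessions:
        groups[(wikiid, norm_query)].append((page_id, clicked))

    rows = []
    for (wikiid, norm_query), hits in groups.items():
        shown = defaultdict(int)
        clicks = defaultdict(int)
        for page_id, clicked in hits:
            shown[page_id] += 1
            clicks[page_id] += int(clicked)
        for page_id in shown:
            relevance = clicks[page_id] / shown[page_id]
            rows.append((wikiid, norm_query, page_id, relevance))
    return rows
```

In the Spark version each `(wikiid, norm_query)` group would become one task, which is why co-locating all rows for a wiki + query on a single partition matters.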
- We probably want a sample of the original clicks rather than everything, although if the data size and processing requirements aren't excessive we could compute everything.
- Even if we do compute everything, the sampling step below will still be needed at some point. Testing with various data sizes may inform this; it seems likely that for a given number of features there is a limit beyond which additional data stops improving the result.
- Use stratified sampling, with strata based on the popularity percentile of each query, to ensure we get equal samples of popular and unpopular queries. This may be overkill since we take a large number of samples from each stratum, but it shouldn't hurt much.
- Data from different wikis should be grouped independently.
- We might only have enough data for the top 10 or 15 wikis.
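The stratified-sampling idea could be sketched like this in pure Python (all names hypothetical; a real Spark job would more likely use something like `DataFrame.stat.sampleBy`). Queries are bucketed into popularity bands per wiki and an equal number is drawn from each band:

```python
import random
from collections import Counter

def stratified_query_sample(query_log, n_per_stratum, n_strata=4, seed=0):
    """Sample queries stratified by popularity percentile, per wiki.

    `query_log` is a list of (wikiid, norm_query) occurrences. Queries are
    sorted by frequency within each wiki, split into `n_strata` equal-size
    popularity bands, and up to `n_per_stratum` queries are drawn from each
    band. Wikis are handled independently, matching the grouping note above.
    """
    rng = random.Random(seed)
    counts = Counter(query_log)  # (wikiid, norm_query) -> frequency
    by_wiki = {}
    for (wikiid, q), c in counts.items():
        by_wiki.setdefault(wikiid, []).append((q, c))

    sampled = []
    for wikiid, qs in by_wiki.items():
        qs.sort(key=lambda x: x[1])  # ascending popularity
        stratum_size = max(1, len(qs) // n_strata)
        for i in range(0, len(qs), stratum_size):
            stratum = [q for q, _ in qs[i:i + stratum_size]]
            k = min(n_per_stratum, len(stratum))
            sampled.extend((wikiid, q) for q in rng.sample(stratum, k))
    return sampled
```

Because the draw count per stratum is fixed rather than proportional, rare and popular queries end up equally represented, which is the point of stratifying here.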