Take the click data generated in T162054 and generate labels using a Dynamic Bayesian Network (DBN). This has previously been tested using https://github.com/varepsilon/clickmodels and Spark. DBN training is independent per query, so the job can be fairly naively parallelized as long as all data for a wiki+query pair is on a single partition.
The process should return a dataframe containing (wikiid, norm_query, hit_page_id, relevance) rows, which can then feed an output table in the discovery database in Hive.
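Below is a minimal PySpark sketch of the per-query parallelization and the output shape described above. The source table name (`discovery.click_data`), its columns, and the `train_dbn` helper are placeholders for illustration; the actual training would wrap the clickmodels DBN, which is not shown here.

```python
# A minimal sketch, assuming a source dataframe with columns
# (wikiid, norm_query, session_id, hit_page_id, hit_position, clicked).
# `train_dbn` is a hypothetical placeholder for the clickmodels DBN training.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType)

spark = SparkSession.builder.appName('dbn_relevance_labels').getOrCreate()
clicks = spark.table('discovery.click_data')  # assumed source table name

OUT_SCHEMA = StructType([
    StructField('wikiid', StringType()),
    StructField('norm_query', StringType()),
    StructField('hit_page_id', IntegerType()),
    StructField('relevance', DoubleType()),
])

def train_dbn(rows):
    """Hypothetical wrapper around the clickmodels DBN: build sessions from
    the click rows for one (wiki, query) pair and return
    [(hit_page_id, relevance), ...] pairs."""
    return []

labels_rdd = (
    clicks.rdd
    # Key by (wiki, query) so all data for one query lands in one group and
    # each DBN is trained independently, giving naive parallelism.
    .map(lambda r: ((r.wikiid, r.norm_query), r))
    .groupByKey()
    .flatMap(lambda kv: [
        (kv[0][0], kv[0][1], page_id, float(rel))
        for page_id, rel in train_dbn(list(kv[1]))
    ])
)

labels = spark.createDataFrame(labels_rdd, OUT_SCHEMA)
```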
Considerations:
* We probably want a sample of the original clicks rather than everything, although if the data size/processing requirements aren't too excessive we could calculate everything.
** If we do calculate everything, we still need the sampling step below at some point. Perhaps this will be informed by testing various data sizes; it seems likely that for a given number of features there is a limit to the amount of data that continues to improve the result.
* Use stratified sampling, with groups based on the popularity percentile of the query, to ensure we get equal samples of both popular and unpopular queries (see the sampling sketch after this list). This may be overkill since we are taking a large number of samples from each group, but it shouldn't hurt much. The job should be parameterized to take the sample from a few different parts of the log, for example (not sure exactly yet):
** Top 1k most popular queries
** 10k random queries from the top 50k queries
** 10k random queries with enough click data to train a DBN
* Data from different wikis should be grouped independently
** We might only have enough data for the top 10 or 15 wikis
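A sketch of the stratified sampling idea above, not a decided design: query popularity is approximated here as distinct sessions per (wikiid, norm_query), bucketed into per-wiki deciles, with an equal fraction sampled from each bucket. The table name, bucket count, and fractions are placeholders.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
clicks = spark.table('discovery.click_data')  # assumed source table name

# Approximate query popularity as distinct sessions per (wiki, query).
query_counts = (
    clicks.groupBy('wikiid', 'norm_query')
    .agg(F.countDistinct('session_id').alias('num_sessions'))
)

# Assign each query a popularity decile within its wiki; ntile gives
# equal-sized buckets, so equal fractions yield equal query counts per bucket.
w = Window.partitionBy('wikiid').orderBy(F.col('num_sessions').desc())
deciles = query_counts.withColumn('popularity_decile', F.ntile(10).over(w))

fractions = {d: 0.1 for d in range(1, 11)}  # placeholder sampling fractions
sampled_queries = deciles.sampleBy('popularity_decile', fractions, seed=42)

# Keep only the click data for the sampled queries.
sampled_clicks = clicks.join(
    sampled_queries.select('wikiid', 'norm_query'),
    on=['wikiid', 'norm_query'],
    how='inner',
)
```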
Data must be purged after 90 days to comply with the privacy policy. We generally want as much data as possible to feed into the DBN, so perhaps the job should run weekly and bring in the last 83 days' worth of click data, getting purged a week later when new data is generated.
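For illustration, the weekly job's read window could be computed like this, assuming the source table is partitioned by a `dt` date-string column (both names are assumptions):

```python
# Read the most recent 83 days so that, with a weekly cadence, the oldest
# day used is still inside the 90-day purge window until the next run.
from datetime import date, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
clicks = spark.table('discovery.click_data')  # assumed source table name

end = date.today()
start = end - timedelta(days=83)
window_clicks = clicks.where(
    (F.col('dt') >= start.isoformat()) & (F.col('dt') < end.isoformat())
)
```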