
Generate DBN relevance labels from click data
Closed, Resolved · Public

Description

Take the click data generated in T162054 and generate labels using a Dynamic Bayesian Network (DBN). This has previously been tested using https://github.com/varepsilon/clickmodels and spark. DBN training is independent per-query, so the job can be fairly naively parallelized as long as all data for a wiki+query is on a single partition.

The process should return a dataframe containing (wikiid, norm_query, hit_page_id, relevance) rows.
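
Since training is independent per query, the Spark side mostly needs to deliver all of a (wikiid, norm_query) group's sessions to a single task and collect the resulting labels. Below is a minimal sketch of that structure only, not the final implementation: the column names (wikiid, norm_query, session_id, hit_page_id, hit_position, clicked), the input path, and the train_dbn() stand-in (where the real job would fit clickmodels' DbnModel) are all illustrative assumptions.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical input path and schema for the click data from T162054.
clicks = spark.read.parquet('hdfs:///path/to/click/data')

def train_dbn(sessions):
    """Placeholder for fitting clickmodels' DbnModel on one query's sessions.

    As a stand-in it returns per-page clickthrough rate so the sketch runs;
    the real job would return the DBN relevance estimate per page.
    """
    shown, clicked = {}, {}
    for session in sessions:
        for page_id, was_clicked in session:
            shown[page_id] = shown.get(page_id, 0) + 1
            clicked[page_id] = clicked.get(page_id, 0) + int(was_clicked)
    return {p: float(clicked[p]) / shown[p] for p in shown}

def train_group(key, rows):
    """Train one model per (wikiid, norm_query) group and emit label rows."""
    wikiid, norm_query = key
    # Rebuild each session's result list in display order.
    sessions = {}
    for r in rows:
        sessions.setdefault(r.session_id, []).append(
            (r.hit_position, r.hit_page_id, r.clicked))
    ordered = [[(p, c) for _, p, c in sorted(hits)] for hits in sessions.values()]
    for page_id, relevance in train_dbn(ordered).items():
        yield Row(wikiid=wikiid, norm_query=norm_query,
                  hit_page_id=page_id, relevance=float(relevance))

labels = (clicks.rdd
    .map(lambda r: ((r.wikiid, r.norm_query), r))
    .groupByKey()  # all data for a wiki+query lands in one task
    .flatMap(lambda kv: train_group(kv[0], kv[1]))
    .toDF())
```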

Considerations:

  • We probably want a sample of the original clicks rather than everything, although if the data size and processing requirements aren't too excessive we could calculate everything.
    • If we do calculate everything, we still need the sampling step below at some point. Perhaps this will be informed by testing various sizes of data. It seems likely that for a given number of features there is a limit to the amount of data that continues to improve the result.
  • Use stratified sampling, with groups based on the popularity percentile of each query, to ensure we get equal samples of both popular and unpopular queries (see the sketch after this list). This may be overkill since we are taking a large number of samples from each group, but it shouldn't hurt much.
  • Data from different wikis should be grouped independently
    • We might only have enough data for the top 10 or 15 wikis
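
A rough sketch of that stratified sampling idea follows; the ten-bucket split, the per-bucket fractions, and the column names are assumptions, and `clicks` is the hypothetical click dataframe from the earlier sketch.

```python
from pyspark.sql import Window, functions as F

# Popularity = number of distinct sessions per (wikiid, norm_query).
popularity = (clicks
    .groupBy('wikiid', 'norm_query')
    .agg(F.countDistinct('session_id').alias('num_sessions')))

# Assign each query to one of 10 popularity percentile buckets within its wiki.
buckets = popularity.withColumn(
    'bucket', F.ntile(10).over(
        Window.partitionBy('wikiid').orderBy('num_sessions')))

# Sample the same fraction from every bucket; since ntile buckets hold roughly
# equal numbers of queries, this yields similar counts of popular and
# unpopular queries.
fractions = {b: 0.1 for b in range(1, 11)}
sampled_queries = buckets.stat.sampleBy('bucket', fractions, seed=12345)

# Keep only the click rows whose (wikiid, norm_query) was sampled.
sampled_clicks = clicks.join(
    sampled_queries.select('wikiid', 'norm_query'),
    on=['wikiid', 'norm_query'], how='inner')
```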

Event Timeline

A basic form of this was written for my initial evaluation, currently located on stat1002.eqiad.wmnet:/a/ebernhardson/spark_feature_log/code/data_process_dbn.py

For posterity also copying it into a pastie: P5191

In initial tests of this I grouped queries by their normalized form (determined by applying the standard English stemmer from Lucene to queries from English Wikipedia), treating everything with the same normalized form as the same query. That may be a bit too naive, but it seems like a reasonable first step. We probably need to browse through some examples to see whether anything particularly crazy and different ends up grouped together. We could also do a slightly more expensive comparison, where queries not only need to stem to similar results but their returned result sets must be within X% of each other to be considered the "same" query with slightly different text.
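
For illustration only, a stand-in for that normalization using NLTK's Snowball stemmer; the actual grouping ran queries through Lucene's standard English stemmer (the search analysis chain), not anything in Python.

```python
from nltk.stem.snowball import SnowballStemmer

_stemmer = SnowballStemmer('english')

def norm_query(query):
    """Collapse superficially different query strings onto one normalized form."""
    return ' '.join(_stemmer.stem(token) for token in query.lower().split())

# e.g. norm_query('Apple pies') == norm_query('apple pie') == 'appl pie'
```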

Additionally, after training the DBN, duplicate (query, page_id) rows should probably be de-duplicated, with a weight column added to indicate how many copies there were. The DBN itself needs to receive each user session as an independent item, but later feature generation and training will not want to deal with all of these duplicates. Whether the weight is used in the training step is TBD.
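
A small sketch of that de-duplication step, assuming a hypothetical labeled_clicks dataframe with one (wikiid, norm_query, hit_page_id, relevance) row per observation:

```python
from pyspark.sql import functions as F

deduplicated = (labeled_clicks
    .groupBy('wikiid', 'norm_query', 'hit_page_id')
    .agg(F.count('*').alias('weight'),            # how many copies there were
         F.first('relevance').alias('relevance')))  # identical within a group
```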

Some partial stats on what our distribution of queries looks like. I'll run this again once we have more data; this is only against ~10 days, while we plan to use 80-90 days. We will be able to calculate 60 days' worth of data in the next day or two, but due to retention on the web request logs it will only grow one day at a time after that until we reach 90 days.

Query used: P5263
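
P5263 isn't reproduced here; purely as a reading aid for the tables below, the aggregation presumably looks something like the following sketch, with assumed column names and reusing the hypothetical `clicks` dataframe from earlier.

```python
from pyspark.sql import functions as F

# Sessions observed per normalized query.
per_query = (clicks
    .groupBy('wikiid', 'norm_query')
    .agg(F.countDistinct('session_id').alias('sessions')))

def stats_at(min_sessions):
    """Counts surviving a minimum-sessions-per-normalized-query cutoff."""
    kept = per_query.where(F.col('sessions') >= min_sessions)
    kept_clicks = clicks.join(kept.select('wikiid', 'norm_query'),
                              on=['wikiid', 'norm_query'], how='inner')
    return kept_clicks.agg(
        F.countDistinct('wikiid', 'norm_query').alias('distinct_normalized_queries'),
        F.countDistinct('session_id').alias('distinct_sessions'),
        F.countDistinct('query').alias('distinct_queries')).collect()[0]

for threshold in (1, 5, 10, 15, 20, 25, 30, 35):
    print(threshold, stats_at(threshold))
```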

| min sessions per norm query | distinct normalized queries | distinct sessions | distinct queries |
| --- | --- | --- | --- |
| 1 | 3,436,731 | 5,652,051 | 4,101,427 |
| 5 | 129,098 | 1,482,335 | 449,516 |
| 10 | 40,872 | 929,192 | 206,112 |
| 15 | 21,180 | 700,815 | 130,281 |
| 20 | 13,069 | 565,223 | 92,464 |
| 25 | 8,991 | 476,496 | 70,790 |
| 30 | 6,586 | 412,090 | 56,484 |
| 35 | 5,018 | 362,126 | 46,404 |

Initial concern: using queries with >= 10 sessions covers only ~16% of the sessions, and going all the way to >= 35 covers 6.5%. I expect aggregating over a longer time span will bring these percentages up, but how far? The addition of manually labeled data for the long tail will probably be relatively important, and it will need reasonable weights set on it.

with 40 days:

| min sessions per query | distinct normalized queries | distinct sessions | distinct queries |
| --- | --- | --- | --- |
| 1 | 9,471,451 | 19,638,238 | 12,236,331 |
| 5 | 557,497 | 7,980,012 | 2,174,441 |
| 10 | 207,927 | 5,765,237 | 1,174,304 |
| 15 | 117,861 | 4,719,483 | 813,983 |
| 20 | 78,232 | 4,056,662 | 621,691 |
| 25 | 56,881 | 3,591,601 | 503,888 |
| 30 | 43,731 | 3,238,903 | 423,572 |
| 35 | 34,961 | 2,959,767 | 364,689 |

Queries with >= 10 sessions increased to 29%, and >= 35 sessions to 15%. The calculation for the full 60 days' worth of data is still running.

The backfilling job has finished, so we now have click logs for Feb 13 - Apr 12.

| min sessions per query | distinct normalized queries | distinct sessions | distinct queries |
| --- | --- | --- | --- |
| 1 | 12,356,403 | 27,710,350 | 16,370,353 |
| 5 | 811,905 | 12,477,675 | 3,268,928 |
| 10 | 316,172 | 9,329,258 | 1,839,073 |
| 15 | 182,576 | 7,775,370 | 1,300,387 |
| 20 | 123,671 | 6,789,824 | 1,012,995 |
| 25 | 90,970 | 6,077,090 | 829,914 |
| 30 | 70,323 | 5,523,346 | 701,877 |
| 35 | 56,881 | 5,095,263 | 611,103 |

Over 60 days, normalized queries with >= 10 sessions make up 33.6% of all sessions, and queries with >= 35 sessions make up 18.3% of all sessions. This will rise a little more as we approach 90 days, but I wouldn't expect much more than perhaps 40% and 20% respectively.

Change 347566 had a related patch set uploaded (by EBernhardson):
[search/MjoLniR@master] Sample input sessions

https://gerrit.wikimedia.org/r/347566

Change 347566 merged by DCausse:
[search/MjoLniR@master] Sample input sessions

https://gerrit.wikimedia.org/r/347566

EBernhardson renamed this task from "Oozie job for generating DBN relevance labels from click data" to "Generate DBN relevance labels from click data". May 30 2017, 4:45 PM
EBernhardson updated the task description.

We have decided to skip the Oozie portion of things for now. I've adjusted the task slightly to indicate this is about implementing the sampling of input queries, the calculation of relevance labels, and attaching those labels to the input events.
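
For the last of those steps, attaching the labels back to the input events is essentially a join on the label keys. A minimal sketch, reusing the hypothetical sampled_clicks and labels dataframes from the sketches above:

```python
# Attach each DBN relevance label back onto the sampled click events it was
# derived from; dataframe and column names are assumptions carried over from
# the earlier sketches.
labeled_events = sampled_clicks.join(
    labels, on=['wikiid', 'norm_query', 'hit_page_id'], how='inner')
```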

debt added a subscriber: debt.

This has been merged but is not yet in production.