Search Relevance Survey test #3: analysis of test
Open, Normal · Public · 4 Story Points

Description

We'll use this ticket to monitor the progress of the analysis of the third run of this test. The test is expected to be turned on the week of Sep 5 and run for at least 7 days.

debt created this task. Sep 5 2017, 5:28 PM
debt updated the task description.
EBernhardson added a comment (edited). Sep 9 2017, 2:58 AM

Rather than continuing to pester you at 8 PM on a Friday about the WIP report, here are a few comments on the text:

> The “MLR (20)” experimental group had results ranked by machine learning with a rescore window of 20. This means the model was trained against labeled data for the first 20 results that were displayed to users.

The rescore doesn't affect the training; it affects the query-time evaluation. It means that each shard (of which enwiki has 7) applies the model to its top 20 results. Those 140 results are then collected and sorted to produce the top 20 shown to the user. Same for 1024, but with the bigger window (7,168 docs total).
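For the curious, at query time a rescore like that looks roughly like the sketch below. This is a hypothetical sketch: the index name, query text, and model name are made up, and the "sltr" clause assumes the Elasticsearch LTR plugin.

```python
# Hypothetical sketch of a query-time rescore with window 20; the index,
# query text, and model name are illustrative, and the "sltr" clause
# assumes the Elasticsearch LTR plugin.
from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    "size": 20,
    "query": {"match": {"text": "some user query"}},  # cheap base retrieval
    "rescore": {
        "window_size": 20,  # applied per shard: 7 shards -> 140 docs rescored
        "query": {
            "rescore_query": {
                "sltr": {
                    "model": "enwiki_mlr",  # hypothetical model name
                    "params": {"query_string": "some user query"},
                }
            }
        },
    },
}
hits = es.search(index="enwiki_content", body=body)
```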

> uses a Deep Belief Network

As mentioned on IRC, it's actually a Dynamic Bayesian Network (https://en.wikipedia.org/wiki/Dynamic_Bayesian_network). It is based on http://olivier.chapelle.cc/pub/DBN_www2009.pdf and we are using the implementation from https://github.com/varepsilon/clickmodels. It might be worth calling out somewhere that this is how we take click data from users and translate it into labels to train models with.
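In rough outline, that click-to-label pipeline looks something like the sketch below. Treat it as a hypothetical sketch: the reader arguments are from memory of that repo and may need adjusting, and the file name and label helper are made up.

```python
# Hypothetical sketch: train the DBN click model on logged sessions, then
# collapse its per-result estimates into LTR labels. Reader arguments are
# from memory of github.com/varepsilon/clickmodels and may need adjusting;
# the file name and helper function are made up.
from clickmodels.inference import DbnModel
from clickmodels.input_reader import InputReader

# (min docs per query, max docs per query, extended log format,
#  SERP size, train-for-metric)
reader = InputReader(1, 20, False, 20, False)
with open('click_log.tsv') as f:        # sessions in the repo's TSV format
    sessions = reader(f)

model = DbnModel((0.9, 0.9, 0.9, 0.9))  # gamma parameters
model.train(sessions)

def dbn_label(attractiveness: float, satisfaction: float) -> float:
    """Collapse the DBN's two per-result estimates (which the trained
    model exposes per query/URL) into one label, linearly scaled
    to [0, 10]."""
    return 10.0 * attractiveness * satisfaction
```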

mpopov claimed this task. Sep 28 2017, 4:43 PM
mpopov set the point value for this task to 4.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
EBernhardson added a comment (edited). Oct 10 2017, 5:21 PM

Tangentially related, I wonder if this can be used to better tune the DBN data as well. Basically, the DBN gives us attractiveness and satisfaction percentages, which we currently just multiply together and then linearly scale up to [0, 10]. We could potentially take the values from this click model, along with a couple of other click models (implemented in the same repository) that make different assumptions, and then learn a simple model that combines the information from the various click models to approximate the data we get out of the relevance surveys (this requires survey data on queries that also have enough sessions to train click models on). Or maybe that ends up being too many layers of ML; not sure.
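If anyone wants to poke at that, here's a minimal sketch of the idea. All file and array names are hypothetical; in practice the features would come from the DBN and the sibling models in the clickmodels repo, restricted to queries that also have survey coverage.

```python
# Minimal sketch: learn to combine several click models' outputs so the
# blend approximates relevance-survey scores. Feature and target arrays
# are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# One row per (query, page): e.g. [DBN a, DBN s, UBM relevance, DCM relevance]
X = np.load('click_model_features.npy')  # shape (n_pairs, n_models)
y = np.load('survey_scores.npy')         # survey-derived score in [0, 10]

blend = LinearRegression()
print(cross_val_score(blend, X, y, cv=5, scoring='r2'))  # sanity check
blend.fit(X, y)

# The blended label would replace the plain 10 * a * s heuristic.
labels = np.clip(blend.predict(X), 0, 10)
```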

debt added a comment. Oct 31 2017, 8:08 PM

Models have finished training and we just need to finish up the analysis, yay!

debt added a comment. Mon, Nov 27, 4:42 PM

Nice job, @mpopov! @EBernhardson and @TJones can you both take a look, please? :)

Cool stuff, @mpopov!

I was worried about only having a binary classifier, but I see in the conclusion that it can get mapped to a 0-10 scale. Have you looked at the distribution (or the distribution when mapped to a 0-3 scale) to see if it matches the distribution of Discernatron scores in a reasonable way? I don't recall whether the Discernatron scores were, for example, strongly unimodal, or strongly bimodal, or just generally lumpy.

Overall this is a wonderful, complex analysis, and it looks like we now know what question to use and how best to turn survey data into training data. I hope it all leads to even better models!
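One quick way to run that distribution check might be the sketch below (the input arrays are hypothetical placeholders): map the classifier probabilities onto [0, 10], bin both sets onto the 0-3 grading scale, and compare the histograms, plus a two-sample test on the raw values.

```python
# Sketch of the distribution check: compare survey-based scores (mapped
# from classifier probabilities to [0, 10]) against Discernatron scores.
# Input files are hypothetical placeholders.
import numpy as np
from scipy import stats

survey_scores = 10 * np.load('classifier_probabilities.npy')  # in [0, 10]
discernatron = np.load('discernatron_scores.npy')             # graded 0-3

# Map the [0, 10] scores onto the 0-3 grading scale (floor of score/2.5,
# clipped so a perfect 10 lands in grade 3) for a like-for-like comparison.
survey_grades = np.floor(survey_scores / 2.5).clip(0, 3).astype(int)

print(np.bincount(survey_grades, minlength=4) / len(survey_grades))
disc_grades = np.round(discernatron).astype(int)
print(np.bincount(disc_grades, minlength=4) / len(disc_grades))

# Two-sample KS test on the rescaled raw values as a rough similarity check.
print(stats.ks_2samp(survey_scores / 10, discernatron / 3))
```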

I'll check out how it compares with respect to the distribution!

Many thanks! Both @EBernhardson and I (and maybe others) were interested in the comparison. It'll be neat to see.