We'll use this ticket to monitor the progress of the analysis of the third run of this test. The test is expected to be turned on the week of Sep 5 and to run for at least 7 days.
|Open||None||T174064 [FY 2017-18 Objective] Implement advanced search methodologies|
|Open||None||T171740 [Epic] Search Relevance: graded by humans|
|Open||None||T174106 Search Relevance Survey test #3: action items|
|Open||mpopov||T175048 Search Relevance Survey test #3: analysis of test|
|Resolved||mpopov||T178096 Make a Puppet profile/role for doing R-based heavy stats/ML on Wikimedia Cloud|
Rather than continuing to pester you at 8pm on a Friday about the WIP report, a few comments on the text:
The “MLR (20)” experimental group had results ranked by machine learning with a rescore window of 20. This means the model was trained against labeled data for the first 20 results that were displayed to users.
The rescore doesn't affect the training; it affects the query-time evaluation. It means that each shard (of which enwiki has 7) applies the model to its top 20 results. Those 140 results are then collected and sorted to produce the top 20 shown to the user. The same applies for 1024, just with the bigger window (7,168 docs total).
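To make the shard arithmetic concrete, here's a toy sketch of the query-time rescore flow (illustrative only; Elasticsearch does this internally, and the scoring function below is a stand-in for the trained MLR model, not the real thing):

```python
# Toy sketch of a rescore window across shards at query time.
# Assumptions: 7 shards (as on enwiki), a window of 20, and a fake
# per-document score standing in for the MLR model.

def rescore_shard(shard_hits, window, model_score):
    """Apply the model to only the top `window` hits of one shard."""
    return [(doc, model_score(doc)) for doc in shard_hits[:window]]

def query_time_rescore(shards, window, model_score, n_results=20):
    # Each shard independently rescores its own top `window` hits...
    candidates = []
    for shard_hits in shards:
        candidates.extend(rescore_shard(shard_hits, window, model_score))
    # ...then the 7 * window candidates (140 for a window of 20) are
    # merged and sorted to produce the final top-N shown to the user.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in candidates[:n_results]]

# 7 shards of 100 docs each; the fake "model" scores by position number.
shards = [[f"s{s}d{d}" for d in range(100)] for s in range(7)]
top20 = query_time_rescore(
    shards, window=20, model_score=lambda doc: int(doc.split("d")[1])
)
print(len(top20))  # 20 results, chosen from 7 * 20 = 140 candidates
```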
uses a Deep Belief Network
As mentioned on IRC, it's actually a https://en.wikipedia.org/wiki/Dynamic_Bayesian_network. It is based on http://olivier.chapelle.cc/pub/DBN_www2009.pdf and we are using the implementation from https://github.com/varepsilon/clickmodels. It might be worth calling out somehow that this is how we take click data from users and translate it into labels to train models with.
Tangentially related, I wonder if this can be used to better tune the DBN data as well. Basically, the DBN gives us attractiveness and satisfaction percentages, which we currently just multiply together and then linearly scale up to [0, 10]. We could potentially take the values from this click model, along with a couple of other click models (implemented in the same repository) that make different assumptions, and then learn a simple model that combines the information from the various click models to approximate the data we get out of the relevance surveys (this requires having survey data on queries for which we also have enough sessions to train click models). Or maybe that ends up being too many layers of ML; not sure.
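For reference, the current DBN-to-label step described above (multiply attractiveness by satisfaction, scale linearly to [0, 10]) is roughly this; the per-(query, page) probabilities here are made up for illustration, not real DBN output:

```python
# Sketch of turning DBN estimates into training labels: relevance is
# P(attracted) * P(satisfied | clicked), linearly scaled to [0, 10].
# The probabilities below are invented for the example.

dbn_output = {
    ("some query", "Some_Article"):  {"attractiveness": 0.8, "satisfaction": 0.5},
    ("some query", "Other_Article"): {"attractiveness": 0.3, "satisfaction": 0.9},
}

def dbn_label(attractiveness, satisfaction, scale=10):
    # Combine the two DBN estimates and scale to the label range.
    return attractiveness * satisfaction * scale

labels = {key: dbn_label(**probs) for key, probs in dbn_output.items()}
print(labels)
```

Learning to combine several click models would replace the fixed `attractiveness * satisfaction` product here with a model fit against the survey-derived scores.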
Cool stuff, @mpopov!
I was worried about only having a binary classifier, but I see in the conclusion that it can get mapped to a 0-10 scale. Have you looked at the distribution (or the distribution when mapped to a 0-3 scale) to see if it matches the distribution of Discernatron scores in a reasonable way? I don't recall whether the Discernatron scores were, for example, strongly unimodal, or strongly bimodal, or just generally lumpy.
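One way to eyeball that comparison: map the classifier's P(relevant) onto the 0-10 scale from the conclusion, then bucket down to a 0-3 Discernatron-style scale so the two distributions line up. The bucket edges below are an assumption for illustration, not something from the report:

```python
# Hypothetical mapping from a binary classifier's P(relevant) to the
# 0-10 scale, then bucketed to a 0-3 Discernatron-style scale.
# Bucket edges are assumed (0-2 -> 0, 3-5 -> 1, 6-8 -> 2, 9-10 -> 3).

def to_ten_scale(p_relevant):
    return round(p_relevant * 10)

def to_discernatron_scale(score_0_10):
    # Integer division gives the assumed even-ish buckets; cap at 3.
    return min(score_0_10 // 3, 3)

probs = [0.05, 0.31, 0.62, 0.97]
print([to_discernatron_scale(to_ten_scale(p)) for p in probs])  # [0, 1, 2, 3]
```

A histogram of these bucketed scores could then be compared directly against the Discernatron score distribution.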
Overall this is a wonderful, complex analysis, and it looks like we now know what question to use and how best to turn survey data into training data. I hope it all leads to even better models!