
Interleaved results A/B test: analysis of data
Closed, ResolvedPublic


Now that the A/B test for interleaved results is done, let's do some analysis on the data we collected!

Event Timeline

Restricted Application added a subscriber: Aklapper.

Looks like this test will be turned off either today or Wednesday, Aug 16, 2017 (Aug 15 is a WMF holiday) and so we'll need to take a look at the data received soon! :)

mpopov added subscribers: TJones, dcausse, EBernhardson.

First draft up at

@EBernhardson @dcausse please let me know if I got any of technical details wrong or if there's important stuff that I missed.

@TJones @chelsyx Please review when you can.


@mpopov great report thanks!
We deployed a model for enwiki yesterday and it's using a rescore window of 1024. Should we review the conclusion, which suggests a rescore window of 20?
I totally agree with what you suggest (1024 gives only a marginal improvement over the 20 rescore window and is computationally intensive).
@EBernhardson can tell us more, but I think one reason we deployed the 1024 rescore window is that we hope it'll push new results to the first page, giving new data to the ClickModels for training new labels on the next iteration.

Great job @mpopov !

A few comments:

  • In the "Methods" section, can we indicate the re-sample size of each iteration of bootstrapping? It may not be obvious to some readers.
  • In "Traditional test" - "Position of first click", is the difference between experimental groups significant?
  • In "Traditional test" - "Page visit times", it seems to me that users in the MLR@20 group (green line) are less likely to stay in the first 200 seconds, and there is not a big difference between MLR@1024 and BM25.
  • In "Traditional test" - "Pagination navigation":
    • "15-20 results" should be "1-20 results" in both the paragraph and the figure caption
    • The ylab of the figure should be "proportion of searches where users see additional results"
  • In "Interleaved test" - "Page visit times", the legend of the graph is cut off, and it doesn't show which line belongs to which ranking methods (only "team=A" or "team=B")
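
To make the re-sample size question concrete, here is a minimal sketch of a percentile bootstrap in which the size of each re-sample is an explicit parameter. It is only an illustration, not the code behind the report: the function name, the 0/1 input format, and the defaults (1000 iterations, re-sample size equal to the original sample size) are all assumptions.

```python
import random


def bootstrap_ci(clicked, n_iterations=1000, resample_size=None, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a clickthrough rate.

    clicked: list of 0/1 flags, one per search session (hypothetical input format).
    resample_size: observations drawn in each iteration; defaults to len(clicked),
    which is the usual choice but an assumption here, not the report's setting.
    """
    rng = random.Random(seed)
    m = resample_size or len(clicked)
    rates = []
    for _ in range(n_iterations):
        sample = rng.choices(clicked, k=m)  # draw m observations with replacement
        rates.append(sum(sample) / m)
    rates.sort()
    lower = rates[int((alpha / 2) * n_iterations)]
    upper = rates[int((1 - alpha / 2) * n_iterations) - 1]
    return lower, upper


if __name__ == "__main__":
    sessions = [1] * 30 + [0] * 70  # made-up data: 30% of sessions had a click
    print(bootstrap_ci(sessions))
```

Stating both `n_iterations` and `resample_size` in the "Methods" section would answer the first bullet directly.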

I'm a bit late to the party, but I've emailed my feedback to @mpopov.

For 1024 vs 20, 1024 isn't incredibly expensive. I mean it does cost more but it doesn't seem prohibitive. I ran load tests with rescore windows up to 4096 and it was all fine. The benefit we would get from a larger rescore window, I suppose, is the ability to pull things up from further down in the retrieval query phase. The theory is that there are plenty of good things down there, and if we allow the LTR to reach down that far it will find them. This gives new information to the DBN, which hopefully builds a feedback loop where the LTR brings in some new result types from further down that it thinks are good, users evaluate them, and then the DBN learns new labels based on those to inform the next model we build. In theory at least.

In "Experimental Groups" the text " collected and sorted to produce the top 1024 shown to the user." should perhaps just say 20? Not sure how to best explain it. We allow people to paginate up to 10k results. For the rescore window of 20, users will transition from the LTR ranking to the traditional BM25 ranking at result number 140. Similarly, for the 1024 model it will transition from LTR-ranked results to BM25 at result number 7168. In general, though, only bots really read past page 2, and a very small number of humans read past page 1 (where each page has, by default, 20 results).
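
The two transition points quoted above are consistent with the rescore window being applied once per shard across 7 shards (140 / 20 = 7168 / 1024 = 7). The shard count is an inference from those numbers, not something stated in this thread, so treat this as a rough sketch of the arithmetic:

```python
RESULTS_PER_PAGE = 20  # default page size mentioned above


def ltr_cutoff(rescore_window, shard_count=7):
    """Result position after which ranking falls back from LTR to BM25.

    shard_count=7 is an assumption inferred from 140 / 20 and 7168 / 1024.
    """
    return rescore_window * shard_count


def last_ltr_page(rescore_window, shard_count=7):
    """Last results page that still contains at least one LTR-ranked result."""
    cutoff = ltr_cutoff(rescore_window, shard_count)
    return -(-cutoff // RESULTS_PER_PAGE)  # ceiling division


if __name__ == "__main__":
    for window in (20, 1024):
        print(window, ltr_cutoff(window), last_ltr_page(window))
```

Under that assumption a window of 20 already covers the first 7 pages, which is far deeper than most humans paginate.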

Still reviewing the rest, but overall looks pretty good.

I was pondering the large difference in ZRR. I know it's not statistically significant, but it still seems odd. I poked a little in the code for the report and didn't see anything about filtering out high volume users. It's not certain that this is the case, but in the past I've seen some high volume users (probably bots) end up in the AB testing, and they can skew metrics (like ZRR) that aren't per session. Perhaps we could either try to filter them out, or look at a per-session ZRR (# of sessions without a result).
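
As a rough illustration of why a per-session ZRR is less sensitive to one noisy client, here is a small sketch. The `(session_id, n_results)` input format is hypothetical, and "session without a result" is read here as a session in which no search returned anything, which is one possible interpretation.

```python
from collections import defaultdict


def zero_result_rates(searches):
    """Return (per-search ZRR, per-session ZRR).

    searches: iterable of (session_id, n_results) tuples, one per search
    (a hypothetical schema, not the actual event logging format).
    """
    searches = list(searches)
    if not searches:
        return 0.0, 0.0
    per_search = sum(1 for _, n in searches if n == 0) / len(searches)

    # Track the best result count seen in each session.
    best_by_session = defaultdict(int)
    for session_id, n_results in searches:
        best_by_session[session_id] = max(best_by_session[session_id], n_results)

    # A session counts as zero-result only if every search in it came back empty.
    empty_sessions = sum(1 for best in best_by_session.values() if best == 0)
    return per_search, empty_sessions / len(best_by_session)


if __name__ == "__main__":
    # Made-up data: one bot-like session with 8 empty searches, two normal sessions.
    data = [("bot", 0)] * 8 + [("a", 5), ("b", 3)]
    print(zero_result_rates(data))
```

In the made-up example the single high volume session dominates the per-search rate but counts only once in the per-session rate.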

Additionally, in my own analysis, where I'm trying to figure out why the dashboard stats didn't change to match the AB test when we deployed the model, I'm seeing the following:

Without collapsing serp events with same query (the click, read, back), and without filtering any sessions:


Filtering out sessions with more than 20 serp+click events, which removes 254 high volume sessions with 25634 events:


A metric like session abandonment might also be more immune to this, if that is what's happening. Session abandonment would be any search session with 0 click-throughs. Potentially we could also look at some minimum dwell time, e.g. all sessions with no dwell time >= 10s or some such.
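
A minimal sketch of the two ideas above, dropping sessions with more than 20 serp+click events and then computing session abandonment. The event format (dicts with `session_id` and `action`) is an assumption made for illustration and is not the actual logging schema.

```python
from collections import Counter


def drop_high_volume_sessions(events, max_events=20):
    """Remove every event belonging to a session with more than max_events
    serp+click events. events: list of dicts with 'session_id' and 'action'
    (hypothetical schema)."""
    counts = Counter(
        e["session_id"] for e in events if e["action"] in ("serp", "click")
    )
    return [e for e in events if counts[e["session_id"]] <= max_events]


def abandonment_rate(events):
    """Share of search sessions (those with at least one serp event) that have
    zero click events."""
    searched = {e["session_id"] for e in events if e["action"] == "serp"}
    clicked = {e["session_id"] for e in events if e["action"] == "click"}
    if not searched:
        return 0.0
    return len(searched - clicked) / len(searched)


if __name__ == "__main__":
    # Made-up data: one bot-like session and two ordinary ones.
    events = [{"session_id": "bot", "action": "serp"}] * 25
    events += [
        {"session_id": "a", "action": "serp"},
        {"session_id": "a", "action": "click"},
        {"session_id": "b", "action": "serp"},
    ]
    filtered = drop_high_volume_sessions(events)
    print(abandonment_rate(events), abandonment_rate(filtered))
```

A dwell-time variant would need a per-click dwell field, which this sketch does not model.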

Bootstrapping finally finished -_- second draft up at

Using a threshold of 20 searches per session, the results are…milder.

debt moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

@EBernhardson and @TJones can you take a look at the latest draft of the test result, when you get a chance?

looks reasonable to me, although the results are a little underwhelming. Need to figure out where the next steps are, either improving the training data or improving the features provided to the training algorithm.

Looks good, @mpopov!

There's also some hope that simple iteration will help—new clicks on stuff the LTR surfaced will improve the model to bring up more new stuff. We'll see how that plays out.

@mpopov can you take a look and see if we can do more for this? or is the machine learning as good as what we're doing now? anything definitive we can do?

Final draft up at

@mpopov can you take a look and see if we can do more for this? or is the machine learning as good as what we're doing now? anything definitive we can do?

The big next step would be to use a relevance prediction model I'm working on in T175048 along with more surveys to then gather more training data to augment the existing click-based training data.

Need to figure out where the next steps are, either improving the training data or improving the features provided to the training algorithm.

But also yeah, maybe we can schedule a meeting to go over existing features and brainstorm some new ones?

@mpopov sure, I'll reserve time during our sprint planning meeting today to chat about next steps.

We had a meeting with the Search Platform and Analysis teams this morning, here's some highlights (from notes taken during the meeting):

  • High-level synopsis: on the backend, from training data, we captured about 20% and it made minor diffs to clickthrough but was better on relevance; people are clicking higher in results, but it's not turning into fewer abandoned searches. It seems that the DBN's opinion is reasonably good. We are looking at data for what's causing SERP abandons, which often look like good results to us.
  • We don't really understand abandonment, maybe they found what they needed, maybe it's bots, we should really try to understand what's happening there
  • Improving features can improve, but might not take us very far, should focus on training data and use relevance surveys to better tune DBN data. If we want to make major improvements, though, we need to provide new results.
  • Abandonment rates might just be like ZRR, which we can't make any better.
  • if we capture other information like referrer we could potentially understand this better.
    • It could be the person is not actually interested in wikipedia (hard to detect), or they find what they want in the snippets and that's all they need (could do a quick AB test, not showing the snippet and see if the clickthrough increases)
  • It would be great if we had some way to classify users as to what they're doing and separate them into groups; people are doing different things and have different goals, so maybe there's a way we can facet on those goals.
  • We don't have a quantification of what good search results even are.
    • People can be satisfied with the results without clicking through.
  • if users come from autocomplete, the autocomplete checkins fire (and all should come through autocomplete, by typing the query)
    • We compute dwell time and scroll on SERP in our auto report
    • the dwell time on the SERP might also be because the user is reading the sister project snippets
    • dashboard that tracks time on SERP
  • we should look for obvious patterns
    • will add mobile vs desktop to data that we're collecting
  • If we do random surveys of general happiness/satisfaction, it would probably work (see T178006)
    • asking about their current search results and their query
    • A new survey would need to ask about the current search, not whether something is relevant based on selected queries
    • If someone says they're happy and they leave, then we know the snippet is what satisfied them
  • will add referrer into the data that is being collected (to figure out how they got to the search results page)
    • We have satisfaction data, will add in an hourly process that joins them together with referrers

Do click-throughs count when the result is opened in another tab/window? I do this quite often when I'm not sure which of a few results is the one I want.

Last night I searched for "Pic (programming language)". The search results gave me "Pic language" as the top result, which I opened in a new tab and then clicked the "you may create the page" link in the original tab and created a redirect.

I doubt it's a statistically significant use of search, but when dealing with redirects it is not uncommon for me (and others, I presume) to specifically look for search results to see what articles are found: does the search engine do a better job than a redirect would? Is there a need for a disambiguation page? I may not load any of the results, but the search was a success.

If I'm crafting a disambig, then I will usually go very deep into the search results loading lots of the pages, not all of which will be relevant. Sometimes I need to load the page to see if it is relevant, sometimes just the title and/or snippet is enough and I can add it without opening the link.

If possible, tracking whether the searcher's next logged action was a page creation at a title that would now be found by the search they just performed could prove insightful.

@Thryduulf thanks for your feedback.
Yes, a click that opens another tab is taken into account, but it's very likely that the system is confused by some user behaviors, and we are trying to understand those (e.g. why do we see so many abandoned search requests where we think the suggested results are appropriate?).

Creating meaningful redirects is one of the most important actions an editor can take to improve search results. For instance, see how a machine learning algorithm prioritizes the various features used for ranking:

[Figure: feat_imp.png]

As you can see, title and redirects ("all near match" being title and redirects squashed together) are among the most important features.

Concerning disambiguation pages, I'm sure they contain very important information we could use in search, but we still don't know how to benefit from them.

Tracking whether a search led to a page creation or a page edit is something we have talked about, but we have not implemented it yet.