[A/B Test] Add sister project search results in a right sidebar (test #2)
Closed, ResolvedPublic

Description

Based on the results from T149806 (see: T156300), this second A/B test for displaying sister project search results in a sidebar will have just one test group (displaying the project results ordered by recall) and one control group. The first test didn't get enough clickthroughs, mostly because of two bugs (T158935 and T158937) and because the zero-results rate for the queries entered on the original four wikis was higher than average, so we decided to add four additional Wikipedias to be tested in this round.

This test is expected to last at least a week and will be run on the following Wikipedias:

  • Persian (tested in T149806)
  • Italian (tested in T149806)
  • Catalan (tested in T149806)
  • Polish (tested in T149806)
  • Arabic
  • French
  • German
  • Russian

Test group users will see:

  • additional search results from sister wikis in a right sidebar
  • each sister-wiki result will display:
    • the top-ranked result from any wiki that contains relevant search results
    • an icon that denotes which wiki the result is from
    • article name of the search result
    • description of the search result
    • typical bolding of the search result term(s)
  • a link below the search result labeled 'more results'
    • this link will open a new browser tab and display a search results page for the original search term on that sister wiki
  • separate section for multimedia results above the other sister wiki results
    • up to 3 images will be displayed that are relevant to the original search term
    • a link that will open a new browser tab and display a multimedia search results page for the original search term on the Wikipedia the user is on
      • for example, if a user searched for 'gutenberg' on English Wikipedia and clicked on the more multimedia link, they would see multimedia search results for 'gutenberg' on English Wikipedia in a new tab.

The order of projects will be based on recall: most to fewest articles returned from each project.

  • results from Commons will always be displayed first
  • Wikispecies will most likely not be included in this test cycle
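The ordering rule above could be sketched like this (a hypothetical helper, not the actual frontend code; project names and hit counts are made up):

```python
def order_sidebar_projects(hit_counts):
    """Order sister projects by recall (most to fewest hits),
    with Commons always pinned to the top of the sidebar.

    hit_counts: dict mapping project name -> number of articles returned.
    """
    commons = [p for p in hit_counts if p == "commons"]
    others = sorted(
        (p for p in hit_counts if p != "commons"),
        key=lambda p: hit_counts[p],
        reverse=True,
    )
    return commons + others

print(order_sidebar_projects(
    {"wiktionary": 12, "commons": 3, "wikiquote": 40, "wikibooks": 0}
))
# only projects that actually return hits would be shown; that
# filtering step is left out of this sketch
```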

Bucket testing logic generally is as follows:

  • 1 in 200 users are included in EventLogging
  • of those users, 1 in 10 are included in the test
  • of the users included in the test:
    • 1/2 will go into a test group, labeled "recall_sidebar_results"
    • the remaining 1/2 will go into a control group, labeled "no_sidebar"
  • the rest of the original 1-in-200 EventLogging sample will get a NULL (either the string 'null' or the MySQL NULL; we can detect both)
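The sampling arithmetic above can be sketched as follows (a minimal illustration, not the actual WikimediaEvents implementation):

```python
import random

def assign_bucket(rng=random):
    """Illustrative bucketing: returns None for users outside the 1:200
    EventLogging sample, otherwise one of the three bucket labels."""
    if rng.randrange(200) != 0:
        return None  # not sampled into EventLogging at all
    if rng.randrange(10) != 0:
        return "null"  # in the EventLogging sample, but not in the test
    # the 1-in-10 test users split evenly between test and control
    return "recall_sidebar_results" if rng.randrange(2) == 0 else "no_sidebar"

# quick simulation to show the resulting proportions
rng = random.Random(7)
counts = {}
for _ in range(200000):
    b = assign_bucket(rng)
    counts[b] = counts.get(b, 0) + 1
print(counts)  # roughly 199000 None, 900 "null", ~50 in each test bucket
```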

EventLogging needs to capture:

  • whether the user clicked on an individual result, and which wiki project that result came from
  • the position of the selected result in the list
  • whether the user clicked the 'more from' link on any wiki project result that was displayed
  • it's important to be able to compare the control group that has sister wiki results with the test group that also has sister wiki results

EventLogging data will be joined against CirrusSearchRequestSet logging to capture:

  • if results were shown and from which wiki projects
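As a rough sketch of that join (the field names `searchSessionId` and `hitsByProject` are illustrative placeholders; the real EventLogging and CirrusSearchRequestSet schemas differ):

```python
# hypothetical EventLogging rows: one per tracked search session
eventlogging = [
    {"searchSessionId": "s1", "bucket": "recall_sidebar_results", "clicked": "wikiquote"},
    {"searchSessionId": "s2", "bucket": "no_sidebar", "clicked": None},
]
# hypothetical CirrusSearchRequestSet rows: which projects returned hits
cirrus_requests = [
    {"searchSessionId": "s1", "hitsByProject": {"commons": 3, "wikiquote": 7}},
    {"searchSessionId": "s2", "hitsByProject": {}},
]

# index the request log by session id, then attach the hit data to each event
by_session = {r["searchSessionId"]: r for r in cirrus_requests}
joined = [
    {**ev, "hitsByProject": by_session[ev["searchSessionId"]]["hitsByProject"]}
    for ev in eventlogging
    if ev["searchSessionId"] in by_session
]
for row in joined:
    print(row["searchSessionId"], row["bucket"], sorted(row["hitsByProject"]))
```

In production this join would run over Hive/MySQL tables rather than in-memory lists, but the shape of the operation is the same.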

Notes to take into account:

  • for those wikis (it, ca) that aren't selected in the bucketing, we'll need to show the existing sister wiki search results.

Sample urls of what this test could look like on the newly added wikipedias to test:

Related Objects

debt created this task. · Mar 8 2017, 11:40 PM
EBernhardson added a subscriber: EBernhardson. · Edited · Mar 14 2017, 4:16 PM

For a one week period (Mar 5, 2017 00:00:00 to Mar 12, 2017 00:00:00) we collected the following number of fulltext search sessions at the default 1:200 sampling:

wiki     sessions
arwiki   590
dewiki   4776
frwiki   2165
ruwiki   2227

@mpopov Ideally how many sessions would we like to record per bucket for this second test? From there I can figure out what the sampling should be set at.

EBernhardson added a comment. · Edited · Mar 14 2017, 4:18 PM

Is this test only on the new wikis, or do we want to also re-run it on the original set of wikis?

The text for this ticket only mentions two buckets, recall_sidebar_results and control. Double checking that we want to drop the random_sidebar_results bucket?

> For a one week period (Mar 5, 2017 00:00:00 to Mar 12, 2017 00:00:00) we collected the following number of fulltext search sessions at the default 1:200 sampling:
>
> wiki     sessions
> arwiki   590
> dewiki   4776
> frwiki   2165
> ruwiki   2227
>
> @mpopov Ideally how many sessions would we like to record per bucket for this second test? From there I can figure out what the sampling should be set at.

I'd like to get ~2k sessions from each wiki. Please and thank you.

> Is this test only on the new wikis, or do we want to also re-run it on the original set of wikis?
>
> The text for this ticket only mentions two buckets, recall_sidebar_results and control. Double checking that we want to drop the random_sidebar_results bucket?

Yep!

Change 342764 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/342764

Change 342764 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/342764

Change 343104 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/343104

Change 343104 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/343104

Per https://lists.wikimedia.org/pipermail/discovery/2017-March/001464.html, I was wondering: does the schema record the user's display size (or approximate size), or some other value from which to infer what was above the fold, to see whether there is some correlation with the likelihood of finding something?

debt added a comment. · Mar 17 2017, 9:43 PM

Hi @Nemo_bis - no, at this time, we're not collecting the user's display size.

If the user is on a mobile device, all the sister project results will display at the bottom of the search results page, as shown here from my iPhone:

debt added a comment. · Apr 24 2017, 3:40 PM

Final analysis is posted, closing this ticket out.

debt closed this task as Resolved. · Apr 24 2017, 3:40 PM
debt claimed this task.

> Final analysis is posted, closing this ticket out.

There are a few things I don't understand.

  • When you measure the zero-results rate for sister projects, is that the baseline for searches which users run on those wikis, or the measure of how many searches performed on Wikipedia had some match on other wikis? The end of page 4 seems to imply the latter, since it's stated that figure 4 is a breakdown of figure 3.
  • Figure 4 would be more interesting if it showed how often the sister projects provide results when Wikipedia doesn't (although even better would be to check title matches only). Otherwise it doesn't tell anything about how much each project contributes to filling the gaps of the other projects.
  • Based on the screenshot in figure 1, it seems that only one interface was being tested, without any of the suggested tweaks of https://lists.wikimedia.org/pipermail/discovery/2017-March/001464.html. Correct?
  • It's not clear to me what the actual control group for the Italian Wikipedia is. Did the control group continue using the traditional cross-wiki search interface?
  • Is the difference in ZRR between the test and control groups in figure 3 significant? If the same behaviour is being measured, then the search queries and result rates should be similar, shouldn't they?
  • Have you tried normalising the rates in figure 4 by wiki (or corpus) size? My suspicion is that, absent any relevance threshold, those numbers just tell us how many words a wiki contains, and hence how likely it is to have some overlapping words in some document even if the document is irrelevant. If there is some relevance threshold, then the story would be different.
    • For instance, the Wikiquote numbers make a lot of sense: the Italian Wikiquote has 40 percentage points more results than the German Wikiquote, for the very good reason that it's one order of magnitude larger.
    • On the other hand, I'm not sure that size can fully explain certain results for Wiktionary, such as a difference of 14 percentage points between the Russian and French Wiktionaries. I doubt that even 15% of the total searches on Wikipedia can be for word definitions, so it feels rather unlikely that the French Wiktionary is able to provide a (relevant) word definition in 14% more of the cases, especially since the sizes are rather comparable (only about twice as large). It's also strange that the Persian Wiktionary has a ZRR similar to that of the Polish and Italian Wiktionaries despite being one order of magnitude smaller (same for the Catalan Wiktionary vs. the German Wiktionary): if we can assume that the search engine analysers are equally efficient on both corpora, that there are no measuring errors, and that the results being measured are equally relevant on average, then the Catalan and Persian Wiktionaries would seem to be more "efficient" or more similar to the Wikipedia content... but this would need a lot of verification.
  • Commons was given the most prominent spot in the search results box, but there is no mention of it that I could see. Is it counted in the number of clicks? How about the zero-results rate? It would also be interesting to see which languages find more results in Commons.
  • I also had some other comments but I forgot them. This is enough for now. :)

Hi @Nemo_bis

Regarding the UI concerns

  • Based on the screenshot in figure 1, it seems that only one interface was being tested, without any of the suggested tweaks of https://lists.wikimedia.org/pipermail/discovery/2017-March/001464.html. Correct?

Yes, this test was conducted using the initial UI. However, we have since revised the UI to address these concerns, visible here:
http://sistersearch.wmflabs.org/w/index.php?search=rainbow&title=Special:Search&profile=default&fulltext=1
The icons have been changed to the project logos, the images have been moved to the bottom of the sidebar, and the results have been made more compact. We're still only showing one result per project because we've gotten a lot of feedback suggesting that some results could be irrelevant or unwanted, so we'd rather stick with the most relevant result for now (this idea could be worth testing though).

  • When you measure the zero-results rate for sister projects, is that the baseline for searches which users run on those wikis, or the measure of how many searches performed on Wikipedia had some match on other wikis? The end of page 4 seems to imply the latter, since it's stated that figure 4 is a breakdown of figure 3.

Yup! It allows us to see how queries tailored to Wikipedia perform on other projects.

The dotted line (if that's what you're referring to when you say "baseline") for each project is calculated from all tracked searches across all languages. For example, when aggregating across all the languages of Wikibooks in our event logging data, Wikibooks' ZRR was ~17%. I'm just clarifying in case it wasn't clear before.

  • Figure 4 would be more interesting if it showed how often the sister projects provide results when Wikipedia doesn't (although even better would be to check title matches only). Otherwise it doesn't tell anything about how much each project contributes to filling the gaps of the other projects.

Not the point of Fig 4 but sure, that would be interesting to see.

  • It's not clear to me what's the actual control group for the Italian Wikipedia. Did the control group continue using the traditional cross-wiki search interface?

It escapes my memory :\ @Jdrewniak @EBernhardson, do you remember?

  • Is the difference in ZRR between test and group in figure 3 significant? If the same behaviour is being measured, then the search queries and result rates should be similar, shouldn't they?

The differences aren't significant, which can be seen by the overlapping confidence intervals.
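The overlapping-intervals check mentioned above can be illustrated with a simple proportion confidence interval (a Wald-interval sketch with made-up counts; the actual analysis may have used a different interval or test):

```python
import math

def wald_ci(successes, n, z=1.96):
    """Approximate 95% Wald confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

# hypothetical zero-results counts for a test and a control bucket
test_lo, test_hi = wald_ci(220, 1000)   # test ZRR: 22%
ctrl_lo, ctrl_hi = wald_ci(200, 1000)   # control ZRR: 20%
overlap = test_lo <= ctrl_hi and ctrl_lo <= test_hi
print(overlap)  # True: the intervals overlap, so at this sample size
                # the difference is not clearly significant
```

(Strictly, overlapping intervals are a conservative heuristic: non-overlap implies a significant difference, but overlap doesn't always rule one out.)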

  • Have you tried normalising the rates in figure 4 by wiki (or corpus) size? My suspicion is that, absent any relevance threshold, those numbers just tell us how many words a wiki contains, and hence how likely it is to have some overlapping words in some document even if the document is irrelevant. If there is some relevance threshold, then the story would be different.

I think it'd be interesting to look at ZRR normalised by size, but I didn't spend a lot of time on ZRR because it's not what we were interested in. I included it mainly as a consistency check and as a way of sanity-checking the clickthrough results. For example, if dewiki's test group's ZRR had been dramatically larger than dewiki's control group's, that might have explained what we saw in the clickthrough section.

  • Commons was given the most prominent spot in the search results box, but there is no mention of it that I could see. Is it counted in the number of clicks? How about the zero-results rate? It would also be interesting to see which languages find more results in Commons.

Since the multimedia results are mixed with the results from Commons (rather than being exclusively from Commons) and the data around Commons results was weird & had issues in general, I omitted the multimedia box from the analysis.