[A/B Test] Add sister project search results in a right sidebar (test #2)
Closed, ResolvedPublic

Description

Based on the results from T149806 (see: T156300), this second A/B test for displaying sister project search results in a sidebar will have just one test group (displaying the project results ordered by recall) and one control group. The first test didn't get enough clickthroughs, mostly because of two bugs (T158935 and T158937) and because the zero-results rate for the queries entered on the original four wikis was higher than average, so we decided to add four additional Wikipedias to be tested in this round.

This test is expected to last at least a week and will be run on the following Wikipedias:

  • Persian (tested in T149806)
  • Italian (tested in T149806)
  • Catalan (tested in T149806)
  • Polish (tested in T149806)
  • Arabic
  • French
  • German
  • Russian

Test group users will see:

  • additional search results from sister wikis in a right sidebar
  • each sister-wiki result will display:
    • the top-ranked result from any wiki that contains relevant search results
    • an icon that denotes which wiki the result is from
    • article name of the search result
    • description of the search result
    • typical bolding of the search result term(s)
  • a link below the search result labeled 'more results'
    • this link will open a new browser tab and display a search results page for the original search term on that sister wiki
  • separate section for multimedia results above the other sister wiki results
    • up to 3 images will be displayed that are relevant to the original search term
    • a link that will open a new browser tab and display a multimedia search results page for the original search term on the Wikipedia the user is on
      • for example, if a user searched for 'gutenberg' on English Wikipedia and clicked on the more multimedia link, they would see multimedia search results for 'gutenberg' on English Wikipedia in a new tab.

The order of projects will be based on recall: most to fewest articles returned from each project.

  • results from Commons will always be displayed first
  • Wikispecies will most likely not be included in this test cycle
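The ordering rule above could be sketched like this (a hypothetical helper, not the actual frontend code; project names and hit counts are made up):

```python
def order_sidebar_projects(hit_counts):
    """Order sister projects by recall (most to fewest hits),
    with Commons always pinned to the top of the sidebar.

    hit_counts: dict mapping project name -> number of articles returned.
    """
    commons = [p for p in hit_counts if p == "commons"]
    others = sorted(
        (p for p in hit_counts if p != "commons"),
        key=lambda p: hit_counts[p],
        reverse=True,
    )
    return commons + others

print(order_sidebar_projects(
    {"wiktionary": 12, "commons": 3, "wikiquote": 40, "wikibooks": 0}
))
# only projects that actually return hits would be shown; that
# filtering step is left out of this sketch
```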

Bucket testing logic generally is as follows:

  • 1 in 200 users are included in EventLogging
  • of those users, 1 in 10 are included in the test
  • of the users included in the test:
    • 1/2 will go into a test group, labeled "recall_sidebar_results"
    • the remaining 1/2 will go into a control group, labeled "no_sidebar"
  • the rest of the original 1-in-200 EventLogging sample will get a NULL (either the string 'null' or the MySQL NULL; we can detect both)
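The sampling arithmetic above can be sketched as follows (a minimal illustration, not the actual WikimediaEvents implementation):

```python
import random

def assign_bucket(rng=random):
    """Illustrative bucketing: returns None for users outside the 1:200
    EventLogging sample, otherwise one of the three bucket labels."""
    if rng.randrange(200) != 0:
        return None  # not sampled into EventLogging at all
    if rng.randrange(10) != 0:
        return "null"  # in the EventLogging sample, but not in the test
    # the 1-in-10 test users split evenly between test and control
    return "recall_sidebar_results" if rng.randrange(2) == 0 else "no_sidebar"

# quick simulation to show the resulting proportions
rng = random.Random(7)
counts = {}
for _ in range(200000):
    b = assign_bucket(rng)
    counts[b] = counts.get(b, 0) + 1
print(counts)  # roughly 199000 None, 900 "null", ~50 in each test bucket
```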

EventLogging needs to capture:

  • whether the user clicked on an individual result, and which wiki project that result came from
  • the position of the selected result in the list
  • whether the user clicked the 'more from' link on any wiki project result that was displayed
  • it's important to be able to compare the control group that has sister wiki results with the test group that also has sister wiki results

EventLogging data will be joined against CirrusSearchRequestSet logging to capture:

  • if results were shown and from which wiki projects
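As a rough sketch of that join (the field names `searchSessionId` and `hitsByProject` are illustrative placeholders; the real EventLogging and CirrusSearchRequestSet schemas differ):

```python
# hypothetical EventLogging rows: one per tracked search session
eventlogging = [
    {"searchSessionId": "s1", "bucket": "recall_sidebar_results", "clicked": "wikiquote"},
    {"searchSessionId": "s2", "bucket": "no_sidebar", "clicked": None},
]
# hypothetical CirrusSearchRequestSet rows: which projects returned hits
cirrus_requests = [
    {"searchSessionId": "s1", "hitsByProject": {"commons": 3, "wikiquote": 7}},
    {"searchSessionId": "s2", "hitsByProject": {}},
]

# index the request log by session id, then attach the hit data to each event
by_session = {r["searchSessionId"]: r for r in cirrus_requests}
joined = [
    {**ev, "hitsByProject": by_session[ev["searchSessionId"]]["hitsByProject"]}
    for ev in eventlogging
    if ev["searchSessionId"] in by_session
]
for row in joined:
    print(row["searchSessionId"], row["bucket"], sorted(row["hitsByProject"]))
```

In production this join would run over Hive/MySQL tables rather than in-memory lists, but the shape of the operation is the same.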

Notes to take into account:

  • for those wikis (it, ca) that aren't selected in the bucketing, we'll need to show the existing sister wiki search results.

Sample urls of what this test could look like on the newly added wikipedias to test:

Related Objects

debt created this task. · Mar 8 2017, 11:40 PM
EBernhardson added a subscriber: EBernhardson. · Edited · Mar 14 2017, 4:16 PM

For a one week period (Mar 5, 2017 00:00:00 to Mar 12, 2017 00:00:00) we collected the following number of fulltext search sessions at the default 1:200 sampling:

wiki     sessions
arwiki   590
dewiki   4776
frwiki   2165
ruwiki   2227

@mpopov Ideally how many sessions would we like to record per bucket for this second test? From there I can figure out what the sampling should be set at.

EBernhardson added a comment. · Edited · Mar 14 2017, 4:18 PM

Is this test only on the new wikis, or do we want to also re-run it on the original set of wikis?

The text for this ticket only mentions two buckets, recall_sidebar_results and control. Double checking that we want to drop the random_sidebar_results bucket?

> For a one week period (Mar 5, 2017 00:00:00 to Mar 12, 2017 00:00:00) we collected the following number of fulltext search sessions at the default 1:200 sampling:
>
> wiki     sessions
> arwiki   590
> dewiki   4776
> frwiki   2165
> ruwiki   2227
>
> @mpopov Ideally how many sessions would we like to record per bucket for this second test? From there I can figure out what the sampling should be set at.

I'd like to get ~2k sessions from each wiki. Please and thank you.

> Is this test only on the new wikis, or do we want to also re-run it on the original set of wikis?
>
> The text for this ticket only mentions two buckets, recall_sidebar_results and control. Double checking that we want to drop the random_sidebar_results bucket?

Yep!

Change 342764 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/342764

Change 342764 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/342764

Change 343104 had a related patch set uploaded (by EBernhardson):
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/343104

Change 343104 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents] Re-enable sistersearch AB test

https://gerrit.wikimedia.org/r/343104

Per https://lists.wikimedia.org/pipermail/discovery/2017-March/001464.html, I was wondering: does the schema record the user's display size (or approximate size), or some other value from which to infer what was above the fold, to see whether there is some correlation with the likelihood of finding something?

debt added a comment. · Mar 17 2017, 9:43 PM

Hi @Nemo_bis - no, at this time, we're not collecting the user's display size.

If the user is on a mobile device, all the sister project results will display at the bottom of the search results page, as shown here from my iPhone:

debt added a comment. · Apr 24 2017, 3:40 PM

Final analysis is posted, closing this ticket out.

debt closed this task as Resolved. · Apr 24 2017, 3:40 PM
debt claimed this task.

> Final analysis is posted, closing this ticket out.

There are a few things I don't understand.

  • When you measure the zero-results rate for sister projects, is that the baseline for searches which users run on those wikis, or the measure of how many searches performed on Wikipedia had some match on other wikis? The end of page 4 seems to imply the latter, since it's stated that figure 4 is a breakdown of figure 3.
  • Figure 4 would be more interesting if it showed how often the sister projects provide results when Wikipedia doesn't (although even better would be to check title matches only). Otherwise it doesn't tell anything about how much each project contributes to filling the gaps of the other projects.
  • Based on the screenshot in figure 1, it seems that only one interface was being tested, without any of the suggested tweaks of https://lists.wikimedia.org/pipermail/discovery/2017-March/001464.html. Correct?
  • It's not clear to me what the actual control group for the Italian Wikipedia is. Did the control group continue using the traditional cross-wiki search interface?
  • Is the difference in ZRR between the test and control groups in figure 3 significant? If the same behaviour is being measured, then the search queries and result rates should be similar, shouldn't they?
  • Have you tried normalising the rates in figure 4 by wiki (or corpus) size? My suspicion is that, absent any relevance threshold, those numbers just tell us how many words a wiki contains, and hence how likely it is to have some overlapping words in some document even if the document is irrelevant. If there is some relevance threshold, then the story would be different.
    • For instance, the Wikiquote numbers make a lot of sense: the Italian Wikiquote has 40 percentage points more results than the German Wikiquote, for the very good reason that it's one order of magnitude larger.
    • On the other hand, I'm not sure that size can fully explain certain results for Wiktionary, such as a difference of 14 percentage points between the Russian and French Wiktionaries. I doubt that even 15% of the total searches on Wikipedia can be for word definitions, so it feels rather unlikely that the French Wiktionary is able to provide a (relevant) word definition in 14% more of the cases, especially since the sizes are rather comparable (only about twice as large). It's also strange that the Persian Wiktionary has a ZRR similar to that of the Polish and Italian Wiktionaries despite being one order of magnitude smaller (same for the Catalan Wiktionary vs. the German Wiktionary): if we can assume that the search engine analysers are equally efficient on both corpora, that there are no measuring errors, and that the results being measured are equally relevant on average, then the Catalan and Persian Wiktionaries would seem to be more "efficient" or more similar to the Wikipedia content... but this would need a lot of verification.
  • Commons was given the most prominent spot in the search results box, but there is no mention of it that I could see. Is it counted in the number of clicks? How about the zero-results rate? It would also be interesting to see which languages find more results in Commons.
  • I also had some other comments but I forgot them. This is enough for now. :)

Hi @Nemo_bis

Regarding the UI concerns

  • Based on the screenshot in figure 1, it seems that only one interface was being tested, without any of the suggested tweaks of https://lists.wikimedia.org/pipermail/discovery/2017-March/001464.html. Correct?

Yes, this test was conducted using the initial UI. However, we have since revised the UI to address these concerns, visible here:
http://sistersearch.wmflabs.org/w/index.php?search=rainbow&title=Special:Search&profile=default&fulltext=1
The icons have been changed to the project logos, the images have been moved to the bottom of the sidebar, and the results have been made more compact. We're still only showing one result per project because we've gotten a lot of feedback suggesting that some results could be irrelevant or unwanted, so we'd rather stick with the most relevant result for now (this idea could be worth testing though).

  • When you measure the zero-results rate for sister projects, is that the baseline for searches which users run on those wikis, or the measure of how many searches performed on Wikipedia had some match on other wikis? The end of page 4 seems to imply the latter, since it's stated that figure 4 is a breakdown of figure 3.

Yup! It allows us to see how queries tailored to Wikipedia perform on other projects.

The dotted line (if that's what you're referring to when you say "baseline") for each project is calculated from all tracked searches across all languages. For example, when aggregating across all the languages of Wikibooks in our event logging data, Wikibooks' ZRR was ~17%. I'm just clarifying in case it wasn't clear before.

  • Figure 4 would be more interesting if it showed how often the sister projects provide results when Wikipedia doesn't (although even better would be to check title matches only). Otherwise it doesn't tell anything about how much each project contributes to filling the gaps of the other projects.

Not the point of Fig 4 but sure, that would be interesting to see.

  • It's not clear to me what's the actual control group for the Italian Wikipedia. Did the control group continue using the traditional cross-wiki search interface?

It escapes my memory :\ @Jdrewniak @EBernhardson, do you remember?

  • Is the difference in ZRR between test and group in figure 3 significant? If the same behaviour is being measured, then the search queries and result rates should be similar, shouldn't they?

The differences aren't significant, which can be seen by the overlapping confidence intervals.
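The overlapping-intervals check mentioned above can be illustrated with a simple proportion confidence interval (a Wald-interval sketch with made-up counts; the actual analysis may have used a different interval or test):

```python
import math

def wald_ci(successes, n, z=1.96):
    """Approximate 95% Wald confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

# hypothetical zero-results counts for a test and a control bucket
test_lo, test_hi = wald_ci(220, 1000)   # test ZRR: 22%
ctrl_lo, ctrl_hi = wald_ci(200, 1000)   # control ZRR: 20%
overlap = test_lo <= ctrl_hi and ctrl_lo <= test_hi
print(overlap)  # True: the intervals overlap, so at this sample size
                # the difference is not clearly significant
```

(Strictly, overlapping intervals are a conservative heuristic: non-overlap implies a significant difference, but overlap doesn't always rule one out.)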

  • Have you tried normalising the rates in figure 4 by wiki (or corpus) size? My suspicion is that, absent any relevance threshold, those numbers just tell us how many words a wiki contains, and hence how likely it is to have some overlapping words in some document even if the document is irrelevant. If there is some relevance threshold, then the story would be different.

I think it'd be interesting to look at ZRR normalised by size, but I didn't spend a lot of time on ZRR because it's not what we were interested in. I included it mainly as a consistency check and as a way of sanity-checking the clickthrough results. For example, if dewiki's test group's ZRR had been dramatically larger than dewiki's control group's, that might have explained what we saw in the clickthrough section.

  • Commons was given the most prominent spot in the search results box, but there is no mention of it that I could see. Is it counted in the number of clicks? How about the zero-results rate? It would also be interesting to see which languages find more results in Commons.

Since the multimedia results are mixed with the results from Commons (rather than being exclusively from Commons) and the data around Commons results was weird & had issues in general, I omitted the multimedia box from the analysis.