Search Relevance Survey test #3: action items
Open, Normal, Public

Description

Based on a meeting we held on Aug 23, 2017, we came up with various action items for running a third test:

  • Trey: review Discernatron queries for quality (https://phabricator.wikimedia.org/P5909)
  • Erik: develop backend infrastructure to support lots of queries and lots of results per query (T174387)
  • Erik: run the A/B test
  • Mikhail: do analysis of A/B test
  • Erik: extract Discernatron judgements as comparison data

(Dropped the items to add and evaluate extra non-Discernatron results as just adding extra work.)

debt created this task. Aug 24 2017, 9:37 PM

These are keepers:

'Lutjanus synagris'
7th day adventist
10-5 diameter
1978 scotland squad
2016 isl
3000 BCE
Adam Rapp
Affirmative Action (song)
anganwadi workers
Annihilation
antibacterial hand rub
ashes of the singularity
Avro Lancaster operators
balistic missile defense
bantu
battle of vallahar
beauty and the beast cost
brigthon pier
bromosporine
Cage Fights Unleashed
Caravajal
controls
corona rhythm of the night
crestron
cruikshank german derogatory meaning
de provence
Defunk
dont cry guns n roses
eczistence movie
Ellen pompoe
elton john your song
Endartertis obliterans
environmentalism
esperanza spaldin
eygyptian cuisine
fashion motifs
fees structure
focal reducer
fort worth film
frankfort school
german revolution
golf career grand slam
gordon hayward
Gotham tv
Grand Bohemian Hotel Orlando, Autograph Collection
grand slam 3
guadeloupe dance
gumboot chitons natural prediter
half adder truth table
homosexuality in the united states
how were the chinese treated on the goldfields
hydrostone halifax nova scotia
induction cooking
Information Asymmetries
Jet Star III
JFK
kyle schwaber
kyrgyzstan 2010 riots
lamppost lane
latin dative
List of hospitals in Andorra
list of the bridge 2013 episodes
llajic
lurlene waallace
Lutheranism
malassya airlines
malcom mclauren
malsyia airlines
marilyn monroe
Mary I Fergusson
masters field 2016
mercedes truck w673
midazolam dose
n64 game
naval flags
north heights junior high
number of teams participating in the Africa Women Cup of Nations
oremo
pasolini salo
Paul Simon & Art Garfunkel Overs [Live]
phasers on stun
Picture resolution ppp
pinecorn fish
polypodium species
public sign
quantitative techniques
Quiz Magic Academy
rachel banham
ramoones
randeep
red seaweed
ride bicycles
Ritchie Valens - La Bamba
saab episodes
sanilac county, new york
saving provet ryan
schools are prisons
semper fight
sour creeam ch
ss
SSB modulation
Ssuper BowL 5
starksy and hutch devil
stem vegetable
The bitter end
the family tv shows
The Gestapo's Last Orgy (1977)
the great beer flood
the killing fields
The Proposal Don DeLilo
the rythm of the night
the tale of gengi
tommorrowland 2016
trevor leee
true detecive
uncle toms
united broadband
united states v. rands
wedding day in funeraville
wetzel county wv
what does the chancellor of  the Exchequer keep in his red box
who played the older dotti hilton in the movies in the movie a leage of their own
wii u pro
workhouse coinage
zambiasi
Zondag News

"Babe" movie released on DVD
Death of a Red Heroine by Qiu Xialong
enzymes concentration
orders of IRDA about surrender
Puri temple
rape of nanking
Trust (property)

These are also keepers, but I think we should edit them in ways that don't change the results, but make them more accessible to survey respondents. I've added the edited version to the end of the list above.

Original query | Edited query
, enzymes concentration | enzymes concentration
\"Babe\" movie released on DVD | "Babe" movie released on DVD
Death of a Red Heroine by Qiu Xialong: | Death of a Red Heroine by Qiu Xialong
ORDERS OF IRDA ABOUT SURRENDER | orders of IRDA about surrender
PURI TEMPLE | Puri temple
rape of nanking] | rape of nanking
Trust (property | Trust (property)

These I would vote to drop, but could be convinced otherwise:

Query | Reason to drop
298005 b.c. | very specific year in an unspecific time, not the name of anything I could find
antonio parent | if "parent" is correct, it's too non-specific; otherwise, it's an unrecoverable typo
brook valley division monroe north carolina carolina | very specific locality
LAW IN LONDON | seems to refer to a specific program run by Syracuse University
Leix\u00f5es S.C. | that's an õ in the middle; with it it's a different query, without it it's very confusing
what if lamba is lower than required | hard to understand without context

These either don't seem encyclopedic, are very non-specific, are nonsensical, or would be too hard for survey takers to understand ("tayps of wlding difats")

aa rochester ny meetings
Antiochus Epiphanes was another: after his detention at Rome and attempts
compare roads built Marcos and Aquino
examples of confidential email message
highliting text and calculations in word documents
how would you test for starch?explain your answer
Pets parallelism
red room tor sites
SANTA CLAUS PRINT OUT 3D PAPERTOYS
tayps of wlding difats
top out at
treewards
Unblock.request reason isis
when was it found

If there are no objections, we can use the top list for the rest of the project.

TJones updated the task description. Aug 28 2017, 8:10 PM

generate list of results from existing queries (say, top 50 each?) {50 results times ~100 queries gives us ~5000 articles gathering data, up from 50}

Erik suggested just using the current documents from Discernatron. For the purpose of this exercise, that's reasonable, in that we'd have Discernatron data on all of them. I'd previously suggested that a few new results would mean we could compare any new standouts to the old data, but maybe that isn't worth the complexity for this round.

optionally manually review “stand out” results {i.e., results with good scores that do not exist in the Discernatron data}

Same here. If we don't have any new data, there's nothing here to do.

I retract the suggestion for new results and say just run with the current docs. It also solves the problem of getting new DYM results for the ones that get zero results, etc. The data we have is self-consistent and good for this round of the experiment.

Any objections?

These I would vote to drop, but could be convinced otherwise:

These either don't seem encyclopedic, are very non-specific, are nonsensical, or would be too hard for survey takers to understand ("tayps of wlding difats")

aa rochester ny meetings
Antiochus Epiphanes was another: after his detention at Rome and attempts

This is certainly encyclopedic, but it's also very hard to decide on an intent. It looks like an exact quote from the book SPQR: A History of Ancient Rome and is safe to remove.

compare roads built Marcos and Aquino

Arguably this is trying to compare roads built by two Philippine administrations, that of Ferdinand Marcos (dictator '65 - '86) and Corazon Aquino (president '86 - '92). Even so, this query is probably much too specific to have good results anywhere.

examples of confidential email message
highliting text and calculations in word documents
how would you test for starch?explain your answer
Pets parallelism

I still want to know what this was about ... never figured it out.

red room tor sites
SANTA CLAUS PRINT OUT 3D PAPERTOYS
tayps of wlding difats

Amazing that Google can turn this into 'types of welding defects'. Without hints, most humans would probably never figure that out.

top out at
treewards
Unblock.request reason isis
when was it found

If there are no objections, we can use the top list for the rest of the project.

Random comments above; not really disagreeing, just random thoughts. I think we are good to use the top list moving forward.

TJones added a comment. Edited · Aug 28 2017, 8:52 PM

Antiochus Epiphanes was another: after his detention at Rome and attempts

I dunno. What's the goal of searching for a sentence fragment? Maybe on Wikisource it would make sense.

compare roads built Marcos and Aquino

I guess I tend to want to drop things that look like homework questions (not attempts to find the answers, but the actual questions).

Pets parallelism

I always thought it was PETSc, which is a PDE-solving GPU thing or something, but, yeah, it's not clear.

tayps of wlding difats

I can almost see it, with aggressive, English-specific phonetic matching—it's not like there's anything else it's going to match! (English spelling is so ridiculous.)

But, yeah, figuring out this stuff can be endlessly fascinating.

One potential complication: how long do we want to run the test? That depends on how many impressions we need for each page, and on how many pages we want to wait for to reach that many impressions.

Some numbers:
query + article combinations: 4799
unique articles: 4745

score | articles | >100 views/wk | >500 views/wk | >1k views/wk
==0 | 1482 | 1213 | 881 | 686
>0 | 3317 | 2516 | 1707 | 1293
>1 | 1609 | 1228 | 850 | 651
>2 | 488 | 369 | 268 | 205
==3 | 210 | 156 | 123 | 97

Or if you like pictures, here are cumulative histograms of the expected # of impressions and a guess at the number of responses (based on a response rate between 22% and 30% depending on the score):


(I have no clue what the right graph to represent this is ... but it's something :P)

So, do we stay with 1k impressions per week, and do we accept that some will have less data, or do we run longer, or ?

Change 374655 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Configure CirrusSearch human relevance survey

https://gerrit.wikimedia.org/r/374655

TJones updated the task description. Aug 30 2017, 2:19 PM

I trust @mpopov to be more precise in estimating desired sample size, but I'll give it a go.

100 views per week with 50% sampling and 20% response rate is 10 judgements/week. The confidence interval of a proportion maxes out when the proportion is 50%, so we can use that as an upper bound on the size of our error bars*. I used this calculator and took the continuity corrected numbers, rounded to 1%.

[* N.B. Those are the error bars on the proportions, not on any calculation we do with the proportions, but it's an easy-to-calculate metric of how accurate our input data is for Mikhail's classification models.]

# weeks | sample size | max 95% C.I.
1 | 10 | ±30%
2 | 20 | ±22%
3 | 30 | ±18%
4 | 40 | ±16%
6 | 60 | ±13%
8 | 80 | ±11%
12 | 120 | ±9%
16 | 160 | ±8%
20 | 200 | ±7%
200 | 2000 | ±2%

So... 4 years?

Seriously, though, I'd like to see that confidence interval on the proportions below 10%. If we go for 100% sampling that does leave us open to people gaming the system, since anyone could go to one of the pages and be sure to get a survey, but it gives us twice as much data.

Assuming 100% sampling, a slightly higher response rate of 22%, and non-pessimal proportions (say 40% / 60%), then in 4 weeks we'd get 88 judgements, and a confidence interval of ±11% (for proportions of 40% or 60%). That's not ideal, but seems bearable.
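
The error bars in the table above, and the ±11% estimate for the 88-judgement scenario, can be reproduced with a short calculation. The sketch below assumes the calculator computed a Wilson score interval with continuity correction; that is an assumption, not something stated in this thread, though it does reproduce the tabulated values at a proportion of 50%. The function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_cc_interval(n, p, confidence=0.95):
    """Wilson score interval with continuity correction.

    n is the number of judgements, p the observed proportion (0..1).
    Assumed to be the method behind the calculator mentioned above.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    z2 = z * z
    denom = 2 * (n + z2)
    lower = (2 * n * p + z2 - 1
             - z * sqrt(z2 - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))) / denom
    upper = (2 * n * p + z2 + 1
             + z * sqrt(z2 + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))) / denom
    return max(0.0, lower), min(1.0, upper)


def max_error_bar(n, p=0.5):
    """Larger of the two distances from p to the interval bounds."""
    lower, upper = wilson_cc_interval(n, p)
    return max(p - lower, upper - p)


def judgements(views_per_week, sampling_rate, response_rate, weeks):
    """Expected number of survey responses for one page."""
    return round(views_per_week * sampling_rate * response_rate * weeks)


if __name__ == "__main__":
    # 100 views/week, 50% sampling, 20% response rate, worst case p = 50%
    for weeks in (1, 2, 3, 4, 6, 8, 12, 16, 20, 200):
        n = judgements(100, 0.5, 0.2, weeks)
        print(f"{weeks:>3} weeks  n={n:<5} ±{max_error_bar(n):.0%}")

    # 100% sampling, 22% response rate, 4 weeks, proportions of 40% / 60%
    n = judgements(100, 1.0, 0.22, 4)
    print(f"n={n}  ±{max_error_bar(n, 0.4):.0%}")
```

The interval is not symmetric when the proportion is not 50%, so `max_error_bar` reports the wider side.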

We could try to do something a bit more Discernatron-like and actively engage volunteers and route them to specific pages where we need judgements—"Please read over one of the following pages that interests you. After 60 seconds, a question will pop up. Please answer it as best you can." Maybe for a future iteration...

debt updated the task description. Aug 30 2017, 9:11 PM

Yeah, we probably don't want to run this test for 4 years. ;)

But mentioning 4 years makes 4 weeks (or 6) seem less crazy.

Here is the source data after some preprocessing for the survey. It contains a map from article id to an object containing the article title and the discernatron scores for the queries asked on that page: P5957

It was created by starting with api output from https://discernatron.wmflabs.org/scores/all?json=1 and processing it with P5958 to figure out article ids and page view counts for deciding on sample rates.

TJones added a comment. Edited · Sep 5 2017, 4:58 PM

Interesting when two seemingly unrelated queries have overlapping results, even if the results aren't always relevant. Surprised it happened so often in such a small set. Nice work!

Is "extract Discernatron judgements as comparison data" done now?

@mpopov Should I trim down the list of questions asked for this next test, or keep all 4 variations? Or maybe remove some variations and add some new ones?

Change 374655 merged by 20after4:
[operations/mediawiki-config@master] Configure CirrusSearch human relevance survey

https://gerrit.wikimedia.org/r/374655

Mentioned in SAL (#wikimedia-operations) [2017-09-06T23:10:12Z] <twentyafterfour@tin> Synchronized wmf-config/: deploy config for CirrusSearch human relevancy survey. Change-Id: I272c69e5a3bb6e833fca59282142d6b237fd9e60 Bug: T174106 (duration: 00m 52s)

mpopov added a comment. Edited · Sep 7 2017, 4:48 PM

@mpopov Should I trim down the list of questions asked for this next test, or keep all 4 variations? Or maybe remove some variations and add some new ones?

Given Dario's remarks about phrasing from the discussion, I'd like to try (if possible):

  • Would you click on this page when searching for ‘…’?
  • Would someone click on this page when searching for ‘…’?
  • If you searched for ‘…’, would this article be a good result?
  • If someone searched for ‘…’, would this article be a good result?
  • If you searched for ‘…’, would this article be relevant?
  • If someone searched for ‘…’, would this article be relevant?
  • If you searched for ‘…’, would you want to read this article?
  • If someone searched for ‘…’, would they want to read this article?

It will be interesting to see the difference in each pair. Specifically, I have a hypothesis that the "someone" variations will yield better results than "you" ones.

Too many variations might be very difficult given our long tail of articles with low numbers of weekly page views: either they wouldn't get enough data for us to be certain, or we would have to run the test even longer than currently suggested.

mpopov added a comment. Sep 7 2017, 5:13 PM

How about…

  • 1a: Would you click on this page when searching for ‘…’?
  • 1b: Would someone click on this page when searching for ‘…’?
  • 3a: If you searched for ‘…’, would this article be relevant?
  • 3b: If someone searched for ‘…’, would this article be relevant?
  • 4: If someone searched for ‘…’, would they want to read this article?

Questions 1a and 3a were the top questions in tests 1 and 2 (respectively): their responses yielded the models with the highest accuracies.

I've just stumbled across this and can't find anywhere to leave feedback about it - please point me to somewhere else if you don't want it here.

I was reading the "White Ensign" article on en.wp, arriving via the "White ensign" redirect having navigated directly from the URL bar. When I scrolled down to the "History" section I got the survey question popup asking question 4 "If someone searched for 'Naval flags' would they want to read this article?" with the options "Yes", "No" and "Don't know" and I don't know how to answer as my initial thought is "yes and no". If I searched for "Naval flags" I'd possibly be interested in reading this article, but it's not what I'd want to be taken to directly - "Maritime flags" is likely the best article and if it won't disrupt your test I will probably create a redirect to it. That said, it is more obviously relevant than some of the current search results for the term ("Flags of the English Interregnum" particularly) and is the sort of thing I'd expect to find in search results for "naval flags" given the articles we appear to have and don't have (when I get time, I will research whether a "Lists of naval flags" list of lists will be worth creating).

TonyBallioni added a subscriber: TonyBallioni. Edited · Sep 16 2017, 6:42 PM

I'm normally on team "give the WMF a bunch of room to improve stuff". This is really confusing though. I was recently at https://en.wikipedia.org/wiki/Maxwell_House_Haggadah and received a question as to whether I would expect to find this as a search result for "schools and prisons" (or it might have been prisons and schools).

My initial response was "WTF Wikipedia. No I don't expect to find the liturgical book for the Passover Seder when I search for schools and prisons". That is still my response, and I'm a pretty active editor. You could arguably find some people who would also be offended by the connection with their cultural heritage and those search terms.

At best this is confusing for editors or readers who are new to Wikipedia (it was confusing to me as an active Wikipedian). At worst questions could be misinterpreted as being insulting depending on how they randomly appear on pages.

I don't know if there is anything that can actually be done about this, but I just wanted to let people know that it is confusing and has the potential to accidentally turn people off.

Edit: this happened again and I can confirm it was schools and prisons

This feature needs an opt-out button, i.e. "Do not show me these pop-up messages again".

We will hopefully have a blog post coming out soon helping to explain the motivation behind the survey in a more accessible manner, including the apparent mismatch between some of the queries and the articles—but that is aimed at folks who may be less technically inclined than the average person with a phabricator account.

In short, we need both good matches and bad matches in our training data if we are going to learn. We sometimes also include "high recall" articles in an attempt to see if we're missing something that's not in the top few results, because most people don't look past the first page of results (and usually not much past the first three).

@Thryduulf, here is a fine place for feedback. You are much more motivated and participatory than our average expected survey recipient—which is wonderful! We weren't expecting many people to offer good results and other helpful feedback, so thanks for that. We actually need to identify both good and bad results for training the machine learning, and we're hoping the "wisdom of the crowds" will allow us to rank the articles; the distinctions between awesome, good, meh, bad, and WTF are all useful.

The query naval flags was one of the most contentious queries when we ran it as part of the Discernatron; there was a lot of disagreement about the best result. Since we are looking at lots of results for each query, adding a redirect to a good result wouldn't mess with our survey or our statistics. We might pick it up in a future round, which would be a good thing.

@TonyBallioni, there is always the possibility of people inferring an offensive intent from semi-random automated processing. In this case, this looks like a "high recall" result—where we dig deep into the list of matching documents and bring up more or less random additional results—again, because we need both good and bad results for machine learning training data. We expect occasional "WTF" results, but we don't intend them to be offensive. From an information retrieval point of view, the match is understandable, since the words "homes, schools, senior centers, prisons" appear in the lead paragraph for the article—so the search terms are there, they are close to each other, and they are in the lead paragraph. This is the kind of bad result we need for the machine learning process. It would, of course, decrease the relative value of having all the terms close together in the lead paragraph since it's a poor result otherwise.

There's no way to gather this kind of data without the possibility of these randomly unpleasant juxtapositions occurring. We would try to filter out inherently offensive queries, but it's a manual job and we can't also filter possibly offensive query-article pairs like this. Can we mitigate the problem by adding a link to an explanation of the purpose of the survey (in addition to the link to the privacy policy we already have)? Hopefully we could adequately explain the intent of the survey and the semi-randomness of the results in a way that would explain that certainly no offense was intended.

@Jonesey95, an opt-out button is probably a good idea. I don't know the technical details off the top of my head, but I think this might be easier for logged in users than not-logged-in users for privacy reasons, but I'm guessing you would suggest it for not-logged-in users, too.

I believe this round of the survey, which is limited in scope, only has a few days left. We'll be reviewing the quality of the data we gathered, looking at technical glitches (including getting the same survey multiple times, which shouldn't be happening), and thinking about enhancements (like an opt-out button and a "more info" link).

Thanks for taking the time to provide feedback; it is much appreciated!

@TJones Thanks for the response, the situation now is:
  • Naval flag is a disambiguation page listing Maritime flag, Naval ensign and Naval jack with a see also to the new Lists of naval flags (it was previously a redirect to Maritime flag).
  • Naval flags redirects to the Naval flag disambiguation page.
  • Lists of naval flags is a new list of lists (anyone with knowledge of the topic is encouraged to expand this!)
  • List of naval flags redirects to the list of lists.

Change 378950 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Javascript timestamps are in ms, not s

https://gerrit.wikimedia.org/r/378950

debt added a subscriber: TheDJ. Sep 19 2017, 6:31 PM

@TheDJ and @EBernhardson found and fixed a bug (asked about here):

Javascript timestamps are in ms, not s
The timeout was intended to be set to 2 days, but due to a mixup between millis and seconds it was only set to about 3 minutes.
Additionally add an early exit for navigator.doNotTrack, as event logging won't send the events anyways.
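
The arithmetic behind the bug is easy to check. The sketch below uses Python purely for the arithmetic (the actual fix is in the JavaScript of WikimediaEvents), and its names are hypothetical rather than taken from the patch. It assumes the timeout was written as a number of seconds and then compared against JavaScript timestamps, which count milliseconds.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MS_PER_SECOND = 1000

# Intended timeout: 2 days, expressed in seconds.
TWO_DAYS_IN_SECONDS = 2 * SECONDS_PER_DAY  # 172800


def effective_minutes_if_read_as_ms(seconds_value):
    """How long a timeout really lasts when a seconds value is treated as milliseconds."""
    return seconds_value / MS_PER_SECOND / 60


def to_milliseconds(seconds_value):
    """The conversion needed before comparing against millisecond timestamps."""
    return seconds_value * MS_PER_SECOND


if __name__ == "__main__":
    # 172800 interpreted as milliseconds is 172.8 s, a little under 3 minutes.
    print(effective_minutes_if_read_as_ms(TWO_DAYS_IN_SECONDS))
    # The value that actually represents 2 days in milliseconds.
    print(to_milliseconds(TWO_DAYS_IN_SECONDS))
```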

debt updated the task description. Sep 19 2017, 6:34 PM

The blog post that gives a high-level and more organized overview of the survey purpose and goals is up now.

I've also copied the key points from the discussion here to the EPIC-level task T171740#3619075. Please comment if I missed anything.

@Thryduulf —thanks for all the edits! At least one good thing definitely came out of this experiment!

Change 378950 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Javascript timestamps are in ms, not s

https://gerrit.wikimedia.org/r/378950

Change 379118 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.18] Javascript timestamps are in ms, not s

https://gerrit.wikimedia.org/r/379118

Change 379119 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.19] Javascript timestamps are in ms, not s

https://gerrit.wikimedia.org/r/379119

Change 379118 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.18] Javascript timestamps are in ms, not s

https://gerrit.wikimedia.org/r/379118

Change 379119 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.19] Javascript timestamps are in ms, not s

https://gerrit.wikimedia.org/r/379119

Mentioned in SAL (#wikimedia-operations) [2017-09-19T23:45:17Z] <thcipriani@tin> Synchronized php-1.30.0-wmf.19/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: SWAT: [[gerrit:379119|Javascript timestamps are in ms, not s]] T174106 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2017-09-19T23:46:42Z] <thcipriani@tin> Synchronized php-1.30.0-wmf.18/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: SWAT: [[gerrit:379118|Javascript timestamps are in ms, not s]] T174106 (duration: 00m 48s)