[Epic] Search Relevance: graded by humans
Open, Normal, Public

Description

We want to run a series of tests that request feedback from users viewing article pages that appear in the results for a curated list of queries. The tests will run on English Wikipedia and are expected to get approximately 1,000 impressions per week, using the 'mediawiki.notification' function.

For our MVP (minimum viable product) test, we will hard-code a list of queries and articles in the JavaScript code. This will allow a small-scale evaluation to see whether this type of data is useful or whether we just receive a bunch of 'noise' from the test.

Some of the initial hardcoded queries are (for MVP only):

  • 'sailor soldier tinker spy'
  • 'what is a genius iq?'
  • 'who is v for vendetta?'
  • 'why is a baby goat a kid?'

The test will contain:

  • a link to the privacy policy: https://wikimediafoundation.org/wiki/Privacy_policy
  • 3 selector buttons that a user can choose: yes, no, I don't know
  • ability for the user to dismiss the notification/question
  • ability for the user to scroll without the notification/question box impeding reading of the article
  • an auto-timeout to dismiss the notification/question box automatically
  • the ability to only select one option before the notification/question box is dismissed

We will track:

  • what option the user selected
  • if the notification/question box gets dismissed by the user
  • if the notification/question box gets dismissed automatically (its session timed out without any interaction from the user)
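
A minimal sketch of the behaviour and tracking listed above, assuming a simple in-memory model rather than the actual notification code. The class name `SurveyBox`, its method names, and the outcome labels "dismissed" and "timeout" are hypothetical; "unsure" stands in for "I don't know". The point is that exactly one outcome is recorded per box: the first interaction wins and anything after it is ignored.

```python
from dataclasses import dataclass
from typing import Optional

# "unsure" stands in for the "I don't know" button (label is an assumption here).
CHOICES = ("yes", "no", "unsure")


@dataclass
class SurveyBox:
    """Tracks the single outcome of one notification/question box (hypothetical model)."""

    outcome: Optional[str] = None  # None while the box is still open

    def _close(self, outcome: str) -> bool:
        # Only the first interaction counts; later ones are ignored.
        if self.outcome is not None:
            return False
        self.outcome = outcome
        return True

    def choose(self, choice: str) -> bool:
        """Record a button press; returns False if the box was already closed."""
        if choice not in CHOICES:
            raise ValueError(f"unknown choice: {choice!r}")
        return self._close(choice)

    def dismiss(self) -> bool:
        """Record that the user dismissed the box."""
        return self._close("dismissed")

    def timeout(self) -> bool:
        """Record that the box timed out with no interaction."""
        return self._close("timeout")
```

With this model, the three tracked cases (option selected, dismissed by the user, dismissed automatically) all end up in the same `outcome` field, which keeps later aggregation simple.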

Additional test options to consider:

  • should we embed the desired queries into cached page render
  • should we use a graphic (smiley face, frowny face, unsure face) instead of the yes/no/not sure text
  • should we test on other language wikis
    • would require translating all text

First draft of notification/question box:


Sample smiley face option that could be used in a future test to avoid 'wall of text':

https://gerrit.wikimedia.org/r/#/c/366318

debt created this task. Jul 26 2017, 1:35 PM

Change 366318 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] MVP of human graded search relevance on article pages

https://gerrit.wikimedia.org/r/366318

Change 366318 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] MVP of human graded search relevance on article pages

https://gerrit.wikimedia.org/r/366318

I think a big part of this is going to be figuring out if we can translate the survey responses into grades that correlate with expert judgements (provided here by @TJones). I realize now those were never posted; they were just part of an email. Adding now:

who is v for vendetta?

  • ok: V for Vendetta (film)
  • ok: V for Vendetta
  • good: List of V for Vendetta characters
  • best: V (comics)
  • bad: Vendetta Pro Wrestling

star and stripes

  • ok: Stars and Stripes Forever (disambiguation)
  • bad: The White Stripes
  • bad: Tars and Stripes
  • ok: The Stars and Stripes Forever
  • bad: Stripes (film)

Typo for “stars and stripes”. DYM gives the right spelling.

block buster

  • best: Blockbuster
  • good: Block Buster!
  • bad: The Sweet (album)
  • ok: Block Busters
  • bad: Buster Keaton

how do flowers bloom?

  • bad: Britain in Bloom
  • v.bad: Flowers in the Attic (1987 film)
  • best: Flower
  • v.bad: Thymaridas
  • v.bad: Flowers in the Attic

search engine

  • best: Web search engine
  • good: List of search engines
  • ok: Search engine optimization
  • ok: Search engine marketing
  • ok: Audio search engine

I think “web search engine” is only ok, but it has a redirect from “search engine” so the community thinks it is the best.

yesterday beetles

  • v.bad: Private language argument
  • v.bad: Diss (music)
  • v.bad: How Do You Sleep? (John Lennon song)
  • v.bad: Maria Mitchell Association
  • v.bad: The Collected Stories of Philip K. Dick

Typo for “yesterday beatles”—there are no good results here. (DYM gives the right spelling.)

sailor soldier tinker spy

  • best: Tinker Tailor Soldier Spy
  • good: Tinker, Tailor
  • ok: Blanket of Secrecy
  • ok: List of fictional double agents
  • ok: Ian Bannen

Typo for “Tinker Tailor Soldier Spy”.

10 items or fewer

  • ok: Fewer vs. less
  • v.bad: 10-foot user interface
  • v.bad: Magic item (Dungeons & Dragons)
  • v.bad: Item-item collaborative filtering
  • v.bad: Item 47

No real good result here. “10 items of less” might be of interest, but it’s not going to show up.

why is a baby goat a kid?

  • best: Goat
  • v.bad: Super Why!
  • v.bad: Barney & Friends (redirect from Barney Is A Dinosaur)
  • v.bad: The Kids from Room 402
  • v.bad: Oliver Hardy filmography

what is a genius iq?

  • good: Genius
  • best: IQ classification
  • bad: Genius (website)
  • ok: High IQ society
  • bad: Social IQ score of bacteria

@mpopov—the graphs look good. As mentioned on IRC, percentages or some other normalization would be helpful in figuring out the best response rates among the question formats and comparing yes/no/etc. rates among answers.

By eye, it looks like "would they want to read this article" gets slightly more engagement, and "would this article be relevant" and "would you click on this page" get slightly less, but I wouldn't be surprised if they were all statistically indistinguishable. I wonder if the question format has any effect on yes/no ratios, too. There may not be enough data to tell, though.

@EBernhardson, thanks for posting my "expert judgements"—ha! They are at least "somewhat considered judgements".

> No real good result here. “10 items of less” might be of interest, but it’s not going to show up.

Thanks also for kindly reproducing my original typo here. "10 items or less" is what it should be. ;)

Erik pointed out that people don't like Ian Bannen (actor in the 1970s version of Tinker Tailor Soldier Spy) very much, but if you go by a simple ratio of yes/no votes, he still comes in 3rd, which is reasonable. (Ha! I just got the survey while looking at his page. It seemed only fair to dismiss it, though I wanted to vote yes.)

I think the results are promising. In places where the wisdom of the crowd disagrees with me, I think the results are understandable. For example, yesterday beetles gets all horrible results. But the least horrible is a different John Lennon song. That is at least tangentially related—it's a bad result, but it is also the best result.

I also wonder if the timeout proportion is a useful signal, or even a lack of responses (that points to a lack of popularity for the results page, at least). Seems possible, but it's not immediately clear how to use them.

@mpopov: two questions—an easy one and a hard one.

  • Easy: how are "I don't know" votes counted; is that the same as "dismissed"? It might be useful to distinguish them, since "dismissed" seems to mean "I don't want to help" and "I don't know" seems to mean "I'd like to help but can't". The fact that a judgement is difficult might be meaningful.
  • Harder: Do you think we'll be able to do reasonable classification of the results (best/good/ok/bad/v.bad, or even just best/good/bad), or will we be limited to ranked order? Ranked order is still useful, though I think categorized would be better. (I'm also curious what you think the right model would be for doing the categorization. I find it very interesting...)

I'm looking forward to the results of the 60s vs 60ms delay in showing the survey. If there's a marked improvement in quality and a marked decrease in engagement, we should try again with a 30s delay to see if we get a better balance.

Possibly related to the short delay, I see what might be a familiarity bias: people probably know Buster Keaton is not really related to "block buster" without having to read much, but they are less sure with The Sweet (album). OTOH, it could be a transparency bias—the Buster in Buster Keaton is clearly the main reason for the match, and so it's obviously wrong. Why The Sweet (album) matched is not immediately obvious.

During today's Wednesday search meeting, we talked a bit about the survey. Since queries we use if we deploy this for real will have to be vetted by humans (like Discernatron queries have been), we aren't limited to the 90 day retention window. (We should also be able to share the queries and vote results, too, as with Discernatron.) However, I was thinking that we should take queries in batches, and not turn them into training data until some high proportion (90%? 95%? 98%?) of the batch have gotten enough votes to use. We don't know whether the difference between uncommon queries and popular/unpopular pages is just quantitative (e.g., popularity of the result page) or qualitative (i.e., they really are different somehow and so would affect training). So taking the "easy" part of the batch first could skew training in some unpredictable way. </2¢>
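
The batching idea above can be sketched roughly as follows. This is only an illustration: the function name `batch_ready`, the `min_votes` cutoff, and the sample numbers in the usage note are assumptions, and the required share is left as a parameter because the thread has not settled on 90%, 95%, or 98%.

```python
def batch_ready(votes_per_query: dict, min_votes: int, required_share: float) -> bool:
    """Return True once enough of the batch has enough votes to be used.

    votes_per_query maps each query in the batch to the number of votes
    collected so far. min_votes (hypothetical) is the per-query minimum,
    and required_share is the proportion of the batch that must reach it.
    """
    if not votes_per_query:
        return False  # an empty batch is never ready
    enough = sum(1 for n in votes_per_query.values() if n >= min_votes)
    return enough / len(votes_per_query) >= required_share
```

For example, with made-up counts `{"a": 12, "b": 3, "c": 20, "d": 15}` and `min_votes=10`, three of four queries qualify, so the batch is held back at a 90% requirement and would only be released at 75% or lower. Holding the whole batch back is what avoids training on the "easy" part first.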

Due to a bug, 'I don't know' responses were not collected in this first version of the survey. The second iteration which includes the 60s delay will have them labeled 'unsure'.

> @mpopov—the graphs look good. As mentioned on IRC, percentages or some other normalization would be helpful in figuring out the best response rates among the question formats and comparing yes/no/etc. rates among answers.

> By eye, it looks like "would they want to read this article" gets slightly more engagement, and "would this article be relevant" and "would you click on this page" get slightly less, but I wouldn't be surprised if they were all statistically indistinguishable. I wonder if the question format has any effect on yes/no ratios, too. There may not be enough data to tell, though.

> Erik pointed out that people don't like Ian Bannen (actor in the 1970s version of Tinker Tailor Soldier Spy) very much, but if you go by a simple ratio of yes/no votes, he still comes in 3rd, which is reasonable. (Ha! I just got the survey while looking at his page. It seemed only fair to dismiss it, though I wanted to vote yes.)

> I think the results are promising. In places where the wisdom of the crowd disagrees with me, I think the results are understandable. For example, yesterday beetles gets all horrible results. But the least horrible is a different John Lennon song. That is at least tangentially related—it's a bad result, but it is also the best result.

> I also wonder if the timeout proportion is a useful signal, or even a lack of responses (that points to a lack of popularity for the results page, at least). Seems possible, but it's not immediately clear how to use them.

Here are the versions with proportions instead. % yes is #yes / (#yes + #no) (likewise for % no); % dismissed is #dismiss / (#yes + #no + #dismiss)
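
For reference, the proportions defined above can be computed as in this small sketch. The function name `survey_proportions` and the dictionary keys are made up for illustration, and returning `None` when a denominator is zero is an assumption about how empty cells should be handled.

```python
def survey_proportions(yes: int, no: int, dismiss: int) -> dict:
    """Compute % yes, % no and % dismissed from raw response counts.

    % yes and % no are taken over answered surveys only (#yes + #no);
    % dismissed is taken over everything (#yes + #no + #dismiss).
    """
    answered = yes + no
    total = answered + dismiss
    return {
        "pct_yes": yes / answered if answered else None,
        "pct_no": no / answered if answered else None,
        "pct_dismissed": dismiss / total if total else None,
    }
```

Note that the two denominators differ, so % yes and % dismissed are not expected to sum to anything meaningful; only % yes and % no sum to 1.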

> • Harder: Do you think we'll be able to do reasonable classification of the results (best/good/ok/bad/v.bad, or even just best/good/bad), or will we be limited to ranked order? Ranked order is still useful, though I think categorized would be better. (I'm also curious what you think the right model would be for doing the categorization. I find it very interesting...)

I'm working on that now :D In the meantime, here's how you compared to the survey takers:

Hey, there does appear to be some agreement there :) so that's promising!

Interesting how the more relevant you think an article is, the less engagement we saw with survey 1.

Neat stuff! Thanks!

@TJones @EBernhardson: I'm done with the 1st set of survey responses (https://people.wikimedia.org/~bearloga/reports/search-surveys.html), if you want to take a look. I've also got 2 possible scoring systems for the 2nd set, which has "I don't know", and I'd love to know what you think of them and whether you have ideas for alternatives.

debt added a comment. Thu, Aug 24, 3:53 PM

Sample image of what the test looks like:

debt added a comment. Thu, Aug 24, 5:54 PM

@mpopov gave a presentation at the Research Group meeting on Aug 24, 2017. Here are some of the notes that were taken:

  • Judging relevance from human graders
  • Goal: predict article relevance using aggregated public opinion
  • Method:
    • Ten example queries, and used the top 5 articles returned for each query
      • Gold standard dataset based on expert judgements (Trey and Erik, in this case)
      • Surveyed users, asking four questions about the page and the search that would have it as a top 5 article
      • Q1: Would you click this page when searching for "…"?
      • Q2: If you searched for "…", would this article be a good result?
      • Q3: If you searched for "…", would this article be relevant?
      • Q4: If someone searched for "…", would they want to read this article?
    • Two tests:
      • Test 1: Immediate survey pop-up, only yes/no answers recorded ("I don't know" was an option, but not recorded)
      • Test 2: 60 second delay, "I don't know" answers encoded as "unsure"
    • Results:
      • Positive slope between relevance and positive responses from the surveys, suggesting that this information could be used for training a model
      • Trained several models (logistic regressions, random forests, neural networks, naive Bayes, xgboost)
  • Questions
    • Can you give us a quick update on how queries with non matching keywords work?
      • All keywords have to match; we don't currently even allow a "mostly" match. It's as if the word AND was put between each token
    • What do users see in the survey?
    • How were the top 5 articles determined? Using the current search engine?
      • Yes, these results came from the current search engine
    • Question wording: "relevant to you" vs. "relevant to people" <-- primes the respondent differently, could yield different results.
    • Was this only tested with logged in users? (sounds like it was, if we used Notifications to deliver the survey?)
      • This was tested against anonymous users, not using Echo notifications but a JavaScript feature called 'mw.notification'
    • got it. thanks!
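
The "all keywords have to match" answer in the notes above can be illustrated with a toy sketch. This is not how the search engine is implemented: real matching involves text analysis (stemming, punctuation handling, and so on) that is ignored here, and the function name `matches_all_tokens` is hypothetical. It only shows the implicit-AND idea.

```python
def matches_all_tokens(query: str, document_text: str) -> bool:
    """Toy implicit-AND match: every query token must occur in the document.

    Tokens are split on whitespace and compared case-insensitively;
    no stemming or punctuation handling is attempted.
    """
    doc_tokens = set(document_text.lower().split())
    return all(token in doc_tokens for token in query.lower().split())
```

With this toy definition, the query "search engine" matches a document containing "a web search engine indexes pages", while "audio search" does not, because "audio" is missing even though "search" is present.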

Mentioned in SAL (#wikimedia-operations) [2017-09-08T23:46:22Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-rel-survey.php: T171740: Fix inverted sampling rates for human relevance survey (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-09-09T01:12:54Z] <ebernhardson@tin> Synchronized php-1.30.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: T171740: Reduce annoyance of survey by enforcing minimum 2 days between showing survey to same browser (duration: 00m 46s)

Change 377014 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Try even harder to not show survey multiple times

https://gerrit.wikimedia.org/r/377014

Change 377014 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Try even harder to not show survey multiple times

https://gerrit.wikimedia.org/r/377014

This seems like an awkward way to conduct a search relevance test. I certainly don't expect to see a pop-up about search relevance when reading an article and frankly it feels intrusive and distracting, bordering on annoying. I imagine for some people, it's also confusing.

Instead, why not ask users about the relevance of search results on the search results page itself? There it would have clear context and would also allow users to give feedback on multiple pages at once. Just my 2 cents. Feel free to ignore :)

EBernhardson added a comment. Edited Tue, Sep 19, 2:18 AM

> This seems like an awkward way to conduct a search relevance test. I certainly don't expect to see a pop-up about search relevance when reading an article and frankly it feels intrusive and distracting, bordering on annoying. I imagine for some people, it's also confusing.

> Instead, why not ask users about the relevance of search results on the search results page itself? There it would have clear context and would also allow users to give feedback on multiple pages at once. Just my 2 cents. Feel free to ignore :)

Data on the results users are looking at is not particularly actionable. There will be a blog post soon detailing the purpose here, but the high level concept is we need to aggregate together numerous user responses to have confidence in a particular relevance judgement. We already have the ability to do this on high volume queries via click logs combined with statistical modelling of user behaviour, but we have no ability to get that information for long tail queries (roughly defined as those which are issued less than 10 times per 90 days) which make up ~60% of search traffic. This is designed to specifically address the long tail of search queries which we currently have no reasonable way to collect relevance judgements for.
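
As a rough illustration of the long-tail definition used above (queries issued fewer than 10 times per 90 days), here is a sketch that assumes the input is a plain list of query strings already restricted to a 90-day window. The function name `long_tail_queries` and the input shape are assumptions, not a description of the actual logging pipeline.

```python
from collections import Counter
from typing import Iterable, Set


def long_tail_queries(query_log: Iterable[str], threshold: int = 10) -> Set[str]:
    """Return the queries issued fewer than `threshold` times.

    query_log is assumed to cover one 90-day window already; this
    function only counts occurrences and applies the cutoff.
    """
    counts = Counter(query_log)
    return {query for query, n in counts.items() if n < threshold}
```

Queries in the returned set are the ones for which click logs are too sparse to model, which is the gap the survey is meant to fill.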

To address the annoyance problem, we currently have a limit in place which should prevent showing the survey more than once every few days, per browser. The overall sampling rates are set to approximately 1 in 1000 page views, although some pages have higher sampling than others. If the data looks reasonable and we run further crowd-sourced data collection in the future, we will likely add a longer term (~permanent) opt-out behavior. Future tests should additionally be able to use lower sampling rates, as we are currently evaluating the difference between 4 different formulations of the question.
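
A sketch of how the per-browser limit and the sampling rate might combine, under stated assumptions: the 2-day minimum comes from the deploy note earlier in this task, the 1-in-1000 rate is the approximate overall figure quoted above, and the function name `should_show_survey`, its parameters, and the idea of passing in the last-shown timestamp are hypothetical. The real check lives in the browser-side JavaScript and may differ.

```python
import random
from datetime import datetime, timedelta
from typing import Optional

MIN_GAP = timedelta(days=2)  # minimum time between surveys for one browser
SAMPLE_RATE = 1 / 1000       # approximate overall sampling rate per page view


def should_show_survey(
    now: datetime,
    last_shown: Optional[datetime],
    rng: random.Random,
    sample_rate: float = SAMPLE_RATE,
) -> bool:
    """Decide whether to show the survey on this page view.

    last_shown is when this browser last saw a survey (None if never).
    The throttle is checked first, then the page view is sampled.
    """
    if last_shown is not None and now - last_shown < MIN_GAP:
        return False  # shown too recently; never sample this view
    return rng.random() < sample_rate
```

Checking the throttle before sampling means a recently surveyed browser is excluded regardless of the per-page rate, which matters because some pages are sampled more heavily than others.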

> This is designed to specifically address the long tail of search queries which we currently have no reasonable way to collect relevance judgements for.

That makes sense, although it seems like a good problem for Mechanical Turk. Anyway, glad to know that you're actively trying to minimize the annoying factor :)

FYI, there's already been like 4 questions about this on WP:VP/T. Which is a LOT relatively speaking (esp. if you consider that most people won't usually find their way to WP:VP/T).

Also today @MauryMarkowitz reports:

Every time I visit the H2S radar page, a pop-up repeatedly appears in the upper right corner of the browser window asking me if this is a suitable article if one is searching for "Lancaster operators". I answer No, trying to be helpful. Then it asks me again. And again. And again. Is this something en.wiki is doing, or is this perhaps a 3rd party plugin? Anyone know what this is? Maury Markowitz (talk) 13:22, 19 September 2017 (UTC)

More feedback:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Mysterious_random_popup

The complaint that stood out to me most is:

" — there's a pop-up — how did that get here — it seems to be from Wikiepdia itself — it's asking me a question, I'll try to be helpful — it says 'Would you click on this page ...' — would I ever click on a page? — why might I click on a page? —" and before I could come up with a sensible answer, it vanished again. If this is intended as a way of getting useful feedback from users, it's doomed to failure

> it seems like a good problem for Mechanical Turk. Anyway, glad to know that you're actively trying to minimize the annoying factor :)

@kaldari: Yeah, it does—which is why @EBernhardson built our own Mechanical Turk—called the Discernatron! The problem is that using it is really, really tedious. Given the constraints that we can't afford to pay people (so no Mechanical Turk) and a limited number of volunteers (getting the word out is hard), the plan for the Discernatron was to get people to look at a query, then review a bunch of possible results and rate them. It's extra tedious if you don't know a lot about what the query is talking about, or if you don't know what the article is about.

The survey solves a couple of those problems. More people know the survey exists because it comes to them instead of them finding it (though this can also be annoying). People who are on an article page are more likely to have some idea what that article is about than if they saw the article title and snippet on a search results page or in the Discernatron. There's still the problem of figuring out what the heck the query is supposed to mean, which is not always trivial. Alas.

I've written a blog post that gives a high-level and somewhat more organized overview of the project and what we hope to get out of it, if anyone else is interested.

For future iterations of the survey, some thoughts to consider. From the discussion @TheDJ also linked to above:

  • Making it look like a popup is really annoying to some people.
  • 30 seconds isn't long enough for some people to engage with the survey.
  • We need an opt-out mechanism.
  • "high recall" query/article pairings can be more annoying or confusing because they don't make sense to the reader.
    • This applies to some of the really "deep" results from the Discernatron data, but my guess is that it applies to some results in the top 20, too, especially when there just aren't many good results.
  • We should probably look at the Article Feedback Tool—and its Talk page archives—and try to avoid the mistakes made on that project.
  • We should consider not showing the same user the same query on the same article, ever. If someone is working on an article, say, and every two days they got the same survey on that article, that would be a different kind of annoying. (If an opt-out were available, this is when they'd use it.) OTOH, this may have technical challenges—we probably don't want to cram local storage full of a list of articles you've been surveyed about. Maybe something in prefs for logged-in users?

Over in T174106 a few more points are made, starting around T174106#3599816:

  • "high recall" results (or any not so good results) can be offensive because of the article/query pairing. This is not tractable to prevent because of volume and because of cultural differences between potential reviewers and readers. However, documentation could help.
  • yeah, on that opt-out button... sounds like it'd be popular! :P

Additional thoughts on the above:

  • Some of these problems may go away if we have a different UI. Popups are annoying, banners are differently annoying.
    • So maybe a survey box under the info box, as Erik suggested in passing earlier today, might obviate the need for an opt-out since it would be less intrusive.
    • It might also solve the problem of tracking to avoid repeat surveys because it's less intrusive.
    • A constant survey box doesn't necessarily have to have a timeout, which solves the problem of figuring out the optimal timeout.
  • I haven't read through much of the Article Feedback Tool archive, but the issue of quality is brought up right away. I think our A/B tests address that and the first one showed that it works now, with random people—though as it becomes more widely known, we could have vandals.

An obvious link to documentation, starting with a "what is this?", and a link to somewhere to leave feedback (optional) about the survey would be useful (I forget how I found the phab ticket, but it did involve a google search).
A fourth option to click on - "I want to answer in more words" would be great for people like me, but it would need to avoid the issues of the article feedback tool (alas I can't offer any suggestions how to do this off the top of my head).

TJones added a comment. Edited Wed, Sep 20, 1:59 PM

Another glitch reported on the Village Pump: someone got a survey while on the diff page for the Manual of Style. It seems unlikely that the MoS would get a survey, but we definitely shouldn't have surveys on diff pages. Is it possible this was some sort of weird race condition and the survey was from an earlier page they had navigated away from?

Edit: A link to the exact diff has been provided! What a weird place for a survey to pop up.