[Epic] Search Relevance: graded by humans
Closed, Resolved · Public

Authored By

	debt
	Jul 26 2017, 1:35 PM

Description

We want to run a series of tests that ask users who are viewing article pages for feedback, based on a curated list of queries. The tests will run on English Wikipedia using the 'mediawiki.notification' function and are expected to get approximately 1,000 impressions per week.

For our MVP (minimum viable product) test, we will hard-code a list of queries and articles into the code (using JavaScript). This will allow a small-scale evaluation to see whether this type of data is useful or whether we just receive a bunch of 'noise' from the test.

Some of the initial hardcoded queries are (for MVP only):

'sailor soldier tinker spy'
'what is a genius iq?'
'who is v for vendetta?'
'why is a baby goat a kid?'

The test will contain:

a link to the privacy policy: https://wikimediafoundation.org/wiki/Privacy_policy
3 selector buttons that a user can choose: yes, no, I don't know
ability for the user to dismiss the notification/question
ability for the user to scroll and the notification/question box does not impede reading of the article
an auto-timeout to dismiss the notification/question box automatically
the ability to only select one option before the notification/question box is dismissed

We will track:

what option the user selected
if the notification/question box gets dismissed by the user
if the notification/question box gets dismissed automatically (its session timed out without any interaction from the user)
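
The behaviour and tracking listed above (a single selection, user dismissal, automatic timeout) can be sketched as a small state model. This is only an illustrative Python sketch, not the JavaScript that was deployed. The class, method and outcome names are hypothetical, and the timeout is a parameter because the task does not state a value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the three selector buttons.
CHOICES = ("yes", "no", "dont_know")


@dataclass
class SurveyBox:
    """Toy model of one notification/question box (all names are illustrative)."""

    timeout_s: float               # auto-dismiss delay; not specified in the task
    outcome: Optional[str] = None  # the single value that would be tracked

    def _close(self, outcome: str) -> bool:
        # Only the first interaction counts; anything after that is ignored.
        if self.outcome is not None:
            return False
        self.outcome = outcome
        return True

    def select(self, choice: str) -> bool:
        """User picked one of the three options."""
        if choice not in CHOICES:
            raise ValueError(f"unknown choice: {choice!r}")
        return self._close(choice)

    def dismiss(self) -> bool:
        """User closed the box without answering."""
        return self._close("dismissed")

    def tick(self, elapsed_s: float) -> bool:
        """Called with the time since the box was shown; closes it on timeout."""
        if elapsed_s >= self.timeout_s:
            return self._close("timeout")
        return False
```

Recording one outcome per box keeps the three tracked cases (selected option, dismissed by the user, dismissed automatically) mutually exclusive.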

Additional test options to consider:

should we embed the desired queries into cached page render
should we use a graphic (smiley face, frowny face, unsure face) instead of the yes/no/not sure text
should we test on other language wikis
- would require translating all text

First draft of notification/question box:

updated screenshot of relevance question (1×1 px, 356 KB)

Sample smiley face option that could be used in a future test to avoid 'wall of text':

user_satisfaction_survey-google_translate.png (246×344 px, 31 KB)

https://gerrit.wikimedia.org/r/#/c/366318

Details

	Subject	Repo	Branch	Lines +/-
	Try even harder to not show survey multiple times	mediawiki/extensions/WikimediaEvents	master	+21 -13
	MVP of human graded search relevance on article pages	mediawiki/extensions/WikimediaEvents	master	+358 -3


Related Objects

Status	Assigned	Task
Invalid	None	T174064 [FY 2017-18 Objective] Implement advanced search methodologies
Resolved	Gehel	T171740 [Epic] Search Relevance: graded by humans
Resolved	debt	T171741 Search Relevance: MVP test (turned on)
Resolved	debt	T171742 Search Relevance: MVP test (turn it off)
Invalid	None	T174106 Search Relevance Survey test #3: action items
Resolved	EBernhardson	T174387 relevance survey: develop backend infrastructure to support lots of queries and lots of results per query
Resolved	EBernhardson	T175046 Search Relevance Survey test #3: turn on test
Resolved	EBernhardson	T175047 Search Relevance Survey test #3: turn off test
Resolved	mpopov	T175048 Search Relevance Survey test #3: analysis of test
Resolved	mpopov	T178096 Make a Puppet profile/role for doing R-based heavy stats/ML on Wikimedia Cloud
Declined	None	T176278 Cleanup localstorage key after relevance survey's are complete
Invalid	None	T176428 Search Relevance test #4 - action items
Declined	None	T183027 Analysis: search relevance test #4
Invalid	None	T178006 Search Relevance test #5: are users happy with the search results they got?
Declined	None	T183028 Analysis: search relevance test #5

Event Timeline

debt created this task. Jul 26 2017, 1:35 PM

Restricted Application added a subscriber: Aklapper. Jul 26 2017, 1:35 PM

debt created subtask T171741: Search Relevance: MVP test (turned on). Jul 26 2017, 1:38 PM

debt added a project: Epic.

debt created subtask T171742: Search Relevance: MVP test (turn it off).

debt moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Change 366318 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] MVP of human graded search relevance on article pages

https://gerrit.wikimedia.org/r/366318

gerritbot added a project: Patch-For-Review. Jul 26 2017, 10:29 PM

Change 366318 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] MVP of human graded search relevance on article pages

https://gerrit.wikimedia.org/r/366318

ReleaseTaggerBot added a project: MW-1.30-release-notes (WMF-deploy-2017-08-01_(1.30.0-wmf.12)). Jul 27 2017, 9:00 AM

Trizek-WMF subscribed. Aug 1 2017, 4:13 PM

Results from the first test:

sailor_soldier_tinker_spy.png (2×4 px, 374 KB)

how_do_flowers_bloom_.png (2×4 px, 370 KB)

what_is_a_genius_iq_.png (2×4 px, 385 KB)

who_is_v_for_vendetta_.png (2×4 px, 385 KB)

why_is_a_baby_goat_a_kid_.png (2×4 px, 387 KB)

P.S. Code up at https://github.com/wikimedia-research/Discovery-Search-Adhoc-SurveyMVP

I think a big part of this is going to be figuring out if we can translate the survey responses into grades that correlate with expert judgements (provided here by @TJones). I realize now those were never posted; they were just part of an email. Adding now:

who is v for vendetta?

ok	V for Vendetta (film)
ok	V for Vendetta
good	List of V for Vendetta characters
best	V (comics)
bad	Vendetta Pro Wrestling

star and stripes

ok	Stars and Stripes Forever (disambiguation)
bad	The White Stripes
bad	Tars and Stripes
ok	The Stars and Stripes Forever
bad	Stripes (film)

Typo for “stars and stripes”. DYM gives the right spelling.

block buster

best	Blockbuster
good	Block Buster!
bad	The Sweet (album)
ok	Block Busters
bad	Buster Keaton

how do flowers bloom?

bad	Britain in Bloom
v.bad	Flowers in the Attic (1987 film)
best	Flower
v.bad	Thymaridas
v.bad	Flowers in the Attic

search engine

best	Web search engine
good	List of search engines
ok	Search engine optimization
ok	Search engine marketing
ok	Audio search engine

I think “web search engine” is only ok, but it has a redirect from “search engine” so the community thinks it is the best.

yesterday beetles

v.bad	Private language argument
v.bad	Diss (music)
v.bad	How Do You Sleep? (John Lennon song)
v.bad	Maria Mitchell Association
v.bad	The Collected Stories of Philip K. Dick

Typo for “yesterday beatles”—there are no good results here. (DYM gives the right spelling.)

sailor soldier tinker spy

best	Tinker Tailor Soldier Spy
good	Tinker, Tailor
ok	Blanket of Secrecy
ok	List of fictional double agents
ok	Ian Bannen

Typo for “Tinker Tailor Soldier Spy”.

10 items or fewer

ok	Fewer vs. less
v.bad	10-foot user interface
v.bad	Magic item (Dungeons & Dragons)
v.bad	Item-item collaborative filtering
v.bad	Item 47

No real good result here. “10 items of less” might be of interest, but it’s not going to show up.

why is a baby goat a kid?

best	Goat
v.bad	Super Why!
v.bad	Barney & Friends (redirect from Barney Is A Dinosaur)
v.bad	The Kids from Room 402
v.bad	Oliver Hardy filmography

what is a genius iq?

good	Genius
best	IQ classification
bad	Genius (website)
ok	High IQ society
bad	Social IQ score of bacteria

@mpopov—the graphs look good. As mentioned on IRC, percentages or some other normalization would be helpful in figuring out the best response rates among the question formats and comparing yes/no/etc. rates among answers.

By eye, it looks like "would they want to read this article" gets slightly more engagement, and "would this article be relevant" and "would you click on this page" get slightly less, but I wouldn't be surprised if they were all statistically indistinguishable. I wonder if the question format has any effect on yes/no ratios, too. There may not be enough data to tell, though.

@EBernhardson, thanks for posting my "expert judgements"—ha! They are at least "somewhat considered judgements".

No real good result here. “10 items of less” might be of interest, but it’s not going to show up.

Thanks also for kindly reproducing my original typo here.. "10 items or less" is what it should be. ;)

Erik pointed out that people don't like Ian Bannen (actor in the 1970s version of Tinker Tailor Soldier Spy) very much, but if you go by a simple ratio of yes/no votes, he still comes in 3rd, which is reasonable. (Ha! I just got the survey while looking at his page. It seemed only fair to dismiss it, though I wanted to vote yes.)

I think the results are promising. In places where the wisdom of the crowd disagrees with me, I think the results are understandable. For example, yesterday beetles gets all horrible results. But the least horrible is a different John Lennon song. That is at least tangentially related—it's a bad result, but it is also the best result.

I also wonder if the timeout proportion is a useful signal, or even a lack of responses (that points to a lack of popularity for the results page, at least). Seems possible, but it's not immediately clear how to use them.

@mpopov: two questions—an easy one and a hard one.

Easy: how are "i don't know" votes counted; is that the same as "dismissed"? It might be useful to distinguish them, since "dismissed" seems to mean "I don't want to help" and "I don't know" seems to mean "I'd like to help but can't". The fact that a judgement is difficult might be meaningful.
Harder: Do you think we'll be able to do reasonable classification of the results (best/good/ok/bad/v.bad, or even just best/good/bad), or will we be limited to ranked order? Ranked order is still useful, though I think categorized would be better. (I'm also curious what you think the right model would be for doing the categorization. I find it very interesting...)

I'm looking forward to the results of the 60s vs 60ms delay in showing the survey. If there's a marked improvement in quality and a marked decrease in engagement, we should try again with a 30s delay to see if we get a better balance.

Possibly related to the short delay, I see what might be a familiarity bias: people probably know Buster Keaton is not really related to "block buster" without having to read much, but they are less sure with The Sweet (Album). OTOH, it could be a transparency bias—the Buster in Buster Keaton is clearly the main reason for the match, and so it's obviously wrong. Why The Sweet (album) matched is not immediately obvious.

During today's Wednesday search meeting, we talked a bit about the survey. Since queries we use if we deploy this for real will have to be vetted by humans (like Discernatron queries have been), we aren't limited to the 90 day retention window. (We should also be able to share the queries and vote results, too, as with Discernatron.) However, I was thinking that we should take queries in batches, and not turn them into training data until some high proportion (90%? 95%? 98%?) of the batch have gotten enough votes to use. We don't know whether the difference between uncommon queries and popular/unpopular pages is just quantitative (e.g., popularity of the result page) or qualitative (i.e., they really are different somehow and so would affect training). So taking the "easy" part of the batch first could skew training in some unpredictable way. </2¢>
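
The batch idea above could be sketched as a simple gate. This is a hypothetical illustration only: the function name, the per-query vote counts and both thresholds are assumptions, since the comment leaves the proportion (90%? 95%? 98%?) and the meaning of "enough votes" open.

```python
def batch_ready(vote_counts, min_votes, required_fraction):
    """Return True once enough of the batch has enough votes.

    vote_counts: mapping of query (or query/article pair) -> votes collected.
    min_votes: hypothetical number of votes that counts as "enough to use".
    required_fraction: share of the batch that must reach min_votes, e.g. 0.95.
    """
    if not vote_counts:
        return False
    enough = sum(1 for n in vote_counts.values() if n >= min_votes)
    return enough / len(vote_counts) >= required_fraction


# Made-up counts: three of four queries have at least 10 votes (75%).
counts = {"query a": 12, "query b": 9, "query c": 15, "query d": 20}
print(batch_ready(counts, min_votes=10, required_fraction=0.9))   # False
print(batch_ready(counts, min_votes=10, required_fraction=0.75))  # True
```

Holding the whole batch back until the gate passes is what avoids training only on the "easy" part of the batch.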

Due to a bug, 'I don't know' responses were not collected in this first version of the survey. The second iteration which includes the 60s delay will have them labeled 'unsure'.

debt closed subtask T171741: Search Relevance: MVP test (turned on) as Resolved. Aug 17 2017, 6:26 PM

In T171740#3528351, @TJones wrote:

@mpopov—the graphs look good. As mentioned on IRC, percentages or some other normalization would be helpful in figuring out the best response rates among the question formats and comparing yes/no/etc. rates among answers.

By eye, it looks like "would they want to read this article" gets slightly more engagement, and "would this article be relevant" and "would you click on this page" get slightly less, but I wouldn't be surprised if they were all statistically indistinguishable. I wonder if the question format has any effect on yes/no ratios, too. There may not be enough data to tell, though.

Erik pointed out that people don't like Ian Bannen (actor in the 1970s version of Tinker Tailor Soldier Spy) very much, but if you go by a simple ratio of yes/no votes, he still comes in 3rd, which is reasonable. (Ha! I just got the survey while looking at his page. It seemed only fair to dismiss it, though I wanted to vote yes.)

I think the results are promising. In places where the wisdom of the crowd disagrees with me, I think the results are understandable. For example, yesterday beetles gets all horrible results. But the least horrible is a different John Lennon song. That is at least tangentially related—it's a bad result, but it is also the best result.

I also wonder if the timeout proportion is a useful signal, or even a lack of responses (that points to a lack of popularity for the results page, at least). Seems possible, but it's not immediately clear how to use them.

Here are the versions with proportions instead. % yes is #yes / (#yes + #no) (likewise for % no); % dismissed is #dismiss / (#yes + #no + #dismiss)
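
For reference, the normalization described above can be written out as a short function. This is a sketch in Python rather than the code used for the graphs; the function name and the example counts are made up.

```python
def survey_proportions(yes: int, no: int, dismiss: int) -> dict:
    """Proportions as defined above.

    % yes and % no are taken over answered surveys only (#yes + #no);
    % dismissed is taken over #yes + #no + #dismiss.
    """
    answered = yes + no
    total = answered + dismiss
    return {
        "pct_yes": yes / answered if answered else None,
        "pct_no": no / answered if answered else None,
        "pct_dismissed": dismiss / total if total else None,
    }


# Made-up counts, purely to show the arithmetic.
print(survey_proportions(yes=30, no=10, dismiss=60))
# {'pct_yes': 0.75, 'pct_no': 0.25, 'pct_dismissed': 0.6}
```

Note that % yes and % no sum to 1 by construction, while % dismissed uses a different denominator and so is not directly comparable to them.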

how_do_flowers_bloom_.png (2×4 px, 396 KB)

sailor_soldier_tinker_spy.png (2×4 px, 397 KB)

what_is_a_genius_iq_.png (2×4 px, 405 KB)

who_is_v_for_vendetta_.png (2×4 px, 409 KB)

why_is_a_baby_goat_a_kid_.png (2×4 px, 410 KB)

Harder: Do you think we'll be able to do reasonable classification of the results (best/good/ok/bad/v.bad, or even just best/good/bad), or will we be limited to ranked order? Ranked order is still useful, though I think categorized would be better. (I'm also curious what you think the right model would be for doing the categorization. I find it very interesting...)

I'm working on that now :D In the meantime, here's how you compared to the survey takers:

Hey, there does appear to be some agreement there :) so that's promising!

Interesting how the more relevant you think an article is, the less engagement we saw with survey 1.

Neat stuff! Thanks!

@TJones @EBernhardson: I'm done with 1st set of survey responses if you want to take a look: https://people.wikimedia.org/~bearloga/reports/search-surveys.html also I've got 2 possible scoring systems for the 2nd set that has "I don't know" and I'd love to know what you think of them and if you have ideas for alternatives

debt added a comment. Aug 22 2017, 1:29 PM

This comment was removed by debt.

Sample image of what the test looks like:

@mpopov gave a presentation at the Research Group meeting on Aug 24, 2017; here are some of the notes that were taken:

Judging relevance from human graders
- Slides for meeting: https://docs.google.com/a/wikimedia.org/presentation/d/1PuOOSukPYFGWikppGmw9Cg85fay-IT9Fi8gB8-CUda8/edit?usp=sharing
- Report: https://people.wikimedia.org/~bearloga/reports/search-surveys.html (work in progress)
Goal: predict article relevance using aggregated public opinion
Method:
- Ten example queries, and used the top 5 articles returned for each query
  - Gold standard dataset based on expert judgements (Trey and Erik, in this case)
  - Surveyed users, asking four questions about the page and the search that would have it as a top 5 article
  - Q1: Would you click this page when searching for "…"?
  - Q2: If you searched for "…", would this article be a good result?
  - Q3: If you searched for "…", would this article be relevant?
  - Q4: If someone searched for "…", would they want to read this article?
- Two tests:
  - Test 1: Immediate survey pop-up, only yes/no answers recorded ("I don't know" was an option, but not recorded)
  - Test 2: 60 second delay, "I don't know" answers encoded as "unsure"
- Results:
  - Positive slope between relevance and positive responses from the surveys, suggesting that this information could be used for training a model
  - Trained several models (logistic regressions, random forests, neural networks, naive bayes, xgboost)

Questions
- Can you give us a quick update on how queries with non matching keywords work?
  - All keywords have to match, we don't currently even allow a "mostly" match, it's as if the word AND was put between each token
- What do users see in the survey?
- How were the top 5 articles determined? Using the current search engine?
  - Yes, these results came from the current search engine
- Question wording: "relevant to you" vs. "relevant to people" <-- primes the respondent differently, could yield different results.
- Was this only tested with logged in users? (sounds like it was, if we used Notifications to deliver the survey?)
  - This was tested against anonymous users, not using Echo notifications but a javascript functionality called 'mw.notification'
- got it. thanks!
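
The answer above about non-matching keywords (every keyword must match, as if AND were placed between tokens) can be illustrated with a toy matcher. This is a simplified sketch, not how the search engine is implemented: lowercasing and whitespace splitting are assumptions standing in for whatever text analysis the real engine applies, and the function name is hypothetical.

```python
def matches_all_tokens(query: str, document_text: str) -> bool:
    """Implicit AND: every query token must occur somewhere in the document."""
    doc_tokens = set(document_text.lower().split())
    return all(token in doc_tokens for token in query.lower().split())


print(matches_all_tokens("baby goat", "a baby goat is called a kid"))      # True
print(matches_all_tokens("baby goat", "a goat is a domesticated animal"))  # False
```

Under this rule a document that misses even one token is excluded outright; there is no "mostly" match.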

debt created subtask T174106: Search Relevance Survey test #3: action items. Aug 24 2017, 9:37 PM

Final draft up at https://wikimedia-research.github.io/Discovery-Search-Adhoc-SurveyMVP/

debt closed subtask T171742: Search Relevance: MVP test (turn it off) as Resolved. Aug 30 2017, 8:20 PM

debt created subtask T175049: Investigate which languages we should run human relevance surveys on next. Sep 5 2017, 5:31 PM

Mentioned in SAL (#wikimedia-operations) [2017-09-08T23:46:22Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-rel-survey.php: T171740: Fix inverted sampling rates for human relevance survey (duration: 00m 47s)

Mentioned in SAL (#wikimedia-operations) [2017-09-09T01:12:54Z] <ebernhardson@tin> Synchronized php-1.30.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: T171740: Reduce annoyance of survey by enforcing minimum 2 days between showing survey to same browser (duration: 00m 46s)

Change 377014 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Try even harder to not show survey multiple times

https://gerrit.wikimedia.org/r/377014

Change 377014 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Try even harder to not show survey multiple times

https://gerrit.wikimedia.org/r/377014

debt removed a subtask: T175049: Investigate which languages we should run human relevance surveys on next. Sep 12 2017, 5:40 PM

This seems like an awkward way to conduct a search relevance test. I certainly don't expect to see a pop-up about search relevance when reading an article and frankly it feels intrusive and distracting, bordering on annoying. I imagine for some people, it's also confusing.

Instead, why not ask users about the relevance of search results on the search results page itself? There it would have clear context and would also allow users to give feedback on multiple pages at once. Just my 2 cents. Feel free to ignore :)

In T171740#3616756, @kaldari wrote:

This seems like an awkward way to conduct a search relevance test. I certainly don't expect to see a pop-up about search relevance when reading an article and frankly it feels intrusive and distracting, bordering on annoying. I imagine for some people, it's also confusing.

Instead, why not ask users about the relevance of search results on the search results page itself? There it would have clear context and would also allow users to give feedback on multiple pages at once. Just my 2 cents. Feel free to ignore :)

Data on the results users are looking at is not particularly actionable. There will be a blog post soon detailing the purpose here, but the high level concept is we need to aggregate together numerous user responses to have confidence in a particular relevance judgement. We already have the ability to do this on high volume queries via click logs combined with statistical modelling of user behaviour, but we have no ability to get that information for long tail queries (roughly defined as those which are issued less than 10 times per 90 days) which make up ~60% of search traffic. This is designed to specifically address the long tail of search queries which we currently have no reasonable way to collect relevance judgements for.
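
As a rough illustration of the long-tail definition above (queries issued fewer than 10 times per 90 days), the split could be computed from a query log like this. This is a sketch only: the function names are hypothetical, and the input is assumed to be a plain list of query strings already restricted to one 90-day window.

```python
from collections import Counter


def long_tail_queries(query_log, threshold=10):
    """Return the set of queries issued fewer than `threshold` times.

    query_log: iterable of query strings from a single 90-day window.
    """
    counts = Counter(query_log)
    return {query for query, n in counts.items() if n < threshold}


def long_tail_share(query_log, threshold=10):
    """Fraction of all issued searches that belong to long-tail queries."""
    counts = Counter(query_log)
    total = sum(counts.values())
    tail = sum(n for n in counts.values() if n < threshold)
    return tail / total if total else 0.0


# Made-up log: one popular query and two rare ones.
log = ["popular"] * 12 + ["rare one"] * 3 + ["rare two"]
print(sorted(long_tail_queries(log)))  # ['rare one', 'rare two']
print(long_tail_share(log))            # 0.25
```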

To address the annoying problem we currently have a limit in place which should prevent showing the survey more than once every few days, per browser. The overall sampling rates are set to approximately 1 in 1000 page views, although some pages have higher sampling than others. If the data looks reasonable and we run further crowd sourced data collection in the future we will likely add a longer term (~permanent) opt-out behavior. Future tests should additionally be able to use lower sampling rates, as we are currently evaluating the difference between 4 different formulations of the question.
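
The two limits described above (a per-browser minimum gap, then random sampling of page views) might combine roughly as follows. This is a Python sketch of the decision logic only, not the deployed JavaScript. The 2-day gap comes from the SAL entry earlier in this task; the function name, the parameters and the idea of passing in the last-shown timestamp are assumptions.

```python
import random

# "minimum 2 days between showing survey to same browser" (from the SAL entry)
MIN_GAP_S = 2 * 24 * 60 * 60


def should_show_survey(now_s, last_shown_s, sample_rate=1 / 1000, rng=random.random):
    """Decide whether to show the survey on this page view.

    now_s: current time in seconds.
    last_shown_s: when this browser last saw the survey, or None if never.
    sample_rate: roughly 1 in 1000 page views overall; some pages use higher rates.
    rng: source of uniform random numbers in [0, 1), injectable for testing.
    """
    # Per-browser limit: never show again within the minimum gap.
    if last_shown_s is not None and now_s - last_shown_s < MIN_GAP_S:
        return False
    # Otherwise sample page views at the configured rate.
    return rng() < sample_rate
```

Checking the per-browser limit first means a recently surveyed browser is never sampled, whatever the page's sampling rate.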

This is designed to specifically address the long tail of search queries which we currently have no reasonable way to collect relevance judgements for.

That makes sense, although it seems like a good problem for Mechanical Turk. Anyway, glad to know that you're actively trying to minimize the annoying factor :)

FYI, there's already been like 4 questions about this on WP:VP/T. Which is a LOT relatively speaking (esp. if you consider that most people won't find their way to WP:VP/T usually).

Also today @MauryMarkowitz reports:

Every time I visit the H2S radar page, a pop-up repeatedly appears in the upper right corner of the browser window asking me if this is a suitable article if one is searching for "Lancaster operators". I answer No, trying to be helpful. Then it asks me again. And again. And again. Is this something en.wiki is doing, or is this perhaps a 3rd party plugin? Anyone know what this is? Maury Markowitz (talk) 13:22, 19 September 2017 (UTC)

More feedback:
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Mysterious_random_popup

The complaint that stood out to me most is:

" — there's a pop-up — how did that get here — it seems to be from Wikiepdia itself — it's asking me a question, I'll try to be helpful — it says 'Would you click on this page ...' — would I ever click on a page? — why might I click on a page? —" and before I could come up with a sensible answer, it vanished again. If this is intended as a way of getting useful feedback from users, it's doomed to failure

it seems like a good problem for Mechanical Turk. Anyway, glad to know that you're actively trying to minimize the annoying factor :)

@kaldari: Yeah, it does—which is why @EBernhardson built our own Mechanical Turk—called the Discernatron! The problem is that using it is really, really tedious. Given the constraints that we can't afford to pay people (so no Mechanical Turk) and a limited number of volunteers (getting the word out is hard), the plan for the Discernatron was to get people to look at a query, then review a bunch of possible results and rate them. It's extra tedious if you don't know a lot about what the query is talking about, or if you don't know what the article is about.

The survey solves a couple of those problems. More people know the survey exists because it comes to them instead of them finding it (though this can also be annoying). People who are on an article page are more likely to have some idea what that article is about than if they saw the article title and snippet on a search results page or in the Discernatron. There's still the problem of figuring out what the heck the query is supposed to mean, which is not always trivial. Alas.

I've written a blog post that gives a high-level and somewhat more organized overview of the project and what we hope to get out of it, if anyone else is interested.

For future iterations of the survey, some thoughts to consider. From the discussion @TheDJ also linked to above (archived here):

Making it look like a popup is really annoying to some people.
30 seconds isn't long enough for some people to engage with the survey.
We need an opt-out mechanism.
"high recall" query/article pairings can be more annoying or confusing because they don't make sense to the reader.
- This applies to some of the really "deep" results from the Discernatron data, but my guess is that it applies to some results in the top 20, too, especially when there just aren't many good results.
We should probably look at the Article Feedback Tool—and its Talk page archives—and try to avoid the mistakes made on that project.
We should consider not showing the same user the same query on the same article, ever. If someone is working on an article, say, and every two days they got the same survey on that article, that would be a different kind of annoying. (If an opt-out were available, this is when they'd use it.) OTOH, this may have technical challenges—we probably don't want to cram local storage full of a list of articles you've been surveyed about. Maybe something in prefs for logged-in users?

Over in T174106 a few more points are made, starting around T174106#3599816:

"high recall" results (or any not so good results) can be offensive because of the article/query pairing. This is not tractable to prevent because of volume and because of cultural differences between potential reviewers and readers. However, documentation could help.
yeah, on that opt-out button... sounds like it'd be popular! :P

Additional thoughts on the above:

Some of these problems may go away if we have a different UI. Popups are annoying, banners are differently annoying.
- So maybe a survey box under the info box, as Erik suggested in passing earlier today, might obviate the need for an opt-out since it would be less intrusive.
- It might also solve the problem of tracking to avoid repeat surveys because it's less intrusive.
- A constant survey box doesn't necessarily have to have a timeout, which solves the problem of figuring out the optimal timeout.
I haven't read through much of the Article Feedback Tool archive, but the issue of quality is brought up right away. I think our A/B tests address that and the first one showed that it works now, with random people—though as it becomes more widely known, we could have vandals.

TJones mentioned this in T174106: Search Relevance Survey test #3: action items. Sep 19 2017, 7:42 PM

TheDJ created subtask T176278: Cleanup localstorage key after relevance survey's are complete. Sep 19 2017, 10:09 PM

Thryduulf subscribed. Sep 19 2017, 11:03 PM

An obvious link to documentation, starting with a "what is this?", and a link to somewhere to leave feedback (optional) about the survey would be useful (I forget how I found the phab ticket, but it did involve a google search).
A fourth option to click on - "I want to answer in more words" would be great for people like me, but it would need to avoid the issues of the article feedback tool (alas I can't offer any suggestions how to do this off the top of my head).

Another glitch reported on the Village Pump (archived here): someone got a survey while on the diff page for the Manual of Style. It seems unlikely that the MoS would get a survey, but we definitely shouldn't have surveys on diff pages. Is it possible this was some sort of weird race condition and the survey was from an earlier page they had navigated away from?

Edit: A link to the exact diff has been provided! What a weird place for a survey to pop up.

debt removed projects: MW-1.30-release-notes (WMF-deploy-2017-08-01_(1.30.0-wmf.12)), Patch-For-Review. Sep 20 2017, 6:09 PM

debt added a subtask: T176428: Search Relevance test #4 - action items. Sep 21 2017, 4:52 PM

debt created subtask T178006: Search Relevance test #5: are users happy with the search results they got?. Oct 11 2017, 9:12 PM

debt added a parent task: T174064: [FY 2017-18 Objective] Implement advanced search methodologies. Oct 11 2017, 9:50 PM

TJones mentioned this in T182824: [epic] Show query-frequency-stratified results in A/B test results. Dec 13 2017, 8:54 PM

TJones mentioned this in T89970: Enable microsurveys for long-term tracking of editing experience. Jan 5 2018, 4:09 PM

TJones mentioned this in T186742: Predict relevance of search results from historical clicks using a Neural Click Model. Mar 20 2018, 3:15 PM

EBernhardson moved this task from not in use - please delete to Waiting on the Discovery-Search (Current work) board. Jul 3 2018, 5:32 PM

moving from current work to backlog board to reflect reality

Smalyshev moved this task from needs triage to [epic] on the Discovery-Search board. Jan 29 2019, 7:33 PM

Krinkle closed subtask T176278: Cleanup localstorage key after relevance survey's are complete as Declined. May 20 2020, 6:09 PM

CBogen closed subtask T178006: Search Relevance test #5: are users happy with the search results they got? as Invalid. Aug 28 2020, 2:58 PM

CBogen closed subtask T176428: Search Relevance test #4 - action items as Invalid.

CBogen closed subtask T174106: Search Relevance Survey test #3: action items as Invalid.

We have initial data, but enough time has passed that we probably want to redo most analysis. It does not seem that there is much value here.

	F9161493: example human search relevance survey
	Aug 24 2017, 3:53 PM

	F9102027: score_compare.png
	Aug 17 2017, 9:29 PM

	F9102005: block_buster.png
	Aug 17 2017, 9:29 PM

	F9102012: yesterday_beetles.png
	Aug 17 2017, 9:29 PM

	F9102009: star_and_stripes.png
	Aug 17 2017, 9:29 PM
