
Search Relevance test #4 - action items
Open, MediumPublic

Description

For the fourth running of the search relevance test, we'd like to try a few things differently:

Backend:

  • Do not show the same user the same query on the same article, ever. 
  • Increase the timeout for the survey to display to 60 seconds (currently 30 seconds)
  • Only show the survey on article pages (not diffs or anywhere else)
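These backend rules can be sketched as a simple client-side gate. The snippet below is purely illustrative: the storage keys and function names are hypothetical, not the actual WikimediaEvents code, and the 2-day minimum gap between surveys comes from the anonymous-user behavior described later in this task.

```javascript
// Illustrative sketch only (hypothetical keys/names, not the real extension code).
// Rules sketched: a 2-day minimum gap between surveys, and never repeating
// the same query+article pair for the same user.

const TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000;

function canShowSurvey(store, query, articleId, now) {
  const lastShown = Number(store['survey-last-shown'] || 0);
  if (now - lastShown < TWO_DAYS_MS) {
    return false; // a survey was shown too recently
  }
  const seen = JSON.parse(store['survey-seen-pairs'] || '[]');
  // Never show the same query on the same article twice.
  return !seen.includes(query + '|' + articleId);
}

function recordSurveyShown(store, query, articleId, now) {
  store['survey-last-shown'] = String(now);
  const seen = JSON.parse(store['survey-seen-pairs'] || '[]');
  seen.push(query + '|' + articleId);
  store['survey-seen-pairs'] = JSON.stringify(seen);
}
```

In practice `store` would be `localStorage` (for anonymous users) or backend state (for logged-in users); a plain object is used here so the sketch stays self-contained.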

Frontend:

  • Update wording for the question to ask
    • "If you searched for ‘Naval flags’ would this article be relevant?"
  • Update wording for answer text
    • Yes, I don't know, No
  • Add link to wiki page where the search relevance test is explained (and offer the talk page for discussion)
    • "Why are we asking this?"
  • Add opt-out for logged in users that would last forever
    • "Opt out of search relevance questions"
  • Add 'close' button in top right of survey display
  • Keep privacy policy link with slight update to wording
    • "Privacy policy information"

To do:

  • write wiki page for why we're testing and request feedback
  • turn on test
  • turn off test
  • analyze results
  • plan next test

Event Timeline

debt created this task. Sep 21 2017, 4:52 PM
Restricted Application added a subscriber: Aklapper. Sep 21 2017, 4:52 PM
TJones added a comment. (Edited) Sep 21 2017, 4:59 PM

Is "maybe" going to go into the same bucket that "i don't know" has gone into? They aren't quite the same.

Do we want to talk about some other possible non-popup UI? That seemed to be the most annoying—though the forever opt-out will solve that problem for the most annoyed people.

Elasticsearch, for example, puts its "would you like to sign up for an account" message at the end of the discussion page. Though then we wouldn't get many responses for really long articles.

debt added a comment.Sep 22 2017, 8:20 PM

Is "maybe" going to go into the same bucket that "i don't know" has gone into? They aren't quite the same.

They're not quite the same, but I'm not sure that that distinction is made outside of our team (i.e. do users of the survey feel that the definition of 'I don't know' and 'maybe' are basically the same?)

Do we want to talk about some other possible non-popup UI?

I'd like to keep it as a popup survey for now. Adding in a forever opt-out option for logged in users is a big win for those that are on wiki for tons of time and don't want to answer a survey. I'm also concerned that if it goes anywhere else, our feedback will diminish drastically because the users won't see the survey (especially if we place it at the end of the article page).

I'm also concerned that if it goes anywhere else, our feedback will diminish drastically because the users won't see the survey (especially if we place it at the end of the article page).

Yeah, there are definitely some trade-offs there. I worry about making the mistakes of the Article Feedback Tool, though. The forever opt-out should be enough. Hopefully we can find a middle ground of not-too-annoying and effective data collection.

Is "maybe" going to go into the same bucket that "i don't know" has gone into? They aren't quite the same.

They're not quite the same, but I'm not sure that that distinction is made outside of our team (i.e. do users of the survey feel that the definition of 'I don't know' and 'maybe' are basically the same?)

I'd keep them separate. For example, if I were surveyed about searching for 'half adder truth table', I'd very likely answer "I don't know", as I have no idea what a half adder is and I'm sketchy on truth tables; unless the article I happened to be reading was explicitly about truth tables for half adders, I wouldn't know whether the page was relevant or not. However, if I were surveyed about 'naval flags' on the Red ensign article, then "maybe" would be a better answer, as the article would be relevant to some queries but not others.

Separately, by "timeout" do you mean the length of time the popup is displayed before it disappears? If so, I'd like a way of stickying it until I can answer. I do a lot of general browsing of Wikipedia, and if a question like that comes up when I'm not looking up something in particular, it's very probable that I'll load a new tab or two and do some research (such as what search results currently appear for that and similar terms) before answering the question, if I'm not initially sure.

debt added a comment.Sep 25 2017, 6:42 PM

In T176428#3628266, @debt wrote:

Is "maybe" going to go into the same bucket that "i don't know" has gone into? They aren't quite the same.

They're not quite the same, but I'm not sure that that distinction is made outside of our team (i.e. do users of the survey feel that the definition of 'I don't know' and 'maybe' are basically the same?)

I'd keep them separate.

Are you thinking that we should have both responses available: maybe and I don't know?

Separately, by "timeout" do you mean the length of time the popup is displayed before it disappears?

Yes, we mean the timeout of the survey popup being displayed; we don't want to annoy folks by having it stay around too long. But we felt that 30 seconds was too short, so we want to test lengthening it to 60 seconds before it fades away.
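A minimal sketch of that fade-away decision (illustrative only; the names are hypothetical, and the hover-to-stick behavior is Thryduulf's suggestion from above, not something confirmed to be in the extension):

```javascript
// Illustrative sketch, not the actual extension code.
// Decide whether the survey popup should be dismissed, given when it was
// shown, the current time, and whether the user is hovering over it.

const SURVEY_TIMEOUT_MS = 60 * 1000; // lengthened from 30s to 60s for this test

function shouldDismiss(shownAt, now, isHovered, timeoutMs = SURVEY_TIMEOUT_MS) {
  if (isHovered) {
    return false; // stickied while the user is interacting with it
  }
  return now - shownAt >= timeoutMs;
}
```

A real implementation would poll this (or reset a `setTimeout` on mouseover), but the decision logic is the part under discussion here.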

In T176428#3628266, @debt wrote:

Is "maybe" going to go into the same bucket that "i don't know" has gone into? They aren't quite the same.

They're not quite the same, but I'm not sure that that distinction is made outside of our team (i.e. do users of the survey feel that the definition of 'I don't know' and 'maybe' are basically the same?)

I'd keep them separate.

Are you thinking that we should have both responses available: maybe and I don't know?

Yes, if those with more knowledge of human factors than me don't think that would be too complicated.

Are you thinking that we should have both responses available: maybe and I don't know?

Yes, if those with more knowledge of human factors than me don't think that would be too complicated.

That's why I asked originally—I think they should be separated because they do mean different things, as discussed above. "maybe" means, "this is a mediocre fit", and "I don't know" means "I'm not qualified to judge".

On the one hand, I think "I don't know" is more useful than "maybe" if we can only have one of them, because "maybe" comes out of the proportion of yes/no answers, though I'd take all four options if that's possible (though it may not be).

On the other hand, maybe @debt is right and most people won't make such fine distinctions. The fact that @Thryduulf agrees with me is not really much evidence, since anyone bothering to come post on Phab—much less re-organize naval flags—is several standard deviations above the mean on the involvement and motivation scale and not indicative of typical users. (So, not average, but very much appreciated!)

In response to Deb's question from IRC:

If we keep 'I don't know' and add in 'maybe' what would that do to our test?

Having all four options would change the stats, and @mpopov already built a classifier based on the Discernatron data before. If we introduce "maybe" we'd need to recalibrate that, which means this next test would have to be on the Discernatron data; is that the plan? If so, maybe it'd give a better model—or maybe it'd be more noise. I don't know.

(And now every time I use "maybe" or "I don't know" I'm analyzing what it means and how they differ. Working on this stuff causes brain damage!)

I think another important thing to figure out will be how many survey impressions we think we need to get reliable information (maybe this is already covered).

debt added a comment.Oct 11 2017, 9:27 PM

We chatted about this test and others (see T171215#3677637 for notes taken during the meeting) and created a new survey to ask if users liked the search results they received (T178006).

This test can go forth, but we'll use 'I don't know' rather than 'maybe' to keep the data parameters the same as the previous tests that we've run.

debt updated the task description. Oct 11 2017, 9:27 PM

Change 384178 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Update search relevance survey based on feedback from last test

https://gerrit.wikimedia.org/r/384178

Change 384179 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Track search survey history in backend

https://gerrit.wikimedia.org/r/384179

Change 384180 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Add opt out to relevance survey

https://gerrit.wikimedia.org/r/384180

Change 384181 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Add close button to relevance survey notification

https://gerrit.wikimedia.org/r/384181

EBernhardson added a comment. (Edited) Oct 13 2017, 10:39 PM

Patches are up. Some minor changes from the spec:

  • The wording for the privacy statement / opt-out was just too wide to fit in the notification. The notification window is 20 ems wide (basically 20 letters wide). Shrinking the text a little can make a little more fit, but it has to be quite small, which I'm not comfortable with for something as important as the privacy policy. As such, the text in the last line is: "Privacy Statement | Opt out of survey"
  • The 'do not show the same article to the same user, ever' requirement only works for logged-in users. I could write some extra code to provide the same functionality in JavaScript, but I'm wary of adding a bunch of extra JavaScript that's shipped to every page on all our sites just for that. After this series of patches, anonymous user behavior is the same as before: surveys are shown at least 2 days apart, but no knowledge of prior surveys is maintained. The implementation also has a side effect of producing more false positives (thinking a user has seen a title+query they haven't actually seen) as the user sees more and more surveys, but the user would have to have seen quite a few surveys for that error rate to get particularly high (58 for a 1% false-positive rate, 275 for 10%). If a user has seen 275 surveys they probably don't want to see them anymore anyway, and with the 2-day timeout that would also take more than a year.
  • Anonymous opt-out is also on a best-effort basis: the information is stored in local storage, the same as the timeout for how long until we can show a survey again. Logged-in users have their opt-out tracked on the backend, which has stronger guarantees.
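The "more surveys seen, more false positives" behavior described above is characteristic of a fixed-size probabilistic set such as a Bloom filter. Whether the actual patch uses one isn't stated in this task, so treat the following as an illustration only; the parameters are hypothetical and won't reproduce the exact 58/275 figures.

```javascript
// Illustration only: estimated false-positive rate of a Bloom filter
// holding nItems entries in mBits bits with kHashes hash functions.
// Parameters are hypothetical, not taken from the actual patch.
function bloomFalsePositiveRate(nItems, mBits, kHashes) {
  // Standard approximation: p = (1 - e^(-k*n/m))^k
  return Math.pow(1 - Math.exp(-kHashes * nItems / mBits), kHashes);
}
```

The rate climbs as a user's seen-set grows, which matches the trade-off Erik describes: negligible error for dozens of surveys, noticeable only after hundreds.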

One thing I'm undecided on: this new version requires both logged-in and anonymous users to make an API request before the survey is shown to them. That could be a reasonably large number of new API requests. I don't think it's problematic, but if others do, I could rework things so only logged-in users have to make the API request.

It all sounds great, Erik! I have no strong feelings on the API requests, but do you have an estimate of how many a "reasonably large number" would be, say, per day? That way, those who worry could better calibrate their worry. I guess I'm not worried because it doesn't seem like it can be all that many.

debt added a comment.Oct 16 2017, 5:54 PM
  • The wording for the privacy statement / opt-out was just too wide to fit in the notification.

Yup, this is fine.

  • The 'do not show the same article to the same user, ever' requirement only works for logged-in users.... With the 2-day timeout that would also take more than a year.

This is good, and we shouldn't add a bunch of JS that would slow down every page for anon users.

  • Anonymous opt-out is also on a best-effort basis: the information is stored in local storage, the same as the timeout for how long until we can show a survey again. Logged-in users have their opt-out tracked on the backend, which has stronger guarantees.

Sounds good!

One thing I'm undecided on: this new version requires both logged-in and anonymous users to make an API request before the survey is shown to them. That could be a reasonably large number of new API requests. I don't think it's problematic, but if others do, I could rework things so only logged-in users have to make the API request.

Could we run a test on this for a day or two and see what the servers do? If we think it'll be an issue, let's only do the API request for logged-in users.

Need to evaluate how we want to proceed on this test, based on our latest.

TJones added a comment. (Edited) Mar 20 2018, 6:31 PM

Should this go in "Needs Review"? It seems to cover multiple topics, so we need a column labeled "It's Complicated".

Change 384178 abandoned by EBernhardson:
Update search relevance survey based on feedback from last test

https://gerrit.wikimedia.org/r/384178

Change 384179 abandoned by EBernhardson:
Track search survey history in backend

https://gerrit.wikimedia.org/r/384179

Change 384180 abandoned by EBernhardson:
Add opt out to relevance survey

https://gerrit.wikimedia.org/r/384180

Change 384181 abandoned by EBernhardson:
Add close button to relevance survey notification

https://gerrit.wikimedia.org/r/384181

Regarding Yes / Maybe / No / Don't know: would the classification problem go away, or become easier to solve, if it was instead
Yes / Maybe / No / Skip (this question)? The last option would return no data about how good the match is (i.e. the same as if they hadn't been asked the question). We could possibly keep track of the number of people who chose to skip, so that any question with a particularly high skip count could be human-reviewed to see why (probably a niche topic, but maybe the question doesn't make sense).

Aklapper removed EBernhardson as the assignee of this task. Fri, Jun 19, 4:20 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)