Question mark in query causes it to turn up unrelated search results
Closed, ResolvedPublic

Description

Hi. If I go to en.wikipedia.org and try:
https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=How+old+is+tom+cruise
I see Tom Cruise as the top search result, as expected.
However, if I add a question mark to the end and search for "How old is Tom Cruise?" instead, I see totally unrelated/unwanted results. This behaviour should be fixed.

Niharika created this task. Apr 26 2016, 4:33 PM
Restricted Application added projects: Discovery, Discovery-Search. Apr 26 2016, 4:33 PM
Restricted Application added a subscriber: Aklapper.
EBernhardson added a subscriber: EBernhardson. Edited Apr 26 2016, 4:39 PM

https://en.wikipedia.org/wiki/Help:Searching#If_you_cannot_find_what_you_are_looking_for

A common mistake is to type a question into the search bar and expect an answer;
some Web search tools such as Ask Jeeves support this. The Wikipedia search is a
text search only; questions, as such, can be asked at the reference desk and similar
places.

Thanks, @EBernhardson. I wasn't aware of that link.
Although I still think it's worth fixing this issue because not every new user is aware of that link and can inadvertently run into this problem.

Deskana triaged this task as "Normal" priority. Edited May 24 2016, 10:12 PM
Deskana moved this task from Needs triage to Up Next on the Discovery-Search board.
Deskana added a subscriber: Deskana.

Thanks for filing this task, @Niharika. We discussed this briefly and we agree it's suboptimal and non-obvious behaviour. The course of action we'll probably take is simply to strip question marks from the query if it's the last character in the search string... which is kind of a hack, but it'll work for now and it's easy for us to do.
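
For illustration, the "strip it only if it's the last character" hack described above could look roughly like the sketch below. This is not the CirrusSearch code; the function name is hypothetical and the behaviour is only what the comment above describes.

```python
def strip_final_question_mark(query: str) -> str:
    """Drop a single '?' only when it is the very last character.

    Hypothetical helper for illustration; not part of CirrusSearch.
    """
    if query.endswith("?"):
        return query[:-1]
    return query
```

With this, "How old is Tom Cruise?" is searched as "How old is Tom Cruise", while a question mark anywhere else in the query (or one followed by a trailing space) is left alone.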

TJones added a subscriber: TJones. May 31 2016, 1:25 PM

Ugh, this is hard. Empirically, lots of people are getting goofy results because they use ? as an actual question mark. On the other hand, somebody somewhere must be using it as wildcard; and there may be tools that assume it works the way it works now.

If we always strip it from the end of the string, we could make certain searches impossible. (e.g., jumpe? to match jumper and jumped)

Only limiting it to very poorly-performing queries (e.g., zero results or fewer than three results) won't even cover the case that spawned this ticket.

A hacky hack would be to only strip it if it is the very last character (so a space at the end keeps it in play), but that would be very opaque to the non-expert user.

Another hacky hack would be to strip spaces (or general whitespace) up to the last question mark on the end of a query, so that two ?s at the end of a query would leave one working ?. Non-expert users use ?? from time to time already, but certainly less often than using a single ?. Displaying the modified query might help some people figure out what happened, but not all.
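
One possible reading of this second hack, as a rough sketch: ignore trailing whitespace, then drop exactly one final question mark, so a doubled ?? leaves one working wildcard. The function name is hypothetical and this is an interpretation of the paragraph above, not actual search code.

```python
def strip_last_question_mark(query: str) -> str:
    """Ignore trailing whitespace, then remove one final '?'.

    Hypothetical helper; 'jumpe??' keeps one '?' as a wildcard.
    """
    trimmed = query.rstrip()
    if trimmed.endswith("?"):
        return trimmed[:-1]
    # No final question mark: leave the query exactly as typed.
    return query
```

Under this reading, "cruise? " becomes "cruise" and "jumpe??" becomes "jumpe?", which is the opaque part: the user has to know to double the question mark.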

Like @Deskana said, we could just do the simplest most obvious hack.

We could announce the change is coming and see how many objections it raises.

Or we could think about implementing the "expert mode" we recently discussed, though that needs proper UI consideration, too, and is generally a much larger project.

ksmith added a subscriber: ksmith. May 31 2016, 10:05 PM
debt added a subscriber: debt. Jun 1 2016, 2:10 PM

@TJones I think we can do 3 things: announce the change, do the change (remove the ? if it is at the end of a string/phrase) and then analyze the data afterward to see if it made the search results better or worse.

If it turns out we actually made things worse, we can revert or do a side step and do something else to tackle this issue.

TJones added a comment. Jun 1 2016, 2:28 PM

@debt, it'll definitely make things better for a lot of poorly performing queries, and probably fix a good number more that get goofy results, like cruise?

TJones claimed this task. Jun 1 2016, 4:56 PM
TJones moved this task from Backlog to In progress on the Discovery-Search (Current work) board.

Adding @CKoerner_WMF per this morning's retrospective.

TJones added a comment. Jun 9 2016, 7:25 PM

Analysis of ?-final queries and the effect of removing the ? for top-10 Wikipedias:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Dropping_Final_Question_Marks_in_the_Top_10_Wikipedias

Cpiral added a subscriber: Cpiral. Jun 22 2016, 8:07 PM
EBernhardson wrote:
https://en.wikipedia.org/wiki/Help:Searching#If_you_cannot_find_what_you_are_looking_for

A common mistake is to type a question into the search bar and expect an answer;
some Web search tools such as Ask Jeeves support this. The Wikipedia search is a
text search only; questions, as such, can be asked at the reference desk and similar
places.

Niharika wrote: Thanks, @EBernhardson. I wasn't aware of that link.
Although I still think it's worth fixing this issue because not every new user is aware of that link and can inadvertently run into this problem.

Any search results page is always designed to refine searches until any such problems go away; minimalism subtly relies on such self-documenting behavior in many applications.

The major public search engines ignore ? marks to solve this issue. Ask Jeeves takes text searches, as do the others. CirrusSearch is designed a head-above all other public search engines because of regexp. This is a huge deal, a win for researchers and info scientists, and it needs that question mark as part of an ecosystem that allows us to make regex a last resort; furthermore, when regex are needed, the ? mark is a very important side-tool for regex queries -- the required regex filter.

I really appreciate the discussion and the research (wow), but.

  • If we delete a final ? mark it ruins the word? search. If Search changes word? to word~, or makes any other change to any query, it shows up on the search results page, and when the user goes to refine it, awareness of any change will absolutely create an urgent need for help. We would be forced to post a message about this problem at the top of every search results page, ala https://www.mediawiki.org/wiki/Special:LinkSearch.
  • Search hints (that drop down) are a powerful indicator of the text-search nature.
  • AI queries, and automatic regex-filter productions, that work transparently in the background are on their way; meanwhile, no faux AI please.
TJones added a comment. Edited Jun 22 2016, 9:25 PM

Thanks for the thoughtful reply, @Cpiral!

Any search results page is always designed to refine searches until any such problems go away; minimalism subtly relies on such self-documenting behavior in many applications.

It's clear that a lot of inexperienced users are just asking questions, and as a result they get weird results or no results because they have a wildcard in their query. Given that many don't even know what a wildcard is, they may have trouble refining the query.

I don't have any direct evidence, but I think it is also the case that users, especially novice users, don't carefully read messages and alerts. For example, Google had to redesign their warnings because 70% of Chrome users ignore security warnings!

It seems easier to train the more expert users—and anyone who uses a wildcard is an expert compared to most users—to learn to escape their ? wildcards.

The major public search engines ignore ? marks to solve this issue. Ask Jeeves takes text searches, as do the others. CirrusSearch is designed a head-above all other public search engines because of regexp. This is a huge deal, a win for researchers and info scientists, and it needs that question mark as part of an ecosystem that allows us to make regex a last resort; furthermore, when regex are needed, the ? mark is a very important side-tool for regex queries -- the required regex filter.

The proposed solution wouldn't touch insource: queries, which seem to be where the real regex magic happens. Also, a backslash escape would allow the searcher to use ? as a wildcard, so under the new plan, word\? would be the same as word? now. We could also provide a warning (which the novice would ignore) indicating that the wildcard was ignored and a link with the escaped version for the expert user to follow. We haven't worked on the wording, but something like this:

  • Showing results for word. Did you mean to search with a wild card? Search instead for word\?
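
A rough sketch of how such a notice could be produced, assuming an unescaped ? is simply one not preceded by a backslash. The function name and return shape are made up for illustration, the wording is the draft above, and escaped backslashes are not handled.

```python
import re
from typing import Optional, Tuple

# A '?' that is NOT preceded by a backslash (assumption: no escaped backslashes).
UNESCAPED_QMARK = re.compile(r"(?<!\\)\?")


def strip_with_hint(query: str) -> Tuple[str, Optional[str]]:
    """Return (query to run, notice text or None). Hypothetical helper."""
    if not UNESCAPED_QMARK.search(query):
        # Nothing stripped (e.g. word\?), so no notice is needed.
        return query, None
    shown = UNESCAPED_QMARK.sub("", query).strip()
    # The escaped version an expert could follow to keep the wildcard.
    escaped = UNESCAPED_QMARK.sub(lambda m: "\\?", query)
    hint = (
        f"Showing results for {shown}. Did you mean to search with a "
        f"wild card? Search instead for {escaped}"
    )
    return shown, hint
```

For the query word? this yields the search term word plus the notice shown above; for word\? it returns the query untouched and no notice.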

I really appreciate the discussion and the research (wow), but.

  • If we delete a final ? mark it ruins the word? search. If Search changes word? to word~, or makes any other change to any query, it shows up on the search results page, and when the user goes to refine it, awareness of any change will absolutely create an urgent need for help. We would be forced to post a message about this problem at the top of every search results page, ala https://www.mediawiki.org/wiki/Special:LinkSearch.
  • Search hints (that drop down) are a powerful indicator of the text-search nature.
  • AI queries, and automatic regex-filter productions, that work transparently in the background are on their way; meanwhile, no faux AI please.

I don't think this rises to the level of even faux A.I., and I don't think we'll be able to support the hardware for any kind of A.I. any time soon. (We have had people ask why we don't just use Watson for search.)

Based on the numbers, we are messing up many more novice users by a factor of at least 100. We should do something to help out the regular users without crippling the expert users. I'm certainly open to more suggestions!

mpopov added a subscriber: mpopov. Jun 23 2016, 12:09 AM
debt added a comment. Jul 12 2016, 10:07 PM

We'll set up a meeting to go over our options and review the community's response and our way forward.

debt added a comment. Jul 21 2016, 6:04 PM

We met this morning and here are our next steps:

  • update documentation on escaped and unescaped question marks on en.wiki and mediawiki
  • finalize code
    • update config to be 'on' by default
    • confirm that we can turn off - by wiki - if determined that it's needed
  • release to production when ready
  • email community lists

@debt: Which option was selected? Ignoring trailing question marks? Leading and trailing? All? Or some non-ignore option?

@ksmith, we decided to go with stripping all question marks (replacing them with spaces to preserve word boundaries) as the default. There are options for (i) turning off ?-stripping, (ii) making it only query-final, and (iii) making it apply only at word boundaries. Any of those could be set for individual projects if the default turns out to be a bad choice.

Escaped question marks are treated properly in all contexts where stripping is done, so escaping them will do the right thing. (insource queries, which allow full regexes, are unaffected and should not be escaped.)
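
To make the options concrete, here is a minimal sketch of the four behaviours described above. It is not the merged patch: the mode names ("all", "final", "boundary", "off") and the function name are placeholders, "word boundary" is read as a ? followed by whitespace or the end of the query, and escaped question marks are simply left in place for later handling.

```python
import re

# Placeholder mode names mapped to patterns; each matches only an unescaped '?'.
_PATTERNS = {
    "all": r"(?<!\\)\?",                # every unescaped '?'
    "final": r"(?<!\\)\?\s*$",          # only a '?' at the end of the query
    "boundary": r"(?<!\\)\?(?=\s|$)",   # only a '?' that ends a word
}


def strip_question_marks(query: str, mode: str = "all") -> str:
    """Replace unescaped '?' with a space, per the chosen (hypothetical) mode."""
    # insource: queries allow full regexes, so they are never touched.
    if mode == "off" or "insource:" in query:
        return query
    if mode not in _PATTERNS:
        raise ValueError(f"unknown mode: {mode}")
    # Replace with a space (not nothing) so 'a?b' stays two words, then trim.
    return re.sub(_PATTERNS[mode], " ", query).strip()
```

With the default mode, "How old is Tom Cruise?" becomes "How old is Tom Cruise" and a?b becomes "a b", while jumpe\? and insource:/colou?r/ pass through unchanged.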

Change 301170 had a related patch set uploaded (by Tjones):
[WIP] Strip Question Marks from Queries by Default

https://gerrit.wikimedia.org/r/301170

Change 301170 merged by jenkins-bot:
Strip Question Marks from Queries by Default

https://gerrit.wikimedia.org/r/301170

Trizek-WMF added a subscriber: Trizek-WMF.

This update is on the train now and will be released to the bigger wikis on Thursday, Aug 4th.

I guess this is for User-notice :)

debt closed this task as "Resolved". Aug 8 2016, 4:40 PM