Search returns random results when search query begins with a hyphen
Closed, Declined · Public

Description

https://en.wiktionary.org/w/index.php?search=-happy&title=Special:Search
https://en.wiktionary.org/w/index.php?search=-blablabla&title=Special:Search

All such searches return the same set of random results:

  1. vīriešu
  2. 하십시오체
  3. viņas
  4. paykuna
  5. -li

etc.

Exact title matches are found (https://en.wiktionary.org/w/index.php?search=-et&title=Special:Search&fulltext=1), but that is the only useful result you get.

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Jul 11 2017, 10:13 PM
debt added subscribers: EBernhardson, dcausse, debt.

This also happens on Wikipedia, where a "-" (dash) directly in front of a one-word search query returns the exact same odd set of results:

https://en.wikipedia.org/w/index.php?search=-cat&title=Special:Search&profile=default&fulltext=1
https://en.wikipedia.org/w/index.php?search=-fullnelson&title=Special:Search&profile=default&fulltext=1

dash-in-search-query_cat.png (857×613 px, 175 KB)

dash-in-search-query_fullnelson.png (862×610 px, 176 KB)

- is part of the search syntax and indicates that you want to exclude results containing the search term. The exception is an exact title match (to catch page titles starting with a dash).
-happy will display all pages without happy in their content.
The results seem random because there are no terms to base the scores on.
This is, IMO, expected behavior: searching with only one negated term is usually useless (e.g., Google displays nothing for -happy). Negated terms are most frequently combined with positive ones: sad -happy
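
To illustrate the scoring point, here is a toy sketch in Python. It is only an assumed, simplified model (the `search` function and the sample pages are made up for illustration) and not how CirrusSearch or Elasticsearch actually rank documents. It shows that a query made only of negated terms leaves nothing to score on, so every surviving page ties.

```python
# Toy model only: NOT how CirrusSearch or Elasticsearch actually score documents.
def search(docs, query):
    """Return (title, score) pairs for a query of plain and '-'-negated words."""
    terms = query.split()
    negated = [t[1:] for t in terms if t.startswith("-") and len(t) > 1]
    wanted = [t for t in terms if not t.startswith("-")]
    results = []
    for title, text in docs.items():
        words = text.lower().split()
        if any(n in words for n in negated):
            continue  # excluded by a negated term
        if wanted and not all(w in words for w in wanted):
            continue  # missing a required positive term
        # The score depends only on positive terms; with none, it is always 0.
        score = sum(words.count(w) for w in wanted)
        results.append((title, score))
    return sorted(results, key=lambda r: -r[1])


# Hypothetical sample pages.
docs = {
    "A": "sad sad day",
    "B": "happy day",
    "C": "a sad but happy day",
    "D": "rainy day",
}

only_negated = search(docs, "-happy")   # every surviving page scores 0
combined = search(docs, "sad -happy")   # the positive term gives a real ranking
print(only_negated)
print(combined)
```

With only `-happy`, the order of the surviving pages is arbitrary because all scores tie; adding a positive term such as `sad` gives the ranking something to work with.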

I was attempting to search for the "suffix" -happy (as in trigger-happy), in case it helps understand where I was going with my search.

In any case, this is surely not "expected" from the end user's perspective. Displaying no results would be better than displaying random results.

Even without - being used as negation in the search syntax, searching for - is difficult because it is always removed by the tokenizers. I think that an insource:/-happy/ search would bring results closer to what you expect, or even insource:/ [a-z]+-happy/ to exclude -happy used in URLs.

I don't have strong opinions on whether a single negated term should return results or not. I believe that returning results has some value:

  • counting (not looking at the results but the number of results)
  • chaining multiple negated words to get a reduced list of pages
  • small wikis

Searching for "*-happy" on wiktionary gets you to the search results I think you're wanting to get to, @TTO

happy-wiktionary.png (959×553 px, 216 KB)

Let's take a look at this in the Elasticsearch parser and see if we can change how the UI handles a negative symbol before a word, so that the UI displays nothing if we don't have any results.

Depending on the investigation of this, we'll determine the work to be completed and when.

Most instances of -<word> return what looks like the same set of results, because most articles don't have <word> in them. But when I search for -the I really do want all the articles that don't have the in them (or at least a count of them, as David suggested above). It's not Extremely Valuable Research™, but searching for -the then -of then -and (and then -the -of -and) on English Wikipedia is a neat kind of discovery of language that anyone can do and that I wouldn't want to block if we don't have to.

So, I'm not sure it's better to display nothing for -word. Google does do that: searching for -happy gives no results, but birthday -happy does what you'd expect. But giving no results is neither the outcome @TTO wants (find the string "-happy") nor the outcome I want (find everything without "happy"—ooo, that makes me sound so very sad).

I think this is a case where we need more documentation. Maybe we should look for instances of potentially confusing power-user syntax, catch them, and give documentation hints. So, if you search for -happy then you get the standard results but also a message that says "Did you mean to search for articles without happy in them? If not, try searching for insource:/-happy/." (Another obvious one is a question mark: "Use \? as a single-character wildcard".)

Another option is changing the syntax to be unambiguous, but I think that's a non-starter because many people would never discover the new syntax and there's nothing that someone can't make ambiguous. At my first job, someone wanted to search for commas! Commas!

Or maybe we just make sure the documentation includes an example like this and that's enough.

By the way, I suggest insource:/-happy/ over insource:/ [a-z]+-happy/ unless you know exactly what you are looking for or really can't handle extra results like URLs or file names. insource:/ [a-z]+-ridden/, for example, won't find cliché-ridden—because English is awful.
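
A quick way to see the difference between the two patterns is to try them on a few strings. The sketch below uses Python's `re` module, which is not the regex engine CirrusSearch uses, so treat it only as an approximation; the sample strings (including the example.org URL) are made up for illustration.

```python
import re

# Approximation only: Python's `re` is not the engine behind insource:/.../.
loose = re.compile(r"-ridden")            # like insource:/-ridden/
strict = re.compile(r" [a-z]+-ridden")    # like insource:/ [a-z]+-ridden/

# Hypothetical sample strings.
samples = [
    "a guilt-ridden confession",
    "a cliché-ridden speech",
    "https://example.org/bed-ridden.png",
]

# For each sample: (matched by loose pattern, matched by strict pattern)
results = [(bool(loose.search(s)), bool(strict.search(s))) for s in samples]
for sample, (is_loose, is_strict) in zip(samples, results):
    print(f"{sample!r}: loose={is_loose} strict={is_strict}")
```

The strict pattern skips the URL as intended, but it also skips cliché-ridden, because é is outside [a-z].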

I think adding in documentation like that is perfect, maybe on https://www.mediawiki.org/wiki/Help:CirrusSearch ?

I'm not entirely sure how hard it would be to add in a message that would show to the user, something like

"Did you mean to search for articles without happy in them? If not, try searching for insource:/-happy/."

but, that would be the *best* place to put documentation like that—where the vast majority of our users would be able to see it and then use the suggestions.

Okay, I've edited the CirrusSearch Help page to explain how to do this. I also changed the example to -in-law because the intent is more obvious (in English) than -happy or -ridden.

After reading over the help page, I realize that we should also change our advice here to not use just insource but also the quoted "plain" query so that we aren't running the insource regex over the entire document collection. (Example below.) The quoted plain query really cuts down on the number of documents the regex has to scan, and also excludes documents where the only match is in wikitext (e.g., in URLs or link text).

There can still be false positives, though, especially when the search term matches a key element of the entry, which then matches the name or URL of links or media on the page. For example, with the -happy search (see below), the first hit is the entry for happy which includes sound files with "-happy" in their name. However, trigger-happy, slap-happy, not-happy-Jan, dance-happy, sack-happy, shiv-happy, thrice-happy and I'm-so-happy-to-see-you-ing all show up on the first page of results.

I also realized that this could come up with the rare names that start with an exclamation point, like !Kung (where ! stands for a click).

So, the queries you should run would be like this:

  • "happy" insource:/-happy/i
  • "ridden" insource:/-ridden/i
  • "in law" insource:/-in-law/i
  • "kung" insource:/!kung/i
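
For anyone who wants to generate these queries rather than type them, here is a minimal Python sketch. The helper names `build_literal_query` and `search_url` are hypothetical, and the code assumes the term consists only of word characters and hyphens with an optional leading - or !, so that nothing in it needs regex escaping. The URL parameters are the ones that appear in the search links earlier in this task.

```python
import re
from urllib.parse import urlencode

# Assumption: only simple terms like -happy, -in-law, or !Kung are supported.
SAFE_TERM = re.compile(r"[!\-]?\w+(?:-\w+)*")


def build_literal_query(term):
    """Combine a quoted plain query with a case-insensitive insource regex."""
    if not SAFE_TERM.fullmatch(term):
        raise ValueError(f"unsupported term: {term!r}")
    lowered = term.lower()
    # Plain part: just the words, so the regex only scans pages containing them.
    plain = " ".join(re.findall(r"\w+", lowered))
    return f'"{plain}" insource:/{lowered}/i'


def search_url(term, host="en.wiktionary.org"):
    """Build a full-text Special:Search URL for the combined query."""
    params = {
        "search": build_literal_query(term),
        "title": "Special:Search",
        "fulltext": "1",
    }
    return f"https://{host}/w/index.php?" + urlencode(params)


for example in ["-happy", "-ridden", "-in-law", "!Kung"]:
    print(build_literal_query(example))
print(search_url("-happy"))
```

Terms containing regex metacharacters or slashes are rejected rather than escaped, since the escaping rules for insource regexes are not covered here.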

Given that the more complex query is much more performant, it makes the idea of giving documentation hints on the search page harder. I suppose we could do something like this, where "click here" is a link to the complicated search.

  • Did you mean to search for all articles without happy in them? If not, click here.
  • Did you mean to search for all articles without in-law in them? If not, click here.
  • Did you mean to search for all articles without kung in them? If not, click here.

Obviously, some more wordsmithing wouldn't hurt if we want to go that way.
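
As a rough idea of what the hint logic could look like, here is a Python sketch. It is not the CirrusSearch parser: the function name `negation_hint` is hypothetical, and it only recognizes a query that is exactly one negated term made of word characters and hyphens. It returns the message text plus the more complex query that "click here" would link to.

```python
import re

# Assumption: a "single-term negated query" is one - or ! followed by a simple term.
SINGLE_NEGATED = re.compile(r"([-!])(\w+(?:-\w+)*)")


def negation_hint(query):
    """Return (message, alternative_query) for a lone negated term, else None."""
    match = SINGLE_NEGATED.fullmatch(query.strip())
    if match is None:
        return None  # not a lone negated term, so no hint is needed
    prefix, word = match.groups()
    lowered = word.lower()
    # Quoted plain query narrows the pages the regex has to scan.
    plain = lowered.replace("-", " ")
    alternative = f'"{plain}" insource:/{prefix}{lowered}/i'
    message = (
        f"Did you mean to search for all articles without {word} in them? "
        "If not, click here."
    )
    return message, alternative


print(negation_hint("-happy"))
print(negation_hint("-in-law"))
print(negation_hint("!kung"))
print(negation_hint("sad -happy"))
```

Queries that mix positive and negated terms, such as sad -happy, return no hint, since they already behave the way most users expect.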

Two other things...

  1. Both -in-law and !kung work fine on English Wiktionary because they are exact title matches—so all this complexity isn't necessary for title/redirect matches. You still get all the negated matches, though. Not sure how I feel about that; for @TTO's use-case, all the negated matches are superfluous. In my use-case, the exact title match is superfluous! OTOH, using the Go feature (upper right search box) takes you right to the -in-law or !Kung page, so that works out well.
  2. This seems like a good topic for a blog post, with a title along the lines of "So -happy to meet you." @debt, what do you think?

Sounds like it'd be a great blog post, @TJones! :)

@debt, is there anything left to do for this task? I don't think we want to completely disable single-term negation searching because of the three use cases @dcausse outlined above (T170099#3429153).

I've added documentation to explain the problem and how to work around it, and written a blog post about it to help spread the word.

Remaining options seem to be:

  1. Change the negation syntax so -in-law and !kung aren't problems. This would break from standard practice of using dash as negation; not sure about exclamation.
  2. Return nothing but exact title matches in the case of single-term negated queries. I think this would require adding a third negation syntax for advanced users (e.g., NOT kung) to support the use cases David laid out.
  3. Return the "Did you mean to search for articles without <x> in them?" message + link to the complex query in the case of single-term negated queries.
  4. Call it good enough and decline/close the ticket.

#1 seems like a bad idea. #2 pushes the complexity from the naive user to the advanced user, but it reminds me of this, and one should not multiply entities without necessity. #3 seems doable, but the parser is a mess and adding anything in there may be more difficult than expected. #4 is definitely easiest, and I'd be particularly happy to push for it if @TTO agrees we've done enough.

For #1, #2, or #3, I guess we keep this ticket for prioritization.

We can move this to the later column — let's see if we can get traction on the #3 option when doing parsing work in Q3...or...decide it doesn't need to be done and if so, we can close this ticket then. Thanks for the great writeup and blog post! :)

I'm going to go ahead and close this. I don't think we're going to have time to explore option 3, and hopefully the documentation and the blog post can help people understand what's going on. Please re-open if you think it's closed in error.

It's all good, @TJones. Thanks to you and @debt for investigating this in such detail back in 2017. I remember I was very impressed when I read the comments on this task, and especially the blog post! I regret not expressing my gratitude or commenting here at the time.

@TTO—no worries, and thanks for the kind words! I'm always glad to help people understand search better, and to have an excuse to write a blog post about it!