
multi term search does not return expected results
Open, Medium, Public

Event Timeline

Tnegrin raised the priority of this task from to Needs Triage.
Tnegrin updated the task description.
Tnegrin added a project: MediaWiki-Search.
Tnegrin subscribed.
Tnegrin set Security to None.

Hi Nik -- here's the broken query I mentioned on Friday. Thanks for taking a look and let me know if you need further info.

-Toby

It's a synonym problem. "us automobile production" finds what you expect as the top hit. We don't do synonyms right now. It was something that I'd wanted to work on and would have gotten around to eventually, but it's not as high on the list as the Wikidata query service. You can manually fix this by adding a redirect from "u.s. car production" to the page, but that's a bit lame. We should be able to figure stuff like that out automatically. In all languages too, given that we could mine Wiktionary.

Aklapper added a subscriber: Manybubbles.

[ Resetting assignee as assignee account is not active anymore ]

debt triaged this task as Medium priority. Jul 13 2017, 5:29 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added subscribers: TJones, EBernhardson, debt.

We'll need to get some serious research work into this. It might be interesting for @TJones to take a look first. :)

Here's the note from @EBernhardson from the merged ticket:

It might be useful to include synonyms in search to improve recall, for example if a search for car were transformed into (car OR automobile). Elasticsearch supports doing this as part of the analysis phase of indexing, but we would need a list of synonyms to work with. We could consider using WordNet, or perhaps extracting data from Wiktionary.
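To make the (car OR automobile) idea concrete, here is a minimal sketch of query-time expansion in plain Python. It is not how Elasticsearch or CirrusSearch does it; the `SYNONYMS` table and the `expand_query` name are made up for illustration, and a real list would have to come from somewhere like WordNet or Wiktionary.

```python
# Hypothetical, hand-written synonym table (an assumption for this sketch).
SYNONYMS = {
    "car": ["automobile"],
    "automobile": ["car"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Rewrite every term that has known synonyms as an OR group."""
    parts = []
    for term in query.lower().split():
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(term)
    return " ".join(parts)


print(expand_query("us car production"))
```

Terms without an entry pass through unchanged, so a small table only touches the queries it was written for.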

This is also being discussed - but with a slightly different usage - in this conversation: https://www.mediawiki.org/wiki/Topic:Tti9vgefpnaztmol

Issue
As a user I'd like to be presented with suggestions to improve my search.

Background
Currently search depends entirely on a word either matching the search terms or matching the title of a page. This reduces the usefulness of the search when a word can mean many things. For example, someone looking for "trunk" may mean a proboscis ("elephant's trunk"), a boot (a part of a car), a part of a tree, a part of a body, and so forth.

Proposed solution
Extract these from the Wiktionary page with a matching title, much like the widget described at Cross-wiki Search Result Improvements/self-guided testing#Wiktionary. For example: https://en.wiktionary.org/wiki/trunk

Provide a search suggestion: "you may be interested in: proboscis, boot", using words extracted from the Synonyms sub-heading.

Considering the different wiktionaries and different headings or rules in each wiki, this may not be feasible until there is some way to store these in a structured manner.

Even so, just showing the contents under the Synonyms heading (and similar headings in other Wiktionaries) would be a good short-term improvement.
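As a rough sketch of the short-term version, the snippet below pulls linked words out of a "Synonyms" section of wikitext. The sample text is hand-written and much simpler than a real Wiktionary entry (real entries often use templates rather than plain links, and other Wiktionaries use different headings), and `extract_synonyms` is a hypothetical helper, not an existing API.

```python
import re

# Hand-written, simplified wikitext (an assumption, not the real "trunk" entry).
SAMPLE_WIKITEXT = """\
===Noun===
# The extended nasal organ of an elephant.

====Synonyms====
* [[proboscis]]
* [[boot]]

====Translations====
* French: [[trompe]]
"""


def extract_synonyms(wikitext):
    """Collect [[link]] targets that appear under a heading named Synonyms."""
    synonyms = []
    in_section = False
    for line in wikitext.splitlines():
        heading = re.match(r"^(=+)\s*(.*?)\s*=+\s*$", line)
        if heading:
            # Any heading starts a new section; only "Synonyms" is of interest.
            in_section = heading.group(2).lower() == "synonyms"
            continue
        if in_section:
            synonyms.extend(re.findall(r"\[\[([^\]|#]+)", line))
    return synonyms


words = extract_synonyms(SAMPLE_WIKITEXT)
print("You may be interested in: " + ", ".join(words))
```

Relying on a heading name is exactly the fragile part mentioned above: it only works where the heading text and link style happen to match.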

This comment was removed by TJones.

[Once again, I've fat-fingered a half-written comment and had to delete it so I can finish my magnum opus! Sorry.]

Ugh, horribly, this has gotten worse over time: the desired result is no longer first for us automobile production. Two notes:

  • I'm going to blame word_break_helper which maps periods (and other things) to spaces, splitting up "U.S." in the desired title to "U" and "S", which does not match "us".
  • The desired article is the first and only suggestion from the completion suggester, which is matching a period-less redirect.

And thus my feelings about word_break_helper (yuck) and the completion suggester (yay!).
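The period problem can be reproduced with a toy tokenizer. This is only a stand-in: the real word_break_helper is part of the search analysis chain and maps more than periods, the title string below is illustrative, and the "drop the periods" branch is there purely for comparison, not as a proposal.

```python
def tokenize(text, periods_to_spaces):
    """Lowercase and split on whitespace, handling periods one of two ways."""
    text = text.lower()
    if periods_to_spaces:
        # Stand-in for the word_break_helper behaviour described above.
        text = text.replace(".", " ")
    else:
        # Comparison only: delete periods so the acronym stays in one piece.
        text = text.replace(".", "")
    return text.split()


title = "U.S. automobile production"  # illustrative title, not the real page name
query = "us automobile production"

print(tokenize(title, periods_to_spaces=True))
print(tokenize(title, periods_to_spaces=False))
print(tokenize(query, periods_to_spaces=True))
```

With periods mapped to spaces the title yields separate "u" and "s" tokens, so the query token "us" has nothing to match.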

So, I think there are a few issues here:

  • using synonyms in search
  • using synonyms in suggestions
  • is word_break_helper even helping? (See T170625.)

While there is a common notion of synonyms (or a thesaurus), I think we should split up the topics of using a thesaurus for search and using one for suggestions. A thesaurus for search that is used automatically needs to be more tightly controlled than one used for suggestions, which are easier to skip over.

Enabling a thesaurus for searches is fraught with complications. Unfiltered WordNet is probably a bad idea; it is too complete and includes rare and archaic senses that are more likely to generate noise than not. Wiktionary might have the same problem, and definitely has a problem with being only semi-structured and thus hard to parse. I took a look at the pre-Cirrus/pre-Elastic search engine, and it only had one synonym entry: movie/film. (I support bringing that one back!)

I would assume we'd have some way to toggle the default thesaurus status, whether that is enabled or disabled. Unsophisticated newbie users are not going to know how to toggle it, so if it is on by default, it needs to be conservative so they don't get overrun with extra clutter they can't control. If it is off, they will probably never find it, even though they probably need it more than anyone else. Also, we probably shouldn't use quotes as the only way to disable the thesaurus, since that also disables language analysis. Just because I don't want lawyer to match attorney doesn't mean I don't want it to match lawyers.
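Here is a small sketch of why the thesaurus switch should be separate from language analysis. Everything in it is an assumption for illustration: the one-entry `THESAURUS`, the toy `stem` function (real analysis is far richer than stripping an "s"), and the `use_thesaurus` flag.

```python
# Hypothetical one-entry thesaurus, keyed by stemmed form.
THESAURUS = {"lawyer": {"attorney"}}


def stem(word):
    """Toy stemmer: strip a trailing plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def matches(query_term, doc_term, use_thesaurus=True):
    """Compare stemmed terms; consult the thesaurus only when it is enabled."""
    q = stem(query_term.lower())
    d = stem(doc_term.lower())
    if q == d:
        return True
    if use_thesaurus:
        return d in {stem(s) for s in THESAURUS.get(q, set())}
    return False


print(matches("lawyer", "lawyers", use_thesaurus=False))
print(matches("lawyer", "attorney", use_thesaurus=False))
print(matches("lawyer", "attorney", use_thesaurus=True))
```

Turning the thesaurus off here still lets lawyer match lawyers, which is the behaviour that quoting the term would lose.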

So, I'd recommend a small, conservative, hand-curated, on-by-default thesaurus for searches. If that's not possible (because of the hand-curated part, esp. as it relates to all the languages we support), then I'd recommend either off-by-default or only using the thesaurus for suggestions, so that the clutter is kept to a minimum.

We'd also have to think about how this interacts with Learn to Rank. (We may have to get used to saying that a lot—but in a good way!) Esp. if the thesaurus is on-by-default, "matched a synonym" is probably a good LTR feature, and the newly introduced results might require a retraining, depending on how many of them there are.

"recommend a small, conservative, hand-curated, on-by-default thesaurus for searches"

I've seen this approach work well. It's time-consuming to create, but priceless when it's done.