
Wrong stemming in German
Closed, Declined, Public

Description

In de.wikipedia, searching for ~Hamster finds (amongst others) "Häme", which is not related to "Hamster".

Event Timeline

FriedhelmW raised the priority of this task to Needs Triage.
FriedhelmW updated the task description.
FriedhelmW added a project: CirrusSearch.
FriedhelmW subscribed.

@FriedhelmW: Did you reproduce this problem on a local MediaWiki instance with its default search backend (if yes: which MediaWiki version?), or why did you add the project "MediaWiki-Search" to this task? Clarifying comment welcome. :)

bd808 triaged this task as Low priority. Feb 25 2015, 7:52 PM
bd808 subscribed.

I don't think this is fixable. Stemming is never perfect. Both Häme and Hamster stem to the same root: ham, which, though wrong, seems reasonable.

Häme, in addition to meaning "spite", is also the plural of Häm, meaning "heme" (a component of blood). German stems also seem to have umlauts removed. Hamster looks like an adjective with the -ster ending (compare höchster, "highest").
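To make the collision concrete, here is a minimal sketch of the two effects described above (umlaut folding and suffix stripping). The function `toy_german_stem` and its suffix list are invented for illustration; this is not the algorithm the Elastic German analyzer actually uses, only a toy that happens to reproduce the same collision.

```python
def toy_german_stem(word: str) -> str:
    """Toy stemmer for illustration only; NOT the real analyzer's algorithm."""
    w = word.lower()
    # Fold umlauts to their base vowels (assumption: mirrors the
    # "umlauts removed" behaviour described above).
    for src, dst in (("ä", "a"), ("ö", "o"), ("ü", "u")):
        w = w.replace(src, dst)
    # Strip at most one suffix, trying longer suffixes first, and keep
    # at least three letters of stem. The suffix list is hypothetical.
    for suffix in ("ster", "en", "er", "es", "e", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word in ("Hamster", "Häme", "Häm", "höchster"):
    print(word, "->", toy_german_stem(word))
```

Under these toy rules, Hamster, Häme, and Häm all end up with the same stem, so a stemmed index cannot tell them apart.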

These kinds of stemming collisions happen in every language. In addition to stemming the text (in the "text" field) we keep a copy of the unstemmed original (in the "plain" field), which allows us to rank exact matches more highly when they occur.
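As a rough sketch of how keeping both a stemmed and an unstemmed copy can favor exact matches: the stem table, the `score` function, and the boost value below are all made up for illustration and are not how Cirrus actually computes scores.

```python
# Hypothetical stand-in for the analyzer: maps surface forms to stems.
TOY_STEMS = {"hamster": "ham", "häme": "ham", "häm": "ham"}


def stem(token: str) -> str:
    return TOY_STEMS.get(token, token)


def score(query: str, doc_text: str, plain_boost: float = 2.0) -> float:
    """Count stemmed ("text") matches plus boosted exact ("plain") matches."""
    q = query.lower()
    tokens = doc_text.lower().split()
    plain_hits = sum(1 for t in tokens if t == q)              # unstemmed copy
    stem_hits = sum(1 for t in tokens if stem(t) == stem(q))   # stemmed copy
    return stem_hits + plain_boost * plain_hits


print(score("Hamster", "Der Hamster frisst"))   # exact match: counted in both fields
print(score("Hamster", "Häme sind Stoffe"))     # stemmed match only
print(score("Hamster", "Häme Häme Häme Häme"))  # many stemmed matches
```

In this toy model a single exact match beats a single stemmed match, but a document with enough stemmed matches can still outscore it, which is why a boost on the plain field helps without guaranteeing the order.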

Theoretically, Hamster could be added to a list of exceptions that don't get stemmed, or that get stemmed by dictionary lookup rather than by rule, but this should happen upstream in the Elastic language analyzer.
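For reference, a sketch of what such an exception list could look like in Elasticsearch index settings, written here as a Python dict. It uses the `keyword_marker` token filter, which marks tokens so that later stemmers skip them. The analyzer name `german_with_exceptions` and the filter names are placeholders, and the filter chain is a simplified assumption; this is not what CirrusSearch is configured with.

```python
import json

# Hypothetical index settings; the names below are placeholders.
settings = {
    "analysis": {
        "filter": {
            "german_exceptions": {
                "type": "keyword_marker",
                # Lowercase, because this filter runs after "lowercase".
                "keywords": ["hamster"],
            },
            "german_stemmer": {"type": "stemmer", "language": "light_german"},
        },
        "analyzer": {
            "german_with_exceptions": {
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "german_exceptions",  # must come before the stemmer
                    "german_normalization",
                    "german_stemmer",
                ],
            }
        },
    }
}

chain = settings["analysis"]["analyzer"]["german_with_exceptions"]["filter"]
print(json.dumps(settings, indent=2, ensure_ascii=False))
```

The order of the filter chain matters: the exception filter has to run before the stemmer, otherwise the token has already been stemmed by the time it would be protected.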

More generally, you can prevent stemming by using quotes around a word, as with "Hamster". Hopefully that's good enough for most practical needs.

debt subscribed.

Based on @TJones's comment above, we'll go ahead and decline this; using quotes around the word(s) seems to be a good workaround.

@TJones: Thanks for responding. Why doesn't Cirrus rank an exact match much higher than a stemmed match? Google generally gives better results than Cirrus, even without using quotes.

This comment was removed by TJones.

(Sorry, I accidentally submitted my half-written comment via unexpected keyboard combo.)

In T87112#3375992, @FriedhelmW wrote:

Why doesn't Cirrus rank an exact match much higher than a stemmed match?

Cirrus takes a lot of factors into account to rank the results: exact match, title match, number of matched words in the article, and article popularity (including views and links). The presumably best (or at least simplest) match is the exact title match "Hamster". Other articles with "Hamster" in the title are ranked highly, too. The "Häme (Stoffgruppe)" article has a stemmed title match and a very high density of Häme in it, which is probably what pushed it up the list.

Google generally gives better results than Cirrus, even without using quotes.

Google generally gives better search results than everyone. They also have net income in the tens of billions of dollars and an amazing amount of data and compute power—so a small team of open source developers can't really compete. We rely on other open source developers (yay, Elasticsearch!) to provide the core search engine and language analysis functionality, and we adapt it as best we can to meet the needs of the users of Wikipedia, et al., while trying to support the goals of open knowledge, open source software, and maximal privacy.

We're a little off topic here, but I've come to accept that we're never going to be in the same league as Google, and my goal is just to do as much as I can to improve search specifically and open knowledge in general with the resources we have. I think we do really well on a per-dollar basis!

I'm always open to whatever we can realistically improve, but tuning at the level of individual words across even 10 languages (much less ~280) just isn't tenable for a team of our size.