CirrusSearch seems to stem the word "used" to "us"!
Closed, Resolved, Public

Assigned To

Authored By

	• Manybubbles
	Sep 11 2013, 4:55 PM

Description

CirrusSearch seems to stem the word "used" to "us" sometimes!

<elasticsearch>/nikwiki_general/_analyze?analyzer=text&text=used returns
{
  "tokens": [
    {
      "token": "us",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
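The check above can be scripted for regression testing. What follows is only a minimal sketch: the helper names `build_analyze_url` and `tokens_from_response` and the base URL `http://localhost:9200` are assumptions made for illustration, and no request is sent; the sample response is the one pasted in this report.

```python
import json
from urllib.parse import urlencode


def build_analyze_url(base_url, index, analyzer, text):
    """Build an _analyze URL in the same form used in this report."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url.rstrip('/')}/{index}/_analyze?{query}"


def tokens_from_response(body):
    """Return the token strings from an _analyze JSON response body."""
    return [entry["token"] for entry in json.loads(body)["tokens"]]


# Response copied from this report (analyzer=text, text=used).
SAMPLE_RESPONSE = """
{"tokens": [{"token": "us", "start_offset": 0, "end_offset": 4,
             "type": "<ALPHANUM>", "position": 1}]}
"""

# The base URL is a placeholder, not the real cluster address.
url = build_analyze_url("http://localhost:9200", "nikwiki_general", "text", "used")
tokens = tokens_from_response(SAMPLE_RESPONSE)
print(url)
print(tokens)
```

Comparing `tokens` against an expected stem for a list of problem words would give a quick way to spot over-stemming after an analyzer change.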

Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54875

Details

Reference: bz54022

Related Objects

Mentioned In: T56875: Automatic stopwords for the 200+ languages without their own analyzer available

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 2:08 AM

• bzimport added a project: CirrusSearch.

• bzimport set Reference to bz54022.

• Manybubbles created this task. Sep 11 2013, 4:55 PM

I might be able to fix this by switching stemmers. I'll do some more research tomorrow.

Change 86854 had a related patch set uploaded by Manybubbles:
Tests for places where kstem beats porter stemmer.

https://gerrit.wikimedia.org/r/86854

Switching stemmers.

Implementation: https://gerrit.wikimedia.org/r/#/c/86853/
Regression tests: https://gerrit.wikimedia.org/r/#/c/86854/

Merged.

"The kstem token filter is a high performance filter for english"
http://www.elasticsearch.org/guide/reference/index-modules/analysis/kstem-tokenfilter/

So I don't need to test what the effects are of this change for other languages?

Change 86854 merged by jenkins-bot:
Tests for places where kstem beats porter stemmer.

https://gerrit.wikimedia.org/r/86854

Right, this only affects English.

Unfortunately (or fortunately for a small set of use cases) there aren't as many different options for languages other than English. I believe we have five options, in order of how much they increase recall and decrease precision:

No stemming
Minimal (just possessives)
KStem
Porter Stemmer
Porter Stemmer via Snowball

A few other languages have "minimal" (or "light") stemmers in addition to their more aggressive versions. In all cases other than English at this point we use the Elasticsearch default which is the more aggressive version.

Switching from the Elasticsearch default to a customized version isn't hard and we're totally willing to do it.
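To illustrate what switching stemmers could look like in index settings, here is a hedged sketch. It is not the actual CirrusSearch configuration (that lives in the Gerrit changes linked above); `build_analysis_settings` is a hypothetical helper, the analyzer name `text` is taken from the `_analyze` call in the description, and the `standard` tokenizer plus `lowercase` filter are assumed. `kstem` and `porter_stem` are Elasticsearch token filter names, though availability may vary by version.

```python
def build_analysis_settings(stemmer_filter=None):
    """Return index settings defining a custom analyzer named "text".

    stemmer_filter is the name of an Elasticsearch token filter such as
    "kstem" or "porter_stem"; None means no stemming filter is added.
    """
    filters = ["lowercase"]
    if stemmer_filter is not None:
        filters.append(stemmer_filter)
    return {
        "analysis": {
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",  # assumed, not from the report
                    "filter": filters,
                }
            }
        }
    }


kstem_settings = build_analysis_settings("kstem")
porter_settings = build_analysis_settings("porter_stem")
unstemmed_settings = build_analysis_settings()
```

Only the last entry of the filter chain differs between the variants, which is why a per-language override of the default is a small change.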

Sorry for going off-topic with my stupid questions; mainly I'd like to make a list of possible weaknesses, e.g. for Italian analysis, so that users can specifically test them a bit.

(In reply to comment #7)

Right, this only affects English.

Unfortunately (or fortunately for a small set of use cases) there aren't as many different options for languages other than English. I believe we have five options, in order of how much they increase recall and decrease precision:

No stemming
Minimal (just possessives)
KStem
Porter Stemmer
Porter Stemmer via Snowball

A few other languages have "minimal" (or "light") stemmers in addition to their more aggressive versions. In all cases other than English at this point we use the Elasticsearch default which is the more aggressive version.

Our default is standard, i.e. http://www.elasticsearch.org/guide/reference/index-modules/analysis/standard-tokenizer/ , or the language default for those which have one ( http://www.elasticsearch.org/guide/reference/index-modules/analysis/lang-analyzer/ ), so the stopwords we're using are those linked from http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer/ ?

Switching from the Elasticsearch default to a customized version isn't hard and we're totally willing to do it.

Good! I guess you'll need help from native speakers, and they'll need some pointers from the docs on how to help.
30 languages < 285, so maybe – when you start expanding to many languages – cutoff_frequency can be used as a starting point to replace stopword lists where none is available, as mentioned in https://gibrown.wordpress.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/ ? That would be a possible enhancement to file separately.

Yeah, it is probably worth opening a new bug with specific things, but you are right about help from native speakers.

As far as stopwords go, there is a thing in Elasticsearch called a common_terms query that can be used to kind of simulate having stopwords. In some respects it is better than having stopwords, so folks can turn them off and use it instead. But getting it working with the query syntax that we use now is going to be rough.
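For reference, a rough sketch of what such a request body might look like. The `common` query type with a `cutoff_frequency` parameter is how I recall the Elasticsearch query DSL of that era (it has been removed from later releases), so treat the shape as an assumption to check against the docs for the version in use. The helper name `common_terms_query`, the field name `text`, and the value 0.001 are purely illustrative.

```python
def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a request body for a common terms query on one field.

    Terms whose frequency exceeds cutoff_frequency are treated as
    "common" and handled much like stopwords, without needing a
    per-language stopword list.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


# Hypothetical field name and search text, for illustration only.
body = common_terms_query("text", "the word used in a sentence")
print(body)
```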

Additionally, we probably want to turn CirrusSearch on even for languages that aren't in that 30, mostly because we're likely to be better than lucene-search. Except in Esperanto.

Verified on test2wiki.

• Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board. Apr 20 2015, 4:12 AM

Liuxinyu970226 mentioned this in T56875: Automatic stopwords for the 200+ languages without their own analyzer available. Sep 24 2018, 7:21 AM
