Create an ordered list of languages we want to find new analysers for
Closed, Resolved · Public

Assigned To

Authored By

	• Deskana
	Jan 17 2017, 9:09 PM

Description

This quarter we're researching new language analysers (see relevant mailing list post). We have some ideas for what languages need new analysers (e.g. Polish, Chinese, Hebrew, etc.) but we'd like to take a more structured look at the problem to see where we can best focus our efforts. For the sake of expediency, we'll still be starting with Polish (see T154516).

Sadly, this is something that's hard to involve volunteers in. Native speakers are not especially hard for us to find, but most users are unfamiliar with things like tokenisation and n-grams, so explaining exactly what we want from them is fairly hard.

Related Objects

Status	Assigned	Task
Invalid	None	T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open	None	T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved	TJones	T155549 Create an ordered list of languages we want to find new analysers for

Event Timeline

• Deskana created this task. Jan 17 2017, 9:09 PM

Restricted Application added a subscriber: Aklapper. Jan 17 2017, 9:09 PM

• Deskana triaged this task as Medium priority. Jan 17 2017, 9:10 PM

• Deskana moved this task from needs triage to Current work on the Discovery-Search board.

• Deskana edited projects, added Discovery-Search (Current work); removed Discovery-Search.

• Deskana added a parent task: T154511: [Tracking] Research, test, and deploy new language analyzers. Jan 17 2017, 9:12 PM

It seems like potential easy wins are the Elastic Core Plugins, which include Polish (Stempel) and Ukrainian (Morfologik), languages that have been mentioned before as needing improvement. They also include Japanese (Kuromoji) and Chinese (SmartCN), though I have a vague recollection that those may not perform very well.

That page also lists a Hebrew Analysis plugin, and we have the other one, mentioned in the parent Epic (T154511).

So my first draft of a list would be:

Polish: Elastic says it "provides high quality stemming for Polish", and it's probably easy.
Chinese: we really need this, and we know of SmartCN and others to consider.
Ukrainian: Elastic has one, though it only "provides stemming for Ukrainian" (no "high quality" claim); we're currently using Russian, which is better than nothing, but not at all great.
Hebrew: recently requested / suggested, and Elastic suggests HebMorph as well.
Japanese: we're using CJK analysis in production, which is just bigrams. Maybe Kuromoji is better?
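
For anyone unfamiliar with what "just bigrams" means for the Japanese entry, here is a minimal sketch of overlapping character bigrams. The function name `cjk_bigrams` is hypothetical and this is only an illustration of the idea, not the actual Elasticsearch/Lucene CJK analysis code, which also deals with script detection and mixed-script text.

```python
def cjk_bigrams(text):
    """Split a run of CJK characters into overlapping two-character tokens.

    Illustrative assumption: the input is a single run of CJK characters
    with no spaces, punctuation, or Latin text mixed in.
    """
    if len(text) < 2:
        # A lone character has no neighbour to pair with; keep it as-is.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# "東京都庁" is the Tokyo Metropolitan Government Office.
print(cjk_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

Note that the middle token, 京都 (Kyoto), is not a word in the original text. It exists only because the bigrams overlap, which is the kind of false match a dictionary-based tokenizer such as Kuromoji might avoid.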

None of these are too far off the beaten path since Elastic recommends all of them; we'll gain some expertise and learn how to do this process better, especially working with the community for review and evaluation, while hopefully not having too many technical hurdles to deal with.

Also, I don't think we'll finish all 5 by the end of the quarter, but let's see how Polish goes.

In sprint planning, we discussed the list and it looks good - @TJones will start with Polish.

@dcausse noted that SmartCN couldn't handle both traditional and simplified Chinese (so we'll have to see if that's still the case), and that there is another Polish analyzer using Morfologik (same framework used for Elastic's suggested Ukrainian) but it isn't mature enough to use.

I think this task is done, since it's just to establish the list, right?

TJones mentioned this in T154511: [Tracking] Research, test, and deploy new language analyzers. Jan 24 2017, 9:18 PM

TJones moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Thanks for the list.
I consider this task done; more precise questions will have to be answered in research tasks like T154516.

• Deskana closed this task as Resolved. Jan 30 2017, 6:13 PM
