[Epic, Q3 Goal, Q4 Goal] Research, test, and deploy new language analysers
Open, Normal, Public

Description

Two ways to start:

  • Languages that we really want to make big improvements on because we don't support them well (e.g. spaceless languages)
  • Test analysers that we know to be very mature (e.g. there's a Polish analyser that @dcausse knows about and likes)

Things to consider:

  • How much better the analyser is than what we've got
  • Maintainability of the code of the analyser
  • [add more!]

Languages/analyzers to consider (from T155549):

  • Polish: Elastic says theirs "provides high quality stemming for Polish", and it's probably easy. (T154516 / T154517)
  • Chinese: we really need this, and we know of SmartCN and others to consider. (T158202 / T158203)
  • Ukrainian: Elastic has one, though it only "provides stemming for Ukrainian" (no "high quality" claim); we're currently using Russian, which is better than nothing, but not at all great. (T160105 / T160106)
  • Hebrew: recently requested / suggested, and Elastic suggests HebMorph as well. (T162739 / T162741)
  • Japanese: we're using CJK analysis in production, which is just bigrams. Maybe Elastic's Kuromoji is better? (T166731)
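
To illustrate why "just bigrams" is a weak baseline for Japanese, here is a minimal sketch of overlapping character bigram tokenization. It is a simplified stand-in, not the actual CJK analysis code running in production, and the function name `cjk_bigrams` is hypothetical.

```python
def cjk_bigrams(text):
    """Split text into overlapping two-character tokens, ignoring whitespace.

    Simplified illustration only: real CJK analyzers also handle script
    detection, width normalization, and mixed Latin/CJK text.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A lone character is emitted as a single token.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# "東京都庁" (Tokyo Metropolitan Government Office)
print(cjk_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```

The middle token, 京都 (Kyoto), is produced even though the text is about Tokyo, so a search for Kyoto would match it. A morphological analyzer that segments real words would be expected to avoid this kind of false match.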

We've almost made it through this list. More to consider:


Deskana created this task. Jan 3 2017, 7:43 PM
Restricted Application added a subscriber: Aklapper. Jan 3 2017, 7:43 PM
Deskana renamed this task from [EPIC] Research, test, and deploy new language analysers to [Epic, Q3 Goal] Research, test, and deploy new language analysers. Jan 3 2017, 7:44 PM
Deskana triaged this task as Normal priority.
Deskana moved this task from Needs triage to Current work on the Discovery-Search board.
Deskana added a project: Epic.
TJones added a subscriber: TJones. Jan 11 2017, 6:04 PM
This comment was removed by TJones.
TJones updated the task description. Jan 11 2017, 6:40 PM

HebMorph was recommended by @Matanya. It was investigated some time ago by Matanya and Nik (@Manybubbles). It's being actively developed and Matanya knows the developer.

TJones updated the task description. Jan 24 2017, 9:18 PM
Restricted Application added a subscriber: Base. Jan 24 2017, 9:18 PM
TJones added a comment. Edited Jan 26 2017, 5:29 PM

While researching analyzers, I came across others. I didn't really investigate most of them, so this list is just a starting point for anyone who wants to look more closely at any of these.

General
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis.html (ES 5.1)
List of Elastic analysis plugins (internal and 3rd party): Japanese, several for Chinese, Polish, Ukrainian, Hebrew, Russian, English, Vietnamese, and some technical ones.

Polish
See T154516.

Chinese
See T158202.

Ukrainian
See T160105.

Hebrew
See T162739.

Japanese
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis-kuromoji.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-kuromoji.html (v6.0.0a)
test here (v?): http://www.atilika.org/

Vietnamese
https://github.com/duydo/elasticsearch-analysis-vietnamese (3 months)
linked by Elastic

Thai
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
ICU Analysis plugin, "including better analysis of Asian languages"
Mentioned elsewhere that it covers Thai as well.

Phonetic analysis
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-phonetic.html (v6.0.0a)
“Soundex, Metaphone, and a variety of other algorithms”, presumably English

Misc
https://github.com/yakaz/elasticsearch-analysis-combo (2 years)
combines multiple language analyzers

TJones updated the task description. Feb 14 2017, 9:56 PM
TJones updated the task description. Feb 15 2017, 3:54 PM
mxn added a subscriber: mxn. Apr 11 2017, 7:15 AM
TJones updated the task description. Apr 11 2017, 7:46 PM
TJones renamed this task from [Epic, Q3 Goal] Research, test, and deploy new language analysers to [Epic, Q3 Goal, Q4 Goal] Research, test, and deploy new language analysers. Apr 11 2017, 9:01 PM
TJones updated the task description. Fri, Jun 23, 6:41 PM