[Research spike, 4 hours] Research Ukrainian language analyzers
Closed, Resolved (Public)

Description

Review the Ukrainian analyzers previously found and look for others. Then we'll test the analyzers to see whether they really are better.

Event Timeline

TJones updated the task description.

Ukrainian
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis-ukrainian.html (5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-ukrainian.html (5.2)
Elastic-supported plugin, based on Morfologik.

https://github.com/vhyza/elasticsearch-analysis-lemmagen (1 month)
https://bitbucket.org/hlavki/jlemmagen (2014)
LemmaGen: lemmatization for Ukrainian and 14 other languages, in Java
https://www.linkedin.com/pulse/efficient-search-your-local-language-roman-ora%C4%8D (2016)
Blog post on using LemmaGen (for Slovene)
The Ukrainian files claim to be "free", but I didn't find specific licensing info; this is based on the link from bitbucket.org to Multext-East: http://nl.ijs.si/ME/V4/

https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer (10 months)
Only supports up to ES 2.2.1; MIT license.
Based on https://issues.apache.org/jira/browse/LUCENE-7287 , this is what became the ES 5 plugin.

https://github.com/vgrichina/elasticsearch-ukrainian-stemmer (4 years)
no license; very old

There aren't many options out there. The ES Morfologik plugin seems to be the most popular by far and would be the easiest to support, so my current plan is to test it and, if it is good, run with it. If not, we can look back here for other options.
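
As a rough starting point for that testing, here is a minimal sketch of how one might query the Elasticsearch `_analyze` API with the `ukrainian` analyzer that the analysis-ukrainian plugin provides. The host URL and the helper names `build_analyze_request` and `token_strings` are assumptions made for illustration; they are not part of the plugin or of any existing tooling.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ukrainian", host="http://localhost:9200"):
    """Build (but do not send) a POST request to the _analyze API.

    The host is an assumed local test node; the "ukrainian" analyzer is only
    available once the analysis-ukrainian plugin is installed on that node.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response_json):
    """Pull the token strings out of a parsed _analyze response body."""
    return [t["token"] for t in response_json.get("tokens", [])]
```

Sending the request with `urllib.request.urlopen(build_analyze_request("..."))` and passing the parsed JSON to `token_strings` would list the tokens the analyzer emits, which could then be compared against the output of the default analyzer on the same text. This needs a running node with the plugin installed.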