
[Research spike, 4 hours] Research Polish language analysers
Closed, Resolved (Public)

Description

To get the ball rolling on our language analyser work in Q3, we'll do some research for Polish language analysers. @dcausse knows of a good analyser for Polish, so we'll do some research to see if we can find anything better than that. Then, we'll test the analysers to see if they really are better.

Event Timeline

@dcausse Can you add a quick note here about that Polish language analyser you found, when you get the chance? Thanks!

It's Stempel. It is maintained by Elastic, so the plugin should stay up to date with new Elasticsearch releases.
Another alternative is https://github.com/monterail/elasticsearch-analysis-morfologik, but unfortunately it does not seem to be maintained. It is based on the same technology (Morfologik) as the new Ukrainian analyzer we'd like to use.

I'd like to add that the study should take into account the flexibility of the plugin, i.e. whether it is possible to access individual components such as the stemming filter. If the plugin provides only a full-featured analyzer, we may not be able to plug in our custom char filters/ICU filters... It's not strictly required, but it would be a nice plus when it comes to comparing multiple candidates.
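
To illustrate what that flexibility looks like in practice, here is a minimal sketch of index settings for a custom analyzer that reuses only the stemming filter of a plugin. It assumes the analysis-stempel plugin exposes its stemmer as the `polish_stem` token filter and that the analysis-icu plugin is installed for the `icu_normalizer` char filter. The analyzer name `polish_custom`, the helper function name, and the choice of tokenizer and lowercase filter are hypothetical, picked only for this example, and are not our actual configuration.

```python
import json


def build_polish_index_settings():
    """Build index settings with a custom Polish analyzer (illustrative only)."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # "polish_custom" is a hypothetical name chosen for this sketch.
                    "polish_custom": {
                        "type": "custom",
                        # Assumption: analysis-icu is installed and provides this char filter.
                        "char_filter": ["icu_normalizer"],
                        "tokenizer": "standard",
                        # Assumption: analysis-stempel provides "polish_stem" as a
                        # standalone token filter, separate from its "polish" analyzer.
                        "filter": ["lowercase", "polish_stem"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating an index.
    print(json.dumps(build_polish_index_settings(), indent=2))
```

If a plugin only offered a monolithic analyzer, there would be no entry like `polish_stem` to place in the `filter` list, and the char filter and tokenizer could not be swapped out this way.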

> It's Stempel. It is maintained by Elastic, so the plugin should stay up to date with new Elasticsearch releases.
> Another alternative is https://github.com/monterail/elasticsearch-analysis-morfologik, but unfortunately it does not seem to be maintained. It is based on the same technology (Morfologik) as the new Ukrainian analyzer we'd like to use.

Thanks. :-)

> I'd like to add that the study should take into account the flexibility of the plugin, i.e. whether it is possible to access individual components such as the stemming filter. If the plugin provides only a full-featured analyzer, we may not be able to plug in our custom char filters/ICU filters... It's not strictly required, but it would be a nice plus when it comes to comparing multiple candidates.

Good idea.

I went looking for morphological analyzers and found some for Polish, and incidentally several others that may be worth noting for later. The only potential new Polish analyzer I found is based on LemmaGen, and it looks like the data file for Polish in particular does not have a compatible license.

Polish
https://www.elastic.co/guide/en/elasticsearch/plugins/5.1/analysis-stempel.html (v5.1.2)
https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-stempel.html (v6.0.0a)
Stempel, recommended by Elastic and David.

http://stackoverflow.com/questions/28487630/how-to-use-custom-analyzer-in-elasticsearch (2015)
http://stackoverflow.com/questions/40946997/elasticsearch-polish-analysis-tokenizer-not-found (2016)
http://stackoverflow.com/questions/35039086/setting-analyzer-from-plugin-in-elasticsearch-with-nest (2016)
Notes on trouble installing the Polish/Stempel analyzer, for reference; I haven’t looked at them closely.

https://github.com/monterail/elasticsearch-analysis-morfologik (3 years)
https://github.com/antqa/elasticsearch-analysis-morfologik (1 month)
Morfologik: investigated by David and rejected; the antqa version has releases for ES 2.4.1, 5.0.0, and 5.1.1

https://github.com/vhyza/elasticsearch-analysis-lemmagen (<2 weeks)
https://bitbucket.org/hlavki/jlemmagen (2014)
LemmaGen, lemmatization for Polish +14 others, in Java
https://www.linkedin.com/pulse/efficient-search-your-local-language-roman-ora%C4%8D (2016)
Blog post on using LemmaGen (for Slovene)
Some data files have incompatible license, including Polish, based on link from bitbucket.org to Multext-East: http://nl.ijs.si/ME/V4/

Looks like Stempel is the way to go.

Notes on other analyzers I found along the way are posted in T154511#2973329

Deskana closed this task as Resolved. Jan 30 2017, 6:16 PM