Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.
- Vabamorf https://github.com/Filosoft/vabamorf
Status | Assigned | Task
---|---|---
Invalid | None | T174065 [FY 2017-18 Objective] Improve support for searching in multiple languages
Open | None | T154511 [Tracking] Research, test, and deploy new language analyzers
Resolved | TJones | T171652 Language Analysis Morphological Library Research Spike
Resolved | TJones | T178928 Review Estonian Morphological Libraries
I took a closer look at the Vabamorf repository, and it is more complex than my earlier survey revealed. The core is written in C++, and the Java integration is just a JNI wrapper. Based on discussions with @Gehel and @dcausse, the complexities of any JNI implementation make it a non-starter; these include the lack of multi-architecture support, increased complexity of garbage collection, and the need to punch holes in standard Elastic security. The code itself is also very complicated and does not look readily portable to Java.
I did not evaluate the linguistic performance of the stemmer. To reiterate a lesson I've learned over the course of working on language analyzers: the linguistic quality of a morphological analyzer is far from the only important factor in deciding whether to pursue it. Licenses are of huge importance, and, as in this case, programming language, architecture, and other technical details matter a lot, too.
I've added looking into building a stemmer from linguistic data (for Estonian or other languages) to my 10% project pile, but I don't expect to see anything from that soon, even if I start working on it as my next project.