Review Estonian Morphological Libraries
Closed, Resolved, Public


Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

Event Timeline

TJones added subscribers: Gehel, dcausse.

I took a closer look at the Vabamorf repository, and it is more complex than my earlier survey revealed. The core is written in C++, and the Java integration is just a JNI wrapper. Based on discussions with @Gehel and @dcausse, the complexities of any JNI implementation make it a non-starter: they include the lack of multi-architecture support, increased complexity of garbage collection, and the need to punch holes in standard Elastic security. The code itself is also very complicated and does not look readily portable to Java.

I did not evaluate the linguistic performance of the stemmer. To reiterate a lesson I've learned over the course of working on language analyzers: the linguistic quality of a morphological analyzer is far from the only important factor in deciding whether to pursue it. Licenses are of huge importance, and, as in this case, programming language, architecture, and other technical details matter a lot, too.

I've added looking into building a stemmer from linguistic data (for Estonian or other languages) to my 10% project pile, but I don't expect to see anything from that soon, even if I start working on it as my next project.

debt subscribed.

Thanks for looking, @TJones