Review Estonian Morphological Libraries
Closed, ResolvedPublic

Description

Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

Event Timeline

TJones moved this task from needs triage to Up Next on the Discovery-Search board.
TJones moved this task from Up Next to Current work on the Discovery-Search board. Jun 5 2018, 5:24 PM
TJones added subscribers: Gehel, dcausse.

I took a closer look at the Vabamorf repository, and it is more complex than my earlier survey revealed. The core is written in C++, and the Java integration is just a JNI wrapper. Based on discussions with @Gehel and @dcausse, the complexities of any JNI implementation (including the lack of multi-architecture support, the increased complexity of garbage collection, and the need to punch holes in standard Elastic security) make it a non-starter. The code itself is also very complicated and does not look readily portable to Java.

I did not evaluate the linguistic performance of the stemmer. To reiterate a lesson I've learned over the course of working on language analyzers: the linguistic quality of a morphological analyzer is far from the only important factor in deciding whether to pursue it. Licenses are hugely important, and, as in this case, the programming language, architecture, and other technical details matter a lot, too.

I've added looking into building a stemmer from linguistic data (for Estonian or other languages) to my 10% project pile, but I don't expect to see anything from that soon, even if I start working on it as my next project.

debt closed this task as Resolved. Jun 14 2018, 8:01 PM
debt added a subscriber: debt.

Thanks for looking, @TJones