Review Slovak Morphological Libraries
Closed, Resolved (Public)

Description

Based on research in T171652, look at the following in more detail as possible candidates for creating Elasticsearch language analyzer plugins.

The former is in Python but looks to be easily ported to Java if it is awesome, and the latter is already a Lucene analyzer.

Event Timeline

The full write-up is on MediaWiki.

Summary:

  • Both stemmers come from a common source; one has weird and unclear licensing, the other is MIT-licensed. The MIT-licensed one will need to be translated from Python to Java, but that is straightforward.
  • I updated my analysis analysis tools to handle an external stemmer better. Yay!
  • There is a "light" stemmer and an "aggressive" stemmer. Both are tested.
  • Slovak has a superlative prefix (the equivalent of the English suffix "-est"); handling of it is tested, too.
  • I've asked for speaker review on the Slovak Wikipedia and Wiktionary village pumps. More to come once that review happens.
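
As a rough illustration of the light/aggressive split and the superlative-prefix handling summarized above, here is a minimal Python sketch. It is not the code of either stemmer: the suffix lists, the minimum stem length, and all function names are hypothetical and chosen only to show the shape of the idea. The only real linguistic fact used is that the Slovak superlative prefix is "naj-".

```python
# Toy illustration only: the suffix lists below are hypothetical and are
# NOT taken from either Slovak stemmer discussed in this task.
SUPERLATIVE_PREFIX = "naj"  # Slovak superlative prefix, as in "najlepší"

# Assumed suffixes for the "light" mode (hypothetical).
LIGHT_SUFFIXES = ("ami", "och", "ov", "om", "a", "e", "i", "o", "u", "y")
# Assumed extra suffixes for the "aggressive" mode (hypothetical).
AGGRESSIVE_SUFFIXES = ("ejší", "ší", "ý", "á", "é")

MIN_STEM_LENGTH = 3  # assumed guard so that short words are left alone


def strip_superlative(word):
    """Remove the superlative prefix if enough of the word remains."""
    remaining = len(word) - len(SUPERLATIVE_PREFIX)
    if word.startswith(SUPERLATIVE_PREFIX) and remaining >= MIN_STEM_LENGTH:
        return word[len(SUPERLATIVE_PREFIX):]
    return word


def strip_suffix(word, suffixes):
    """Remove the longest matching suffix, keeping a minimum stem length."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem(word, aggressive=False, superlative=True):
    """Lowercase, optionally strip the superlative prefix, then strip one suffix."""
    word = word.lower()
    if superlative:
        word = strip_superlative(word)
    suffixes = LIGHT_SUFFIXES
    if aggressive:
        suffixes = LIGHT_SUFFIXES + AGGRESSIVE_SUFFIXES
    return strip_suffix(word, suffixes)
```

With these made-up lists, the light mode conflates "mesto" and "mestami" but leaves the adjective ending in "najlepší" alone, while the aggressive mode also strips that ending. Keeping the prefix step behind its own flag makes it easy to test its effect separately from the choice of suffix stripping.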