Create Slovak Elasticsearch Plugin/Analysis Chain Using Slovak Stemming Algorithm
Closed, Resolved (Public)

Description

I'm still waiting to see whether I get any more feedback on the Slovak stemming algorithm, but it seems clear that the "light" stemmer does a good job and that stripping the naj- prefix is helpful. Some implementation details could change (for example, also stripping pod-, or changing whether prefixes are stripped before or after the inflectional suffixes), but the basic setup is clear.
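
As a rough illustration of the naj- stripping step only, here is a minimal Python sketch. It is not the stemmer's actual code: the function name and the `min_stem_len` guard are assumptions made for this example, and the suffix-stripping rules are left out entirely.

```python
def strip_superlative_prefix(word, prefix="naj", min_stem_len=3):
    """Remove a leading `naj` when enough of the word remains.

    `min_stem_len` is a hypothetical guard so that very short words
    are not reduced to nothing; the real stemmer may use a different rule.
    """
    if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
        return word[len(prefix):]
    return word


print(strip_superlative_prefix("najlepší"))  # lepší
```

Keeping the prefix as a parameter makes it easy to try the same step with pod- or to call it before or after suffix stripping when comparing the orderings mentioned above.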

So, the next step is to add an Elasticsearch plugin based on the stemmer to search/extra (which is license-compatible) and test it as part of an analysis chain.

Event Timeline

TJones triaged this task as Normal priority. Mar 27 2018, 2:34 PM
TJones created this task.

Change 423043 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra@master] [WIP] Create Slovak Elasticsearch Plugin

https://gerrit.wikimedia.org/r/423043

TJones added a comment. Apr 2 2018, 8:59 PM

The update to the search/extra plugin above is a work in progress because it does not yet contain unit tests. However, I was able to use the plugin to test the full analysis chain. The write-up is on MediaWiki. The key points:

  • The Elasticsearch stemmer behaved exactly like the command-line stemmer.
  • Adding ICU folding, with exceptions for Slovak characters, looks to be a net positive.
  • Next steps: deploy the updated search/extra plugin (when ready), deploy the analyzer config after the plugin is deployed, and re-index the Slovak-language wikis.
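
For illustration only, an analysis chain of the kind described above might be configured roughly as follows. This is a sketch, not the deployed config. The `slovak_stemmer` filter type is the name used in the plugin documentation change, and `icu_folding` with `unicode_set_filter` comes from the Elasticsearch ICU analysis plugin. The analyzer and filter names, the filter order, and the exact list of preserved characters are assumptions.

```python
import json

# Assumption: the letters that folding should leave untouched. The actual
# exception list is in the write-up, not in this task. Only lowercase forms
# are listed because the lowercase filter runs first in this sketch.
SLOVAK_LETTERS = "áäčďéíĺľňóôŕšťúýž"


def build_slovak_analysis(preserve=SLOVAK_LETTERS):
    """Return a hypothetical Elasticsearch `analysis` settings block."""
    return {
        "analysis": {
            "filter": {
                # Filter type provided by the search/extra plugin.
                "slovak_stem": {"type": "slovak_stemmer"},
                # Fold everything except the preserved characters.
                "slovak_icu_folding": {
                    "type": "icu_folding",
                    "unicode_set_filter": "[^" + preserve + "]",
                },
            },
            "analyzer": {
                "slovak_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Order is an assumption made for this sketch.
                    "filter": ["lowercase", "slovak_stem", "slovak_icu_folding"],
                },
            },
        }
    }


print(json.dumps(build_slovak_analysis(), ensure_ascii=False, indent=2))
```

The printed JSON could be supplied as index settings when creating a test index, provided both the search/extra plugin and the ICU analysis plugin are installed.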

Change 423043 merged by jenkins-bot:
[search/extra@master] Create Slovak Elasticsearch Plugin

https://gerrit.wikimedia.org/r/423043

Change 428395 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra@master] Add documentation for slovak_stemmer

https://gerrit.wikimedia.org/r/428395

Change 428395 merged by jenkins-bot:
[search/extra@master] Add documentation for slovak_stemmer

https://gerrit.wikimedia.org/r/428395

debt closed this task as Resolved. May 1 2018, 6:16 PM