Create Serbian Elasticsearch Plugin/Analysis Chain Using Serbian Morphological Libraries
Closed, Resolved (Public)

Description

The consensus is that SCStemmers "Stemmer #4" works well for Serbian Wikipedia and Wiktionary text after a Cyrillic-to-Latin mapping upgrade. It doesn't lose any tokens, and it handles non-Serbian text well.

So, the next step is to create an Elasticsearch plugin based on the stemmer and test it as part of an analysis chain. Other elements of the analysis chain include folding the diacritics that are used primarily to mark pitch accent (these appear in dictionary and encyclopedia entries, but not in normal text), and probably additional general ICU folding.
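As a rough illustration of the accent-folding idea, here is a minimal Python sketch that strips combining marks (such as pitch-accent diacritics) while leaving the Serbian Latin letters č, ć, đ, š, ž untouched. It is a simplified stand-in, not ICU folding itself (which also handles case and many other character classes), and the function name `fold_accents` and the exact set of protected letters are assumptions made for this example.

```python
import unicodedata

# Assumption for this sketch: these Serbian Latin letters must survive folding,
# because their diacritics distinguish real letters rather than pitch accent.
PROTECTED = set("ČčĆćĐđŠšŽž")


def fold_accents(text: str) -> str:
    """Strip combining marks from text, except on protected Serbian letters."""
    out = []
    # Compose first so that e.g. "c" + combining caron is seen as a single "č".
    for ch in unicodedata.normalize("NFC", text):
        if ch in PROTECTED:
            out.append(ch)
            continue
        # Decompose the character and drop any combining marks it carries.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

With this sketch, an accent-marked dictionary headword and its plain-text spelling reduce to the same string, while words that differ only in č, ć, đ, š, or ž stay distinct.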

If it's a success for Serbian, the follow-up would be to test it on the Croatian, Serbo-Croatian, and Bosnian wikis.

Event Timeline

TJones triaged this task as Medium priority. Dec 15 2017, 5:48 PM
TJones created this task.

Change 415788 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra-analysis@master] [WIP] Initial commit of extra-analysis plugin

https://gerrit.wikimedia.org/r/415788

Change 417299 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Serbian-Specific Analysis Chain

https://gerrit.wikimedia.org/r/417299

  • The Serbian stemmer plugin (in the new search/extra-analysis plugin) is just about ready, in Change 415788 above.
  • The analysis config to use it, with additional analysis chain config, including diacritic folding, is in Change 417299 above.
  • My write-up of the analysis chain analysis is available on MediaWiki.
    • Summary: I enabled ICU folding with Serbian exceptions, and it works well with the stemmer, with nothing unexpected.
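
For readers unfamiliar with how such a chain is expressed, the following is a hedged sketch of Elasticsearch index settings that combine ICU folding (with exceptions) and a stemmer filter, built as a Python dict. It is not the configuration from Change 417299. The `icu_folding` filter and its `unicode_set_filter` parameter come from the Elasticsearch ICU analysis plugin, but the stemmer filter name `serbian_stemmer`, the analyzer and filter names, the exact exception set, and the filter order are all assumptions made for illustration.

```python
import json

# Assumption: letters excluded from folding so that the Serbian Latin
# alphabet's own diacritic letters are preserved.
SERBIAN_EXCEPTIONS = "[^ĐđŽžĆćŠšČč]"


def build_serbian_analysis(stemmer_filter: str = "serbian_stemmer") -> dict:
    """Return illustrative index settings for a Serbian analysis chain.

    stemmer_filter is a hypothetical name for the token filter provided by
    the stemmer plugin; substitute whatever name the plugin registers.
    """
    return {
        "analysis": {
            "filter": {
                # ICU folding restricted to characters outside the exception set.
                "serbian_icu_folding": {
                    "type": "icu_folding",
                    "unicode_set_filter": SERBIAN_EXCEPTIONS,
                },
            },
            "analyzer": {
                "serbian_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filter order here is a guess, not the deployed order.
                    "filter": ["lowercase", "serbian_icu_folding", stemmer_filter],
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_serbian_analysis(), ensure_ascii=False, indent=2))
```

The resulting JSON could be supplied as the `settings` body when creating a test index, provided the ICU plugin and a plugin registering the stemmer filter are installed.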

Change 415788 merged by jenkins-bot:
[search/extra-analysis@master] Initial commit of extra-analysis plugin

https://gerrit.wikimedia.org/r/415788

Change 417299 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Serbian-Specific Analysis Chain

https://gerrit.wikimedia.org/r/417299