
Improve Slovak Stemmer
Open, Needs Triage, Public

Description

While looking into T223787: Investigate impact of folding diacritics in Slovak, I discovered that some seemingly reasonable and reasonably common suffixes are not handled by the Slovak stemmer. These were not discovered during the earlier review because we usually focus on looking for false positives rather than false negatives.

We can probably improve the stemmer, so we should!

Some examples that I found are documented here.

Event Timeline

TJones created this task. Jul 12 2019, 10:09 PM
TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.
TJones added a comment. Edited Aug 8 2019, 8:23 PM

Another note from working on T223787: Investigate impact of folding diacritics in Slovak: Consider adding (probably hard-coding at first) a short exceptions list to prevent unwanted collisions. The only item to add to the list at the moment would be kedy, to keep it from being stemmed as ked (folded keď).

These two words are etymologically related, high-frequency function words (and candidates for stopwords), so it may not matter enough to implement an exception list just for these, but I'm documenting it here for future reference.

Edit: Another one for any stemmer exception list: forms of highly irregular ísť.
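
For illustration only, here is a rough Python sketch of how a hard-coded exception list might sit in front of a stemmer. It is not the actual Slovak stemmer: `toy_stem` is a hypothetical stand-in that strips only a final "y", and `STEM_EXCEPTIONS`, `fold_diacritics`, and `analyze` are names invented for this example. It assumes stemming runs before diacritic folding. The exception set contains only kedy, the one item named above; forms of ísť could presumably be added the same way, but they are not listed here because the task does not enumerate them.

```python
import unicodedata

# Hypothetical hard-coded exception list; the task names only "kedy" so far.
STEM_EXCEPTIONS = frozenset({"kedy"})


def fold_diacritics(token: str) -> str:
    """Remove combining marks after canonical decomposition (keď -> ked)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Stand-in for the real stemmer (assumption): strips a final 'y' only."""
    if len(token) > 3 and token.endswith("y"):
        return token[:-1]
    return token


def analyze(token: str, exceptions: frozenset = STEM_EXCEPTIONS) -> str:
    """Lowercase, stem unless the token is a listed exception, then fold."""
    lowered = token.lower()
    stemmed = lowered if lowered in exceptions else toy_stem(lowered)
    return fold_diacritics(stemmed)


if __name__ == "__main__":
    # Without the exception list, "kedy" and "keď" share an output form.
    print(analyze("kedy", exceptions=frozenset()), analyze("keď"))
    # With the exception list, "kedy" is left unstemmed.
    print(analyze("kedy"), analyze("keď"))
```

In this sketch the exception check happens before stemming, so a protected word is still lowercased and folded like any other token; only the stemming step is skipped.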