
Esperanto Stemmer Updates
Closed, Resolved, Public

Description

While testing the ES5 plugin, I found a regression and a couple of obvious places for improvement.

  • When the entire token is j, n, or jn, the suffix (j, n, or jn) is removed, leaving an empty string! The command-line stemmer didn't do that. Oops.
  • Some obviously non-Esperanto words are having j, n, and jn removed (mostly n), like barn, mann, heyn, djerdj, etc. In Esperanto words, j, n, and jn generally follow a vowel.
  • Numerals should be inflected with a dash (e.g., 1-oj, 1-a), but are not always, so we get 1a, 1960j, 1980an, etc. Those are easy to recognize, so we should do the right thing.
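
The three points above can be sketched roughly as follows. This is only an illustrative Python sketch, not the plugin's actual code: the function name strip_plural_accusative, the NUMERAL pattern, and the choice to reduce an inflected numeral to its bare digits are all assumptions made for the example, and it covers only the j/n/jn and numeral cases described here, on lowercased tokens.

```python
import re

# Assumption: the five Esperanto vowels; "y" is deliberately excluded.
VOWELS = "aeiou"

# Hypothetical pattern: digits, an optional dash, then a short ending built
# from an optional a/o, optional j, optional n (at least one letter required).
NUMERAL = re.compile(r"^(\d+)-?(?=[aojn])[ao]?j?n?$")


def strip_plural_accusative(token: str) -> str:
    """Strip j, n, or jn from a lowercased token (illustrative only)."""
    # Inflected numerals such as 1a, 1-oj, 1960j: keep just the digits.
    # (Assumption: the task does not say what the "right thing" is.)
    m = NUMERAL.match(token)
    if m:
        return m.group(1)

    # Try the longest suffix first so "jn" is not treated as a bare "n".
    for suffix in ("jn", "j", "n"):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            # Only strip when something is left and the suffix follows a vowel.
            if stem and stem[-1] in VOWELS:
                return stem

    return token


for word in ["hundojn", "jn", "barn", "djerdj", "1980an", "1-oj"]:
    print(word, "->", strip_plural_accusative(word))
```

Requiring a non-empty stem that ends in a vowel handles the first two points with a single check: j, n, and jn have no stem at all, and barn, mann, heyn, and djerdj have a consonant (or y) before the suffix.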

Event Timeline

TJones triaged this task as Medium priority. Aug 23 2018, 8:34 PM
TJones created this task.

Change 454935 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra-analysis@5.x] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454935

Change 454924 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra-analysis@master] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454924

Updated both ES5 and ES6 branches.

Change 454924 merged by jenkins-bot:
[search/extra-analysis@master] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454924

Change 454935 merged by jenkins-bot:
[search/extra-analysis@5.x] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454935