Esperanto Stemmer Updates
Closed, Resolved (Public)

Description

While testing the ES5 plugin, I found a regression and a couple of obvious places for improvement.

  • The strings j, n, and jn have the suffixes j, n, and jn removed, leaving an empty string! The command line stemmer didn't do that. Oops.
  • Some obviously non-Esperanto words are having j, n, and jn removed (mostly n), like barn, mann, heyn, djerdj, etc. j, n, and jn generally follow a vowel in Esperanto words.
  • Numerals should be inflected with a dash (e.g., 1-oj, 1-a), but are not always, so we get 1a, 1960j, 1980an, etc. Those are easy to recognize, so we should do the right thing.
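
The three points above can be illustrated with a minimal Python sketch. This is not the plugin's actual code: the function name `strip_jn`, the vowel set, the numeral pattern, and the choice to reduce an inflected numeral to its bare digits are all assumptions made for illustration, and the token is assumed to be lowercased already.

```python
import re

# Assumption: the five Esperanto vowels; "y" is deliberately not included.
ESPERANTO_VOWELS = "aeiou"

# Hypothetical pattern: digits, an optional dash, then an a/o/j/n ending
# (covers 1-oj, 1-a, 1a, 1960j, 1980an).
NUMERAL_RE = re.compile(r"(\d+)-?(?:[ao]j?n?|jn?|n)")


def strip_jn(token: str) -> str:
    """Strip j, n, or jn only where the description says it is safe."""
    # Inflected numerals, with or without the dash: keep only the digits
    # (an assumed reading of "do the right thing").
    numeral = NUMERAL_RE.fullmatch(token)
    if numeral:
        return numeral.group(1)

    # Try the longest suffix first so "jn" is not split into "j" + "n".
    for suffix in ("jn", "j", "n"):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            # Requiring a vowel before the suffix also guarantees the
            # stem is non-empty, so "j", "n", and "jn" are left alone.
            if stem and stem[-1] in ESPERANTO_VOWELS:
                return stem
    return token


if __name__ == "__main__":
    for word in ("j", "jn", "barn", "djerdj", "hundojn", "1960j", "1-oj"):
        print(word, "->", strip_jn(word))
```

Checking for a preceding vowel handles the first two bullets with one condition, since a bare j, n, or jn has no preceding character at all.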

Details

Related Gerrit Patches:
search/extra-analysis : 5.x : Update Esperanto Stemmer
search/extra-analysis : master : Update Esperanto Stemmer

Event Timeline

TJones triaged this task as Medium priority. (Aug 23 2018, 8:34 PM)
TJones created this task.

Change 454935 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra-analysis@5.x] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454935

Change 454924 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/extra-analysis@master] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454924

Updated both ES5 and ES6 branches.

Change 454924 merged by jenkins-bot:
[search/extra-analysis@master] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454924

Change 454935 merged by jenkins-bot:
[search/extra-analysis@5.x] Update Esperanto Stemmer

https://gerrit.wikimedia.org/r/454935

debt closed this task as Resolved. (Sep 13 2018, 9:07 PM)