
Investigate using a better stemmer & stopwords for Portuguese wikis
Open, High, Public

Description

User Story: As a Portuguese searcher, I want to have the best stemmer available so I get as many (correct) related forms of words as possible when I search (without quotes) to improve recall and ranking.

Notes
While working on unpacking the Brazilian Portuguese analysis chain (T325092), I decided to compare it to the Portuguese analysis chain, since the Brazilian and European versions of Portuguese are not wildly different (especially in formal written form). The stemmers, however, are very different. After a very brief investigation, I think the brazilian stemmer is possibly better, but that needs to be verified.

However, the Portuguese stemmer comes in four flavors: light_portuguese (the currently used one), minimal_portuguese, portuguese, and portuguese_rslp. These should all be reviewed, and the best one for on-wiki searching should be used.
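As a rough starting point for that review, the sketch below builds Elasticsearch index settings with one test analyzer per candidate stemmer, so the same text can be run through each. It is a simplified chain (standard tokenizer, lowercase, stop, stemmer), not the actual CirrusSearch configuration. The names `pt_stop`, `stem_*`, `test_*`, and `build_test_settings` are made up for illustration.

```python
# Candidate stemmer names from this task: the four Portuguese flavors plus brazilian.
CANDIDATE_STEMMERS = [
    "light_portuguese",
    "minimal_portuguese",
    "portuguese",
    "portuguese_rslp",
    "brazilian",
]


def build_test_settings(stemmers=CANDIDATE_STEMMERS, stopwords="_portuguese_"):
    """Return index settings with one custom analyzer per stemmer.

    Filter and analyzer names are hypothetical; only the stemmer differs
    between analyzers, so output differences come from stemming alone.
    """
    filters = {"pt_stop": {"type": "stop", "stopwords": stopwords}}
    analyzers = {}
    for name in stemmers:
        filters[f"stem_{name}"] = {"type": "stemmer", "language": name}
        analyzers[f"test_{name}"] = {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "pt_stop", f"stem_{name}"],
        }
    return {"settings": {"analysis": {"filter": filters, "analyzer": analyzers}}}


settings = build_test_settings()
```

The resulting dict could be used as the body when creating a scratch index, after which each `test_*` analyzer can be exercised on sample text through the analyze API.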

The brazilian stopword list is also fairly different from the portuguese stopword list, and we should use whichever is better (which may actually be a combination of the two).
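A minimal sketch of how the two lists could be compared, assuming each has been exported as a plain list of words. The sample words below are placeholders chosen for illustration, not the contents of the real stopword lists, and `compare_stopwords` is a hypothetical helper.

```python
def compare_stopwords(portuguese, brazilian):
    """Summarize overlap and differences between two stopword lists."""
    pt, br = set(portuguese), set(brazilian)
    return {
        "shared": sorted(pt & br),
        "portuguese_only": sorted(pt - br),
        "brazilian_only": sorted(br - pt),
        "combined": sorted(pt | br),  # candidate merged list
    }


# Placeholder samples only; substitute the real lists before drawing conclusions.
sample_pt = ["a", "de", "o", "que"]
sample_br = ["a", "de", "para", "um"]
report = compare_stopwords(sample_pt, sample_br)
```

The `portuguese_only` and `brazilian_only` entries are the ones worth showing to a fluent speaker, since they decide whether one list, the other, or the combination is preferable.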

Acceptance Criteria:

  • After review with a fluent speaker or speakers:
    • explain and document whether brazilian or another option provides better stemming than light_portuguese
    • document what stopword list is better (one, the other, or a combination)
  • If any changes are warranted, update AnalysisConfigBuilder to do the better thing for both portuguese and brazilian configs.
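To give reviewers something concrete to look at, one option is to list the word groupings that change between two stemmers. The sketch below assumes each stemmer is available as a function from word to stem; the two toy stemmers here are stand-ins for illustration and do not reproduce the behavior of any real stemmer. All function names are hypothetical.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Return the sets of words (size > 1) that share a stem."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {frozenset(g) for g in groups.values() if len(g) > 1}


def changed_groups(words, old_stem, new_stem):
    """Report groupings that exist under only one of the two stemmers."""
    old = group_by_stem(words, old_stem)
    new = group_by_stem(words, new_stem)
    return {"lost": old - new, "gained": new - old}


# Toy stand-ins: NOT the real stemmers, just enough to show the diff.
def toy_old(word):
    return word[:-1] if word.endswith("s") else word


def toy_new(word):
    word = toy_old(word)
    return word[:-1] if word.endswith(("a", "o")) else word


diff = changed_groups(["gato", "gatos", "gata", "gatas"], toy_old, toy_new)
```

In practice the stems would come from running the real analyzers over a sample of wiki text; the `lost` and `gained` groups are what a fluent speaker would judge as correct or incorrect.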