Explore Using Arabic Analysis Chain for Egyptian Arabic and Moroccan Arabic
Closed, Resolved | Public | 5 Estimated Story Points

Description

User Story: As a user of Egyptian Arabic (arz) or Moroccan Arabic (ary) wikis, I would appreciate getting better search results from using some or all of the analysis chain used for Arabic (ar).

While working on unpacking Arabic for T294147, I talked to Mike (@MRaishWMF) about some potential ICU folding, and he mentioned some differences between Standard Arabic and Egyptian Arabic and Moroccan Arabic. While there certainly are differences, there are also probably some similarities, such that either or both could benefit from using some parts of the unpacked Arabic analysis chain—at the very least, some of the Arabic-specific normalization should be helpful. Mike has agreed to help with the analysis, too!

Acceptance Criteria:

  • Either the Egyptian Arabic (arz) and Moroccan Arabic (ary) wikis are configured to use some of the Standard Arabic analysis chain elements, or we have documented why that turned out not to be a good idea.

Event Timeline

TJones set the point value for this task to 5. Sep 26 2022, 3:59 PM

Change 844559 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Customize Arabic Analysis for Egyptian and Moroccan

https://gerrit.wikimedia.org/r/844559

Full write-up on MediaWiki.

Everything looks good using the standard Arabic analysis chain, and we added almost 130 additional stopwords. About 12% of Egyptian Arabic words and 24% of Moroccan Arabic words were filtered as stopwords. Of the remainder, 1 in 6 Egyptian Arabic words and 1 in 5 Moroccan Arabic words match other words after stemming.
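
For readers who want to reproduce figures of this kind on their own sample, here is a minimal sketch of how a stopword rate and a stemming-merge rate could be computed. It is not the tooling used for the write-up. `toy_stem` and `analysis_stats` are hypothetical names, the stemmer only strips the definite article as a stand-in for a real Arabic stemmer, and the merge rate is assumed to be counted over distinct word forms rather than over tokens.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip the definite article."""
    prefix = "ال"
    # Only strip when at least two characters would remain.
    if word.startswith(prefix) and len(word) > len(prefix) + 1:
        return word[len(prefix):]
    return word


def analysis_stats(tokens, stopwords, stem=toy_stem):
    """Return (stopword fraction, fraction of distinct kept words that
    share a stem with at least one other distinct kept word)."""
    kept = [t for t in tokens if t not in stopwords]
    stop_fraction = 1 - len(kept) / len(tokens) if tokens else 0.0

    # Group distinct surface forms by their stem.
    groups = defaultdict(set)
    for t in kept:
        groups[stem(t)].add(t)

    distinct = sum(len(forms) for forms in groups.values())
    merged = sum(len(forms) for forms in groups.values() if len(forms) > 1)
    merge_fraction = merged / distinct if distinct else 0.0
    return stop_fraction, merge_fraction


# Illustrative input only; real measurements need a corpus sample from the wiki.
sample_tokens = ["في", "الكتاب", "كتاب", "بيت", "في", "قلم"]
sample_stopwords = {"في"}
stop_rate, merge_rate = analysis_stats(sample_tokens, sample_stopwords)
print(f"stopwords: {stop_rate:.0%}, merged after stemming: {merge_rate:.0%}")
```

Counting over tokens instead of distinct forms would give different numbers, so the counting unit should be stated alongside any figure produced this way.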

Change 844559 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Customize Arabic Analysis for Egyptian & Moroccan

https://gerrit.wikimedia.org/r/844559