Review Applying Indonesian Analysis Chain for Malay
Closed, ResolvedPublic

Description

After working on Serbian (T178926/T192395) and Slovak (T178929) and looking at the papers they were based on or translated from, I decided to reconsider what counts as "implementable" for Malay, review the papers on Malay stemming, and compare them to the existing Indonesian analysis.

My understanding of Indonesian and Malay was pretty simple: that they are "more distinct than American and British English, but less distinct than Spanish and Portuguese". Also, Malay and Indonesian didn't interact in my investigation into fallback languages, where each is used as a fallback language for other languages.

However, looking at the wiki page on the matter, and reviewing some other sources, it seems that a lot of the difference is in Dutch-influenced vs English-influenced spelling of certain sounds, Dutch vs English loanwords, other vocabulary differences, and some pronunciation differences—all of which can decrease mutual intelligibility—but the grammar of the two standard forms seems to be essentially the same.

I also compared the Malay stemmer papers with the Lucene Indonesian stemmer implementation, and verified that they work on similar affixes. There are some discrepancies, but the core affixes are the same, and the differences seem to come down to which affixes to try to account for (some derivational vs inflectional).

While it's possible that spelling differences or vocabulary differences could increase the error rate for Malay vs Indonesian, it seems to be worth testing; if it is successful, all we need to do is configure it. Everything is not only already built, it's already installed, too!

Event Timeline

TJones triaged this task as Medium priority. Jun 8 2018, 8:23 PM
TJones created this task.

Full write-up is on MediaWiki.

Generally, it looks good, and if we unpack the Indonesian analyzer in order not to lose the current ICU normalization, we should apply the unpacked version to Indonesian-language wikis, too.
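As a rough illustration of what "unpacking" means here, the sketch below builds Elasticsearch analysis settings that spell out the built-in `indonesian` analyzer as a custom analyzer, with the `icu_normalizer` token filter (from the analysis-icu plugin) standing in for the usual `lowercase` step. The function name, the analyzer name `text`, and the filter names `indonesian_stop` and `indonesian_stemmer` are assumptions chosen for this example; this is not the configuration that was actually committed.

```python
import json


def unpacked_indonesian_analysis():
    """Build analysis settings for an unpacked Indonesian analyzer.

    Assumption: the monolithic "indonesian" analyzer is roughly
    standard tokenizer -> lowercase -> stop words -> stemmer, and we
    swap lowercase for ICU normalization (requires analysis-icu).
    """
    return {
        "analysis": {
            "filter": {
                # Built-in Indonesian stop word list.
                "indonesian_stop": {
                    "type": "stop",
                    "stopwords": "_indonesian_",
                },
                # Lucene Indonesian stemmer exposed as a token filter.
                "indonesian_stemmer": {
                    "type": "stemmer",
                    "language": "indonesian",
                },
            },
            "analyzer": {
                # Hypothetical analyzer name, used for both wikis.
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "icu_normalizer",  # replaces "lowercase"
                        "indonesian_stop",
                        "indonesian_stemmer",
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    # The same settings would apply to Malay and Indonesian wikis.
    print(json.dumps(unpacked_indonesian_analysis(), indent=2))
```

Settings like these only take effect for newly indexed documents, which is why a reindex is needed after deployment.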

Next steps:

  • Get speaker review of the stemming groups.
  • Assuming the review is positive, commit the unpacked ICU-normalizing config for Malay and Indonesian.
  • Once the config is deployed, reindex Malay- and Indonesian-language wikis with the new config.

Vvjjkkii renamed this task from Review Applying Indonesian Analysis Chain for Malay to 7cbaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii removed TJones as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 7cbaaaaaaa to Review Applying Indonesian Analysis Chain for Malay. Jul 2 2018, 1:07 PM
CommunityTechBot assigned this task to TJones.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

Change 446900 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Set up analysis config for Malay and Indonesian

https://gerrit.wikimedia.org/r/446900

Change 446900 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Set up analysis config for Malay and Indonesian

https://gerrit.wikimedia.org/r/446900