
Implement language converter for Malay (from ms-Latn to ms-Arab)
Open, Needs Triage, Public

Description

Currently, Malay Wikipedia only uses the Roman (ms-Latn) script. We wish to be able to use the Jawi (ms-Arab) script (derived from the Arabic script) as well. The script part of the code ms-Arab follows ISO 15924 (http://en.wikipedia.org/wiki/ISO_15924), and the code is widely used on Wiktionary, especially English Wiktionary.

We want to have a Jawi option just like the one Kazakh Wikipedia currently has. There was a Jawi converter tool in WMF Labs that rendered Malay Wikipedia in Jawi script. However, it is now inactive; an archived copy can be seen here (http://web.archive.org/web/20170811111850/http://tools.wmflabs.org/jawi/wiki/Laman_Utama).

Thank you in advance.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I am here to show support for this. We want to be able to use Jawi on Malay Wikipedia as well, just like the other Wikipedia language editions that have an option to switch between their different writing scripts.

Actually, we previously attempted to convert from the Roman Malay script (also known as Rumi script) to the Arabic Malay script (also known as Jawi script) at https://ms.wikipedia.org/wiki/Wikipedia:WikiProjek_Jawi. A Malay Wikipedia user named Kurniasan assisted us in creating the conversion code, as shown on these pages:

https://ms.wikipedia.org/wiki/Pengguna:Kurniasan/rumikpdjawi.js
https://ms.wikipedia.org/wiki/Pengguna:Kurniasan/kamusrumikpdjawi.js

Below are my own attempts:

Implementation for the Vector skin: https://ms.wikipedia.org/wiki/Pengguna:Hakimi97/vector.js
The core JavaScript code for unidirectional Rumi-to-Jawi conversion: https://ms.wikipedia.org/wiki/Pengguna:Hakimi97/rumikpdjawi.js
The code for a dictionary consisting of core words (either roots or stems), prefixes and suffixes (based on Kamus Dewan Perdana, but no longer updated because its large size makes it difficult to store as a simple user script): https://ms.wikipedia.org/wiki/Pengguna:Hakimi97/kamusrumikpdjawi.js

These scripts demonstrate the concept of word-to-word conversion (as opposed to letter-to-letter conversion, since Rumi and Jawi cannot be matched on a one-to-one basis), together with affix-to-affix conversion (for Rumi-Jawi conversion, this mainly concerns prefixes and suffixes).
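A minimal sketch of the word-to-word plus affix-to-affix idea follows. It is not taken from the user scripts above: the function name `convertWord` and the three small tables are hypothetical, and the sample entries are assumptions chosen only for illustration.

```javascript
// Illustrative tables only. The entries are assumptions for this sketch and
// are not copied from the user scripts linked above.
const ROOTS = new Map([["makan", "ماکن"]]);
const PREFIXES = new Map([["di", "د"]]);
const SUFFIXES = new Map([["an", "ن"]]);

// Convert one Rumi word: try the whole word first, then prefix + root + suffix.
function convertWord(rumi) {
  const word = rumi.toLowerCase();
  if (ROOTS.has(word)) return ROOTS.get(word);

  // Include the empty affix so prefix-only and suffix-only forms also match.
  const prefixes = [["", ""], ...PREFIXES];
  const suffixes = [["", ""], ...SUFFIXES];
  for (const [pRumi, pJawi] of prefixes) {
    if (!word.startsWith(pRumi)) continue;
    for (const [sRumi, sJawi] of suffixes) {
      if (!word.endsWith(sRumi)) continue;
      const root = word.slice(pRumi.length, word.length - sRumi.length);
      if (ROOTS.has(root)) return pJawi + ROOTS.get(root) + sJawi;
    }
  }
  // Unknown words stay in Rumi rather than being guessed letter by letter.
  return rumi;
}
```

Looking up whole words and affixes, instead of mapping letters, is what lets the same Rumi letter map to different Jawi spellings depending on the word.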

However, integrating the scripts into LanguageConverter and optimizing them for smooth and efficient operation will require significant technical assistance. Please note that the code was developed more than 10 years ago, so a major upgrade is necessary. Additionally, the current JavaScript implementation only performs unidirectional conversion from Rumi script to Jawi script. While bidirectional conversion would be beneficial, homographs are a problem: different Jawi words can share the same Rumi spelling, and vice versa.

Perhaps we could separate the word-to-word dictionary into a "default dictionary (DD)", an "alternative dictionary for Rumi homographs (ADRH)" and an "alternative dictionary for Jawi homographs (ADJH)". By default, DD would always be applied, while ADRH and ADJH would be applied only when the page explicitly requests it for a given word.

For example, the word "akrab" (ms-Latn) in Malay could match either "اقرب" (ms-Arab, meaning "close (in terms of relationship)") or "عقرب" (ms-Arab, meaning "Scorpius constellation"). By default, "akrab" would be converted to "اقرب" automatically through DD, because "اقرب" is more commonly used in Malay. However, there should be wiki markup that allows "akrab" on Malay Wikipedia pages (regardless of namespace) to be converted into "عقرب" by going through ADRH instead of DD. Likewise, for Jawi homographs, there should be wiki markup that allows users to convert through ADJH.
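The DD/ADRH split could be sketched roughly as below. This is only an illustration of the proposal, not existing code: the function `convertWithHomographs` and its `requested` parameter (standing in for whatever wiki markup would carry the request) are hypothetical, and the tables hold just the "akrab" example.

```javascript
// Illustrative entries based on the "akrab" example; real tables would be far larger.
const DD = new Map([["akrab", "اقرب"]]);      // default dictionary
const ADRH = new Map([["akrab", ["عقرب"]]]);  // alternatives for Rumi homographs

// Convert a Rumi word. `requested` is the Jawi form asked for by markup on the
// page (hypothetical parameter); it is used only if ADRH lists it for this word.
function convertWithHomographs(rumi, requested) {
  const key = rumi.toLowerCase();
  if (requested !== undefined) {
    const alternatives = ADRH.get(key) || [];
    if (alternatives.includes(requested)) return requested;
  }
  // No valid request: fall back to the default dictionary, or leave the word as-is.
  return DD.has(key) ? DD.get(key) : rumi;
}
```

ADJH would mirror this for the Jawi-to-Rumi direction, with a Jawi key and a list of alternative Rumi forms.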

Hi everyone,

I would like to provide an update regarding the Malay language converter.

I have developed a JavaScript converter prototype for the Malay language that fetches Wikidata lexicographical data for both ms (Rumi script) and ms-arab (Jawi script) through the Wikidata Query Service (WDQS). There are currently two versions of the Wikidata-based converter:

The first link is the merged version of the Wikidata-based converter (combining the content script converter and the interface language converter):
https://ms.wikipedia.org/wiki/Pengguna:Hakimi97/penukar-rumi-jawi-wikidata-gabung.js

The second link is the separated version of the Wikidata-based converter (keeping the content script converter and the interface language converter apart):
https://ms.wikipedia.org/wiki/Pengguna:Hakimi97/penukar-rumi-jawi-wikidata-pisah.js
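For readers unfamiliar with WDQS, here is a rough sketch of how Rumi/Jawi pairs might be pulled from lexeme forms. It is not the code from the scripts above. The function names are hypothetical, and the query assumes that Q9237 is the Malay language item and that both spellings are stored as representations of the same form, tagged "ms" and "ms-arab".

```javascript
// Public WDQS SPARQL endpoint.
const WDQS_ENDPOINT = "https://query.wikidata.org/sparql";

// Build a query pairing the "ms" and "ms-arab" representations of each form.
// Assumption: Q9237 is the Wikidata item for the Malay language.
function buildQuery(limit) {
  return `
SELECT ?rumi ?jawi WHERE {
  ?lexeme dct:language wd:Q9237 ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?rumi, ?jawi .
  FILTER(LANG(?rumi) = "ms" && LANG(?jawi) = "ms-arab")
}
LIMIT ${limit}`;
}

// Turn a SPARQL JSON result into a Rumi -> Jawi lookup table.
function toLookup(sparqlJson) {
  const table = new Map();
  for (const row of sparqlJson.results.bindings) {
    const rumi = row.rumi.value.toLowerCase();
    // Keep the first Jawi form seen; homographs need separate handling.
    if (!table.has(rumi)) table.set(rumi, row.jawi.value);
  }
  return table;
}

// Fetch and convert in one step (needs network access and a runtime with fetch).
async function fetchLookup(limit = 1000) {
  const url = `${WDQS_ENDPOINT}?format=json&query=${encodeURIComponent(buildQuery(limit))}`;
  const response = await fetch(url, {
    headers: { Accept: "application/sparql-results+json" },
  });
  if (!response.ok) throw new Error(`WDQS request failed: ${response.status}`);
  return toLookup(await response.json());
}
```

Keeping `toLookup` separate from the network call means the table-building logic can be checked against a saved response without querying WDQS.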

For the Malay language, I have ordered the conversion steps by the following priority:

  1. Process the no-convert (Templat:Kekal Rumi) and explicit-convert (Templat:Tukar Jawi) templates. These two templates are to prevent the conversion of certain Rumi text and to handle homograph conversion, respectively.
  2. All numbers (including those with decimal separators, decimal points, or percentage signs) should remain unchanged throughout the text.
  3. Conversion of multi-word entries (including phrases, idioms, etc.).
  4. Conversion of words with apostrophes.
  5. Conversion of hyphenated words.
  6. Conversion of individual words (regardless of punctuation).
  7. Special rules for converting "ke" and "di": When "ke" and "di" appear in isolation, their Jawi forms "ک" and "د" should be contextually linked to the following words (with the lexical category "noun"). If the following word starts with alef (ا), the alef should be converted into Alef with Hamza Above (أ).
  8. Punctuation conversion: applies only to commas (excluding decimal separators), semicolons, and question marks.
  9. Special hamza utility to convert the relevant hamza (in onset and medial positions) to the three-quarter hamza.
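Two of the rules above (7 and 8, with the number protection from rule 2) can be sketched as small helpers. This is an illustration under assumptions, not the prototype's code: the names `convertPunctuation` and `attachParticle` are hypothetical, and the caller is assumed to already know the lexical category of the following word.

```javascript
// Arabic-script punctuation for the three marks named in rule 8.
const PUNCTUATION = new Map([
  [",", "\u060C"], // ARABIC COMMA
  [";", "\u061B"], // ARABIC SEMICOLON
  ["?", "\u061F"], // ARABIC QUESTION MARK
]);

// Replace commas, semicolons and question marks, but keep any comma that sits
// between two digits, so numbers stay unchanged (rule 2).
function convertPunctuation(text) {
  return text.replace(/(?<!\d),|,(?!\d)|[;?]/g, (mark) => PUNCTUATION.get(mark));
}

const ALEF = "\u0627";             // ا
const ALEF_HAMZA_ABOVE = "\u0623"; // أ
const PARTICLES = new Map([
  ["ke", "\u06A9"], // ک
  ["di", "\u062F"], // د
]);

// Rule 7: attach an isolated "ke"/"di" to the following Jawi noun. Returns
// null when the rule does not apply, so the caller can fall back to ordinary
// word conversion.
function attachParticle(particle, nextJawi, nextCategory) {
  const jawiParticle = PARTICLES.get(particle.toLowerCase());
  if (jawiParticle === undefined || nextCategory !== "noun") return null;
  // An initial alef becomes alef with hamza above once the particle is attached.
  const tail = nextJawi.startsWith(ALEF)
    ? ALEF_HAMZA_ABOVE + nextJawi.slice(ALEF.length)
    : nextJawi;
  return jawiParticle + tail;
}
```

Running the punctuation pass after all word-level passes, as the list orders it, keeps earlier dictionary lookups from having to match Arabic-script punctuation.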

The details on how to use the converter are documented on the following page:

https://ms.wikipedia.org/wiki/Wikipedia:WikiProjek_Penukar_Tulisan/Cara_menyumbang

However, as explained on this page (https://ms.wikipedia.org/wiki/Wikipedia:WikiProjek_Penukar_Tulisan/Perkembangan_teknikal), the current converter has several limitations:

  1. It is a unidirectional converter.
  2. It cannot handle homographs. (Solved since the May 2025 update, using Templat:Tukar Jawi.)
  3. It does not fully support the three-quarter hamza. (Fully supported through span since the second June 2025 update.)
  4. The user interface language preference (from ms to ms-arab) requires a separate button labeled "Penukar antara muka". (Merged since the first June 2025 update; users can still choose the separated version.)
  5. Handling of initialisms is yet to be decided. (They should be added to the lexeme form as "shortform". If any shortforms are conflicting homographs, use Templat:Tukar Jawi.)
  6. It is not developed to be compatible with the Wikipedia mobile apps developed by the Wikimedia Foundation.

Winston_Sung renamed this task from LanguageConverter for Malay (from ms-Latn to ms-Arab) to Language converter for Malay (from ms-Latn to ms-Arab). Apr 8 2025, 5:45 AM
Restricted Application added a subscriber: alaa. Apr 8 2025, 5:45 AM
Winston_Sung renamed this task from Language converter for Malay (from ms-Latn to ms-Arab) to Implement language converter for Malay (from ms-Latn to ms-Arab). Apr 8 2025, 5:46 AM

Hi everyone,

I would like to know the steps or roadmap for implementing the Malay language script converter. Firstly, I could not find a detailed guideline on implementing Language Converter for languages with multiple writing systems. To be honest, our Malay (ms) language community is especially clueless regarding this matter. Secondly, there might be an urgent need for the Malay language script converter, as indicated by the ongoing request to create a separate Malay Wikipedia for the Jawi script: https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Standard_Malay_(Jawi)

Personally, I do not see the need to create an entirely new wiki solely for the Malay Jawi script. However, I understand the frustration of certain users due to the lack of a platform to contribute in Malay Jawi script, which has led them to pursue the creation of a Malay Wikipedia in the Jawi script.

After consulting with @Taufik , I would like to ping the Language Converter members (@aude @liangent @Mjbmr @Nemo_bis @Chiefwei @Liuxinyu970226 @cscott @Taiwania_Justo @SunAfterRain @Winston_Sung ) as well as @MF-Warburg and @Maor_X to discuss related matters.