
Add support for Jawi Hamza
Open, MediumPublic

Assigned To
None
Authored By
EmpAhmadK
Sep 15 2024, 4:18 PM
Referenced Files
F57512803: unkown.png
Sep 15 2024, 4:18 PM
F57512801: tirmizi.png
Sep 15 2024, 4:18 PM
F57512824: kasim3.png
Sep 15 2024, 4:18 PM
F57512799: kasim2.png
Sep 15 2024, 4:18 PM
F57512797: kasim1.png
Sep 15 2024, 4:18 PM
F57512795: dahaman & ahmad.png
Sep 15 2024, 4:18 PM
F57512793: nixon.png
Sep 15 2024, 4:18 PM
F57512791: hamzah.png
Sep 15 2024, 4:18 PM

Description

The Jawi script is a modified Arabic script used to write the Malay language (ms_Arab). It is an abjad-alphabet hybrid: it uses the base abjad system from Arabic, but applies certain alphabetic rules, such as the use of certain vowel letters. It also modifies or repurposes certain letters for different usages, or adds new variants to existing letters.

One of the main features of Jawi is its usage of the letter Hamza, which serves as a form of a utility letter rather than simply as a glottal letter as in Arabic. The most notable change is the addition of a new variant of the letter, known in Malay as "hamzah tiga suku/همزة تيݢ سوکو," which can be literally translated as "three-quarter hamza." For the purpose of this task, I will refer to the letter simply as "Jawi Hamza."[1]

The Malay name refers to the letter's appearance: it takes the form of a standard stand-alone hamza, but is raised so that its baseline sits between half and three-quarters of the height of alef, depending on the writing style.[2] The letter appears frequently in the Malay dictionary, as it serves multiple purposes, including as a glottal letter (similar to, but used differently from, Arabic) and as a vowel letter separator. It has been in use in Malay writing for centuries, and is still used to this day.

As common as this character is, it has not been included in the Unicode Standard. In September 2021, the annotation for the Kazakh high hamza (U+0674) was amended to include "Jawi" alongside "Kazakh." However, the Kazakh high hamza and the Jawi hamza share neither the same visual appearance nor the same attributes.

First of all, as visible in Kazakh writing,[3] the Unicode standard,[4] and how the character is depicted in most fonts (see Times New Roman, Noto Sans Arabic, Segoe UI, SF Arabic, and IBM Plex Sans Arabic, among others),[5] the Kazakh high hamza is of a smaller size in comparison to the regular hamza, and more similar to the size of a hamza above and hamza below (U+0654 & U+0655).

The glyph is also commonly depicted as being higher than the vertical height of the Jawi hamza, sometimes even as high as the hamza from hamza above.[5]
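The code points mentioned above can be inspected with a short Python sketch. It uses only the standard `unicodedata` module; the dictionary labels are informal names chosen for this illustration. Note that the character properties record that U+0674 is a letter while U+0654 and U+0655 are combining marks, but they say nothing about glyph size or height, which is decided by the code charts and by fonts.

```python
import unicodedata

# Informal labels (chosen for this sketch) for the code points discussed above.
CODE_POINTS = {
    "regular hamza": "\u0621",
    "high hamza": "\u0674",
    "hamza above": "\u0654",
    "hamza below": "\u0655",
}

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for label, ch in CODE_POINTS.items():
    # "Lo" means a letter; "Mn" means a non-spacing combining mark.
    print(label, *describe(ch))
```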

In the Unicode Standard, it is said that the character "forms digraphs," which is not what the Jawi hamza does. The Jawi hamza does not depend on any other letter: it can appear before, between, or after other letters, without requiring any particular letter before or after it. It also does not merge with any other letter, as some fonts depict (some old manuscripts may appear to show it merging with certain letters, but this is only a matter of calligraphic style, whereby letters are stacked on top of each other to save space or to look visually cleaner).
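One way to see the digraph behaviour attributed to U+0674 is to look at the compatibility decompositions of the four code points that follow it, U+0675 to U+0678. The sketch below is only an illustration using Python's standard `unicodedata` module; the helper name `compat_parts` is made up for this example. Each of these characters decomposes into a base letter followed by U+0674, whereas U+0674 itself has no decomposition.

```python
import unicodedata

HIGH_HAMZA = "\u0674"

def compat_parts(ch):
    """Return the characters in ch's decomposition mapping, skipping the <tag>."""
    fields = unicodedata.decomposition(ch).split()
    return [chr(int(f, 16)) for f in fields if not f.startswith("<")]

for cp in range(0x0675, 0x0679):
    parts = compat_parts(chr(cp))
    # Print code points rather than glyphs so the output is readable anywhere.
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), "->",
          " + ".join(f"U+{ord(p):04X}" for p in parts))
```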

It must be noted that in Jawi, the Jawi hamza must be depicted at the same size as a regular hamza, with its baseline positioned between half and three-quarters of the height of an alef.[2]

The effort to propose the character to the Unicode Technical Committee (UTC) started in August 2008 during the Internationalized Domain Name Forum, when linguists noticed the absence of the character from the Unicode Standard while discussing its inclusion in the Jawi coded character set standard. In 2009, it was suggested that the character be proposed to the UTC. However, it is unclear whether the proposal was ever sent.

After the "Jawi" annotation was added to U+0674 in 2021, a proposal to add the Jawi hamza was submitted by Karim et al. in 2022, where they described the difference between the Jawi hamza and the Kazakh high hamza. However, the Script Ad Hoc group (SAH) said that font designers are expected to either employ a language tag or create a Jawi-specific font. (Anderson et al., 2022)

We believe that this is an inappropriate solution, if it can even be considered a solution. It is akin to "employing a language tag or creating a Jawi-specific font" to display the character "Ǎ" as the character "Ă" when writing in Romanian, simply because a breve and a caron look nearly identical, or using the character "²" to display "2" because the former is just a smaller version of the same shape. This solution would also mean that it would be impossible to display the Jawi hamza in plain-text environments such as text messaging applications, which use only one font. It would be even more technically disastrous to quote Kazakh and Jawi in one sentence without changing fonts for the specific word. This defeats the very purpose for which the Unicode Standard was created.

Current temporary solutions for digital applications include various techniques of altering its visual appearance. One way is by creating a custom font, using an unused character which is depicted to look like the Jawi hamza.[6] This does mean, however, that when copying the text or when the font does not load, the original character would be seen in its place. Another way is to create a font that displays the regular Arabic hamza as the Jawi hamza in certain positions or when forced by a zero-width character, or displays the Kazakh high hamza as the Jawi hamza (Airaha, 2023). In this case, the character would be seen as a regular hamza or Kazakh high hamza when viewed without its font. Another way this is handled when writing on websites is to use a span class for the regular Arabic hamza to alter its vertical position, which allows it to remain the same even if any font fails to load.[7] However, it still does not work if the text is copied to another word processor. There are various other ways this is handled, each employing different workarounds that would not be compatible with platforms that do not support any customised formatting, whether through font or other features.

Therefore, we believe that the Jawi hamza deserves its own character code in the Unicode Standard, separate and disunified from U+0674 (the Kazakh high hamza) and independent from U+0621 (the regular Arabic hamza).

Note:

  1. In English, the character has been referred to by multiple names. Khalid (2009) used "Jawi Letter Hamzah Three Quarter." Karim et al. (2022) used "Arabic Letter Three-Quarter High Hamza." However, we propose to use either "Arabic Letter Jawi High Hamza" or "Arabic Letter Jawi Hamza," in the same way that the characters U+06C5 and U+06CC are named "Arabic Letter Kirghiz Oe" and "Arabic Letter Farsi Yeh" respectively.
  2. Refer to Figure 1
  3. Refer to Figure 2
  4. Refer to Figure 3
  5. Refer to Figures 4.1 & 4.2
  6. Refer to Figure 5
  7. Refer to Figure 6

Figures:
Figure 1 - Snippet from Ahmad (2015), which says "the hamza letter like above, must be written at the level of the middle or three-quarters of the height of the letter alef. Its size must be big."

ahmad.png (720×738 px, 172 KB)

Figure 2 - Excerpts from Kazakh text showing the Kazakh high hamza in use. The top picture is from a Kazakh-language edition of the Best Chinese short stories of 1978, while the bottom is from a Kazakh edition article from People's Daily.
kazakh.png (720×706 px, 730 KB)

Figure 3 - The Unicode 16.0 chart depicting U+0674 as smaller than U+0621, making it impossible to be used for the purpose of the Jawi hamza. The red lines indicate the heights of U+0674 and U+0654 which are the same, whereas the blue lines indicate the height of U+0621 which should be the height of the Jawi hamza.
unicode.png (1×1 px, 546 KB)

Figure 4.1 - Current state of Kazakh high hamza in most fonts, making it unusable for the purpose of the Jawi hamza.
Screenshot 2024-09-14 184052.png (1×2 px, 147 KB)

Figure 4.2 - How the Jawi hamza should look in these fonts.
Screenshot 2024-09-14 184117.png (921×2 px, 85 KB)

Figure 5 - The implementation of the Jawi hamza on the web version of Utusan Melayu, a defunct section of the Utusan Malaysia newspaper. Notice the usage of an unused character (U+FBB6) in place of the Jawi hamza as seen in the source code.
hamzah.png (720×1 px, 169 KB)

Figure 6 - The implementation of the Jawi hamza on Wikipedia, where the regular hamza is encased in a span box which has been specified to be positioned higher.
nixon.png (2×3 px, 1 MB)

Figure 7 - Dahaman & Ahmad (1988) explaining the usage of both the Jawi hamza and the regular hamza. This shows that both characters are not interchangeable and that there is a clear difference between them in function and purpose.
dahaman & ahmad.png (721×837 px, 157 KB)

Figure 8 - Kasim (2019), a Jawi text book, showing the five different types of hamza in Jawi. From right to left, the third is the Jawi hamza and the fourth is the regular hamza (alef is written for height reference). Notice that they are placed separately.
kasim1.png (720×983 px, 308 KB)

Figure 9 - Kasim (2019), a Jawi text book, showing some of the unique usages of the Jawi hamza
kasim2.png (667×1 px, 203 KB)

Figure 10 - Kasim (2019), a Jawi text book, instructing the student to transliterate two excerpts from Latin to Jawi. The exercise requires the differentiation between a regular hamza and a Jawi hamza as it uses both.
kasim3.png (2×2 px, 697 KB)

Figure 11 - An excerpt from Tarmizi (2019), which is a news article on Utusan Malaysia, showing some of the recent usages of the letter on print media.
tirmizi.png (840×1 px, 340 KB)

Figure 12 - An old family tree, estimated to be from around the 18-19th century, of the Tok Masjid family. The name of the family itself in Jawi (توء مسجد) requires the usage of the Jawi hamza, but can only be written using the regular hamza as of now.
unkown.png (856×1 px, 775 KB)

Reference:

Event Timeline

This is not the right venue for this. The Wikimedia Foundation has no control over what characters are encoded in Unicode.

As common as this character is, it has not been included in the Unicode Standard.

See https://unicode.org/versions/Unicode16.0.0/core-spec/chapter-9/#G66979 : "Font designers can use language tagging in order to support the preferred shapes for both Kazakh and Jawi in multilingual fonts."

I'm not sure what you expect Wikimedia to do here. Could you please clarify?

Therefore, we believe that the Jawi hamza deserves its own character code in the Unicode Standard

We are Wikimedia. We are not the Unicode Consortium defining the Unicode standard.

@Pppery @Aklapper I was asked by @Aaharoni-WMF and @srishakatux to create this task as per the Language Community Meeting last month. They said that since the Foundation is now a member of the Unicode Consortium, they plan to have this letter included in the standard.

"Font designers can use language tagging in order to support the preferred shapes for both Kazakh and Jawi in multilingual fonts."

The issue is that this will not be compatible with platforms that do not support language tagging or different fonts. For example, if we were to post on social media or text someone in Jawi, there is literally no way to write that character, unless we install our own custom font on the mobile phone of every user who wishes to write in Jawi. But that would also make Kazakh text appear wrongly, as the glyph would now be too large for Kazakh. While we're on this track, why not just remove all instances of non-Latin characters, and use Latin equivalent characters with a custom font to display "abjdr" as "ابجدر"?

Thank you for the report. Do you use this letter in Wikimedia projects? How do you do it?

Thank you for the report. Do you use this letter in Wikimedia projects? How do you do it?

My pleasure. Yes, we use it on a few Wikimedia projects, most notably Malay Wikipedia, but also others such as English Wikipedia, Commons, and also the Wikimania wiki. Basically where we will need to translate certain things to Jawi.

The current implementation is through encasing the hamza in a span box, and changing its style to alter the relative position by a certain amount of distance from the bottom, as seen below:

<span style="bottom: 0.26em; position: relative;">ء</span>

The span box is usually called through a template. I have attached Figure 6 for a clearer picture on how it is implemented. It is also used a lot on https://ms.wikipedia.org/wiki/Pemasyhuran_Kemerdekaan_Tanah_Melayu where you can see it being used in the Jawi transcription of the Malayan Declaration of Independence.
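As a rough illustration of this workaround and its limitation, the Python sketch below builds the same kind of span and then strips the markup, which is roughly what happens when the text is copied into a plain-text field. The function names and the tag-stripping regular expression are assumptions made for this example and are not the actual wiki template.

```python
import re

HAMZA = "\u0621"  # regular Arabic hamza, the only character actually stored

def raised_hamza(offset_em=0.26):
    """Wrap the regular hamza in a span that shifts it upwards (hypothetical helper)."""
    return f'<span style="bottom: {offset_em}em; position: relative;">{HAMZA}</span>'

def strip_tags(html):
    """Crude approximation of copying rendered text into a plain-text field."""
    return re.sub(r"<[^>]+>", "", html)

prefix = "\u062F\u0627\u062A\u0648"  # the letters before the final hamza
styled = prefix + raised_hamza()
plain = prefix + HAMZA

# Once the markup is gone, nothing distinguishes the raised hamza from a regular one.
print(strip_tags(styled) == plain)
```

The raised position lives entirely in the styling, so any platform that discards the span is left with an ordinary U+0621.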

As an active user of various Malay Wikimedia projects, as well as a native Malay speaker who has learnt the Jawi script from a very young age, I would like to support the effort made by @EmpAhmadK.

I just noticed that Figures 4.1 and 4.2 were not publicly visible, so I changed the settings for both files and they can now be seen. These two figures are important as they show the difference between the Kazakh high hamza and the Jawi hamza.

@EmpAhmadK and all: In a recent conversation with WMF engineers @SToyofuku-WMF, @Ladsgroup, @santhosh, they acknowledged that this appears to be a legitimate issue. They looked into your community’s previously submitted proposal to Unicode and are trying to make sense of the suggestion that the responsibility should lie with font designers. They were wondering if the preferable course of action would be to have fonts specifically support the Jawi 3/4 hamza using the U+0674 HIGH HAMZA character when writing in the Jawi script.

It was mentioned that it is common to see a font supporting a script and, when that script is used for multiple languages, to include language-specific variations through additional glyph variations provided via the locl OpenType feature. For example, Devanagari fonts often support Hindi, Marathi, Nepali, and others. Marathi has a different "la" glyph, and instead of creating a separate Marathi font, the same Devanagari font provides the variation using the locl feature. The Arabic script and its language-specific variations follow a similar approach.
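The idea behind language-specific glyph selection can be sketched as a toy model in Python. This is not how a font is actually built or queried: the glyph names, the language keys, and the `select_glyph` function are all hypothetical, and a real font would express the substitution as a lookup under the locl feature. The sketch only shows the dependency that matters for this discussion: the alternative glyph is chosen only when the renderer is told the language.

```python
# Toy model of language-specific glyph selection (hypothetical names throughout).
DEFAULT_GLYPHS = {0x0674: "highhamza"}               # shape used when no language applies
LOCALIZED_GLYPHS = {"ms": {0x0674: "highhamza.jawi"}}  # per-language overrides

def select_glyph(code_point, lang=None):
    """Pick a glyph name for a code point, preferring a language-specific form."""
    localized = LOCALIZED_GLYPHS.get(lang, {})
    return localized.get(code_point, DEFAULT_GLYPHS[code_point])

print(select_glyph(0x0674, "ms"))  # language known: the localized form is chosen
print(select_glyph(0x0674, "kk"))  # no override for this language: default form
print(select_glyph(0x0674))        # no language information at all: default form
```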

The questions for you are:

  • Have you tried seeking font support in general? And, if so, what have been the challenges in getting font support?
  • Have you tried filing bugs in upstream projects like Noto or worked with font designers to implement this? Similar to this issue, perhaps you can consider reporting the issues?
  • What are the fallback fonts in these cases, and could the support not be added to them?

@SToyofuku-WMF, @Ladsgroup, @santhosh Please add if I missed anything from our discussion.

@srishakatux Sorry for the late response, as I have just started my new semester. Thank you very much for the prompt discussion and response on this matter.

Regarding the questions asked,

  • There are certain fonts which do support the Jawi 3/4 hamza, including Amiri, Reem Kufi, and a few others, as well as custom fonts by Malay users who alter existing fonts for this purpose. However, the problem with doing this is that it no longer allows for the use of the Kazakh high hamza, as the character is now displayed as the Jawi hamza. Although it is possible to change the glyph contextually, this would require the system to recognise whether we are typing in Jawi or Kazakh, which is nearly impossible to implement on text messaging or social media platforms such as WhatsApp, Discord, X/Twitter, and so on. We want everyone to see the characters as they should be, whether Kazakh high hamza or Jawi 3/4 hamza, irrespective of what platform they use and what fonts they have downloaded.
  • Some have already filed this to font providers, especially those that publish their fonts on Google fonts. I cannot remember the details of how many were approached and all, but the issue mentioned above still stands.
  • The thing is, most fonts already support U+0674 as it is part of the main Arabic Unicode block, so there are not many cases that require fallback fonts. The issue is that they would either not depict it as the Jawi 3/4 hamza, or they would depict it as the Jawi variant but no longer be able to show the Kazakh high hamza, or at least not on formatless platforms.

The issue is not whether fonts could support the glyph using U+0674 or U+0621. As I mentioned, there are several fonts which have a workaround to support this character, which is how the character has been able to somewhat survive. The issue is that they are stuck prioritising either Jawi or Kazakh, and cannot support both at the same time.

There are even fonts, notably those created by Airaha as mentioned in the main post, which use conditional rules for U+0621, whereby if it is placed in positions where the Jawi hamza commonly appears, it is displayed as a Jawi hamza. However, such rules cannot cover all instances without affecting instances which use the regular hamza (i.e. the font cannot differentiate between داتوء and وضوء, because in both the hamza is placed at the end of the word, preceded by a و, but only the first is supposed to use a 3/4 hamza).
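The ambiguity can be demonstrated with a small Python sketch. The rule below is a simplified stand-in written for this illustration, not the actual rule used in any of the fonts mentioned: it fires whenever a word ends in waw followed by the regular hamza. Both example words satisfy it, so a purely contextual rule cannot tell them apart.

```python
WAW = "\u0648"
HAMZA = "\u0621"

# The two example words from the comment above, spelled out as code points.
WORD_JAWI_HAMZA = "\u062F\u0627\u062A\u0648\u0621"   # should use the 3/4 hamza
WORD_REGULAR_HAMZA = "\u0648\u0636\u0648\u0621"      # should keep the regular hamza

def rule_fires(word):
    """Hypothetical contextual rule: a word-final hamza preceded by waw."""
    return word.endswith(WAW + HAMZA)

# The rule matches both words, although only the first should be affected.
print(rule_fires(WORD_JAWI_HAMZA), rule_fires(WORD_REGULAR_HAMZA))
```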

The same goes for displaying the Kazakh high hamza as the Jawi 3/4 hamza: there may be instances where the system would confuse Jawi with Kazakh or vice versa whenever it is not informed of what language the text is in, for example on social media.

If we want to address this issue only on platforms that support language specification, such as MS Word or Photoshop, it could long since have been addressed without even needing U+0674, as we could simply use the regular hamza and apply a vertical position adjustment to the letter, as many people including myself do. The same goes for websites or Wikipedia, where we have such a system in place. But these are temporary solutions, because in the end, the purpose of Unicode is for people to be able to simply press a key to insert the character, and for everyone across different platforms to see the exact same letter, no matter what font or system they use or where the text is written.

Restricted Application added a subscriber: alaa. · Apr 1 2025, 7:41 PM
MaryMunyoki triaged this task as Medium priority. · Apr 9 2025, 4:26 PM
MaryMunyoki removed Amire80 as the assignee of this task. · Oct 31 2025, 11:28 PM
MaryMunyoki moved this task from In Progress to Backlog on the LPL Onboarding and Development board.
MaryMunyoki subscribed.

Follow up, sub task will be created at some point, for now returning ticket to backlog

I’m removing the LPL Onboarding and Development tag from this task. Currently, we have a work-in-progress proposal developed by Language and Product Localization, in collaboration with @EmpAhmadK. The proposal still needs input from Malay community members before it can be submitted to the Unicode Consortium. Once there’s momentum on this request, it can move forward.