Malayalam is tricky when it comes to mapping Unicode sequences to
characters. The current version of the Unicode Standard (17.0)
covers this in some detail in Section 12.9.
The issue I am reporting is that the so-called chillu consonant forms
are not rendered correctly in the recording tool and in the word lists
when the chillu form is stored in the legacy Unicode encoding for
chillus. This legacy encoding represents a chillu character as a
three-character sequence, and it is not obsolete.
This is explained in more detail in the relevant subsection of
Section 12.9 of the Unicode Standard.
Moreover, on Linux the mal and mal(lalitha) keyboards offer no other
way to obtain a chillu form: one has to type the three-character
sequence CK, VIRAMA, ZWNJ, where CK is the consonant to be shown in
chillu form, VIRAMA is the "chandrakala" character, and ZWNJ is the
zero-width non-joiner. The five chillu forms that can be encoded in
this way are shown in Table 12-42 of the Unicode Standard.
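Because the joiner is invisible, it helps to list the code points of a string when checking which form is actually stored. The following is a minimal Python sketch using only the standard `unicodedata` module; the function name `describe` is hypothetical, and the sample strings are written with escapes so that the joiner is visible in the source. Both joiners are shown, since the sequence described above uses ZWNJ (U+200C) while other input methods may emit ZWJ (U+200D).

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

# Three-character sequences: consonant NA, VIRAMA, then a joiner.
with_zwnj = "\u0D28\u0D4D\u200C"
with_zwj = "\u0D28\u0D4D\u200D"
# Atomic (modern) chillu letter for the same consonant.
atomic = "\u0D7B"

for label, sample in [("ZWNJ", with_zwnj), ("ZWJ", with_zwj), ("atomic", atomic)]:
    print(label)
    for entry in describe(sample):
        print("  " + entry)
```

Running this on text copied from the input box or from a word list shows directly whether the stored string contains the atomic letter or a three-character sequence, and which joiner that sequence ends with.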
The following screenshots illustrate the issue with the word അവാന് (he, distal form). Note that the Wiktionary page stores the modern form of the chillu character; a copy-paste-ready version in the legacy encoding is അവന്. The screenshots show the following:
- The first screenshot shows the situation after entering the word with the legacy encoding in the input box, but before pressing Enter. The chillu character is rendered correctly in the input box.
- The second screenshot shows the situation after pressing Enter. The chillu variant of the last letter is not rendered correctly. Only a fallback variant of ന with the symbol ് on top is shown.
- The third screenshot shows the situation in the recording tool, which is the same as in the second screenshot.
The legacy Lingua Libre Recorder handles this without problems. In
fact, in the uploaded file and on the Wikimedia Commons page, the
chillu encoding has been transparently converted from the legacy style
to the modern style.
Having dealt with similar issues in other contexts, I would recommend
either using a font that renders the legacy sequence correctly (e.g.
Noto Sans/Serif Malayalam) or, better, converting the encoding in such
cases on the fly (e.g. via a lookup table).
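The lookup-table approach could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the tool's actual code: the names `ATOMIC_CHILLU` and `normalize_chillus` are hypothetical. The table covers the five consonants with atomic chillu letters at U+0D7A to U+0D7E. By default only sequences ending in ZWJ (U+200D) are converted, which is the joiner the Unicode Standard gives for the pre-5.1 representation as far as I can tell; converting sequences that end in ZWNJ (U+200C), as described in this report, is opt-in through the `joiners` parameter, because ZWNJ after a virama can also be an explicit request for a visible chandrakkala.

```python
import re

VIRAMA = "\u0D4D"
ZWJ = "\u200D"
ZWNJ = "\u200C"

# Base consonant -> atomic chillu letter (available since Unicode 5.1).
ATOMIC_CHILLU = {
    "\u0D23": "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28": "\u0D7B",  # NA  -> CHILLU N
    "\u0D30": "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32": "\u0D7D",  # LA  -> CHILLU L
    "\u0D33": "\u0D7E",  # LLA -> CHILLU LL
}

def normalize_chillus(text, joiners=(ZWJ,)):
    """Replace <consonant, VIRAMA, joiner> with the atomic chillu letter.

    Only consonants listed in ATOMIC_CHILLU are affected; all other
    text, including other uses of the joiners, is left untouched.
    """
    if not joiners:
        return text
    pattern = re.compile(
        "([" + "".join(ATOMIC_CHILLU) + "])"
        + VIRAMA
        + "[" + "".join(joiners) + "]"
    )
    return pattern.sub(lambda m: ATOMIC_CHILLU[m.group(1)], text)

# Example: A, VA, NA, VIRAMA, ZWJ becomes A, VA, CHILLU N.
legacy_word = "\u0D05\u0D35\u0D28\u0D4D\u200D"
print(normalize_chillus(legacy_word) == "\u0D05\u0D35\u0D7B")

# Opt in to ZWNJ as well, if the input method emits it.
zwnj_word = "\u0D05\u0D35\u0D28\u0D4D\u200C"
print(normalize_chillus(zwnj_word, joiners=(ZWJ, ZWNJ)) == "\u0D05\u0D35\u0D7B")
```

Applying such a conversion once, when a word is read from the input box or from a word list, would make rendering independent of how well the display font supports the three-character sequence.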


