
Null or inconsistent search results using Khmer script
Open, Normal, Public

Description

This idea was submitted to the current Inspire Campaign focused on new readers. It details an issue: users in Cambodia are unable to search Wikipedia using their native written language, Khmer. Here is an excerpt from the idea page itself (bolded text is from me) describing the issue in more detail:

For example, a word meaning 'eat' is: ញ៉ាំ pronounced nyarm.
It can be written with the following keystrokes. Note that the script on the left looks the same regardless. Note also that uppercase is achieved by SHIFT + keystroke, so the symbol " is created by SHIFT + '
ញ៉ាំ J"am -- Note that only this first spelling gets any results on Wikipedia, for the definition in the sister project Wikitionary.
ញុាំ JuaM
ញាុំ JauM
ញំុា JMua
ញំាុ JMau
Even though the scripts on the left look the same, if they are pasted into Wikipedia Search as search terms, each version will generate completely different results. I am guessing that this is because Wikipedia indexes and searches based on the Unicode sequence, not the resulting script.

To provide some concrete examples with links to search results on Khmer Wikipedia (km.wikipedia.org):

  • Search using ញ៉ាំ (one way to write the word for "eat"): search 1, providing appropriate results
  • Search using ញាុំ (another way to write the word for "eat"): search 2, providing one search result not relevant to the meaning of the search term
  • Search using ញំាុ (another way to write the word for "eat"): search 3, a null result.
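The differing results are easier to see when the spellings are written out as code points. The following is a minimal sketch, assuming the five spellings in the excerpt correspond to the code point sequences below (my reading of the strings and keystrokes quoted above); the helper name `code_points` is made up for illustration. Note that the first spelling uses a different character (MUUSIKATOAN) rather than a re-ordering of the same ones, and that standard Unicode normalization does not appear to merge any of the spellings.

```python
import unicodedata

# The five spellings from the excerpt, written as escapes so the order is explicit.
# Assumption: these sequences match the strings quoted in the task description.
variants = [
    "\u1789\u17C9\u17B6\u17C6",  # J"am
    "\u1789\u17BB\u17B6\u17C6",  # JuaM
    "\u1789\u17B6\u17BB\u17C6",  # JauM
    "\u1789\u17C6\u17BB\u17B6",  # JMua
    "\u1789\u17C6\u17B6\u17BB",  # JMau
]


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


for v in variants:
    # Show both the code points and the official character names.
    print(code_points(v), [unicodedata.name(ch) for ch in v])

# Count distinct strings before and after NFC normalization.
distinct_raw = len(set(variants))
distinct_nfc = len({unicodedata.normalize("NFC", v) for v in variants})
print(distinct_raw, distinct_nfc)
```

A search index that compares strings code point by code point would treat each of these as a separate term, which is consistent with the three different result pages above.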

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 25 2018, 9:47 PM
I_JethroBT updated the task description.

Adding Trey (@TJones) to this task in case any related search work might be useful in ascertaining the cause of the problem or what approaches might help address it.

Restricted Application added a project: Discovery. · Jan 26 2018, 2:27 AM

Screenshot of the task description of this very task; Firefox 58 on Fedora 27:


Developer Tools output for <p> element in the "Fonts" section says (apart from "Lato" font for Latin script):

Khmer OS System system
Used as: "Khmer OS System"

$:acko\> rpm -qa | grep khmer
khmeros-fonts-common-5.0-23.fc27.noarch
khmeros-base-fonts-5.0-23.fc27.noarch

Yeah, this is definitely relevant to CirrusSearch. I did a quick review of the Wikipedia page on the Khmer script, and dug into one of the sources (Huffman), and "dictionary order" is... complicated. By chance I happened to be sitting 10 feet from @Aklapper when he commented on this, and on his computer they do all look the same (see screenshot above). On my computer (Mac, with a few Khmer fonts installed), they look different!

The Google Noto Khmer fonts do even weirder things; they don't accept some of the orderings, and so don't combine the characters:

The solution, if people writing in Khmer do not use any canonical ordering of the characters, would be to re-order them according to some standard and then index and search using that order. That is very much complicated by the fact that Khmer doesn't require spaces between words. This seems to be more than we can handle straightforwardly. I'll look to see if there are any good tools out there.

I'm on OS X 10.10.5, and I've discovered that the rendering of Khmer script varies by font, but even more so the support for the advanced properties of Khmer fonts varies widely by application.

I found and installed the "khmeros" fonts @Aklapper has (thanks!!), which can be found here.

In TextEdit various fonts look like this:

While in Chrome they look like this:

Firefox looks more or less like Chrome, but with poorer line spacing. Safari looks like TextEdit, and both are unusable.

The Google Noto fonts do reasonably well, but the "au" order isn't rendered quite right, while the "ua" order is. The Khmer OS fonts look good.

So, anyone who is not already familiar with Khmer computing should find some combination of fonts and applications that allows them to render the examples above correctly. For OS X, the Khmer OS fonts and Chrome seem to do the right thing (and after disabling the fonts that don't work, Phab is now showing things correctly!!).

(Not any closer to a solution, but at least now I have a better handle on the problem.)

Might the solution be as simple as implementing a Khmer spell checker in the contribution text area of the wiki that detects Khmer script, and also a Khmer spell checker in the search box?

Even a spell checker of average quality might be better than none.

Another or complementary option might be to create an algorithm to extend all search terms into all possible spelling sequences, and then combine all of their results into the one results page. This algorithm could be based off the Unicode rendering rules (if such a thing exists) so that only versions that render the same are included in the extended list. By 'rendered the same' I mean to use the fact that only some character sequences function correctly. Incorrect character sequences create placeholders or a sequence of characters that is visually different.
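As a rough illustration of the expansion idea, here is a minimal sketch that generates every ordering of each run of dependent vowels and signs in a search term. It assumes that marks in the range U+17B6 to U+17D1 are the ones that get re-ordered, and it does not check whether the results actually render the same, so it over-generates; the function name `expand_variants` is hypothetical.

```python
import re
from itertools import permutations, product

# Assumption: only runs of dependent vowels and signs (U+17B6..U+17D1) are
# permuted. COENG (U+17D2) and the consonants are left where they are.
MARK_RUN = re.compile("([\u17B6-\u17D1]+)")


def expand_variants(term):
    """Return every spelling of `term` obtained by re-ordering each mark run."""
    # Because the pattern has a capturing group, re.split keeps the mark runs
    # at the odd indices of the resulting list.
    parts = MARK_RUN.split(term)
    options = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # All distinct orderings of this run of marks.
            options.append(sorted({"".join(p) for p in permutations(part)}))
        else:
            options.append([part])
    return ["".join(combo) for combo in product(*options)]


# The "JuaM" spelling from the task description, written as escapes.
queries = expand_variants("\u1789\u17BB\u17B6\u17C6")
print(len(queries))
```

Each generated string could then be sent as a separate query and the results merged. The number of variants grows factorially with the length of a mark run, so a rendering-based filter of the kind described above would be needed to keep the list short.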

ប្រើ (bjr;) looks different to បើ្រ (b;jr), and ើប្រ (;bjr) so would not need to be included in extended search terms. This is because the user could visually see that they have the incorrect key-press sequence by looking at the represented character sequence.

The way the Unicode works with Khmer script is rather brilliant because it manages things like longer descenders when the collection of symbols goes lower or higher than a typical collection. So I'm guessing it has rules to follow, and these rules might help in determining comparable Unicode value sequences.

ង្ក្រ compared to ក្រ

TJones added a comment. Feb 5 2018, 6:31 PM

The spellchecker idea is an interesting one. I don't know what support there is for a Khmer spellchecker, but perhaps people who compute in Khmer already have it active on their computers. I don't know if we could supply a spellchecker, though. And of course people are free to ignore the spellchecker—I ignore mine all the time.

The possibly good news is that the icu_tokenizer, already used on Khmer-language projects, seems to do a reasonable job of tokenizing Khmer, at least into syllables. Interestingly, it seems to ignore nonsensical characters: for ើប្រ, it just ignores the leading " ើ", which seems to have nothing to properly attach to.

... create an algorithm to extend all search terms into all possible spelling sequences, and then combine all of their results into the one results page. This algorithm could be based off the Unicode rendering rules (if such a thing exists) so that only versions that render the same are included in the extended list.

The more direct approach would be to re-order the characters within each token into some canonical order (for example, moving all dependent vowels, supplementary consonants, and other diacritics to the end of the word) before indexing them. Then all variants would be interchangeable for search purposes. I don't know what the canonical order would be yet, or whether a good one exists, but my current analyzer analysis tools would make it easy to evaluate, and I could do it offline by re-ordering them with a standalone tool.
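To make the re-ordering idea concrete, here is a minimal sketch of a standalone normalizer. The ordering it uses (sorting each run of dependent vowels and signs by code point) is purely an assumption for illustration, not an established canonical order for Khmer, and the name `reorder_marks` is hypothetical. COENG sequences (U+17D2 plus a consonant) are deliberately left untouched.

```python
import re

# Assumption: marks in U+17B6..U+17D1 may be sorted by code point within a run.
# This is an illustrative order only, not a known Khmer standard.
MARK_RUN = re.compile("[\u17B6-\u17D1]+")


def reorder_marks(text):
    """Sort each run of dependent vowels and signs into a fixed order."""
    return MARK_RUN.sub(lambda m: "".join(sorted(m.group(0))), text)


# The four re-ordered spellings from the task description (JuaM, JauM, JMua, JMau).
reordered = [
    "\u1789\u17BB\u17B6\u17C6",
    "\u1789\u17B6\u17BB\u17C6",
    "\u1789\u17C6\u17BB\u17B6",
    "\u1789\u17C6\u17B6\u17BB",
]
keys = {reorder_marks(s) for s in reordered}
print(len(keys))

# The first spelling (J"am) uses a different character, U+17C9, so re-ordering
# alone does not map it onto the others.
first_key = reorder_marks("\u1789\u17C9\u17B6\u17C6")
print(first_key in keys)
```

Applying the same function at index time and at query time would make the four re-ordered spellings interchangeable. The spelling with MUUSIKATOAN would still need a separate mapping, and orderings that render differently, such as the misplaced-COENG examples earlier in this thread, stay distinct because the sketch never moves a mark across U+17D2 or a consonant.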

The tokenization was the scariest part, so doing the next step of the analysis is definitely tractable.

@Eltimbalino, do you have any more examples you can provide? A few alternate orderings that render the same (like variants of ញ៉ាំ), and alternate orderings that don't work (like ប្រើ) would be useful for basic testing of any approach.

EBjune triaged this task as Normal priority. Feb 6 2018, 6:27 PM
EBjune added a subscriber: EBjune.

We need to get a sense of the frequency of this issue, and whether there is a canonical order we can compute.

EBjune moved this task from needs triage to Up Next on the Discovery-Search board. Feb 6 2018, 6:28 PM
TJones moved this task from Up Next to later on... on the Discovery-Search board. Nov 13 2018, 6:47 PM
TJones claimed this task. Tue, Aug 20, 3:24 PM