Display a warning when entering Zawgyi-encoded Burmese
Open, Needs Triage, Public

Description

Could the Wikidata frontend display a warning when users enter strings in Zawgyi encoding?

Background: In Myanmar, the majority of users use legacy keyboards that don’t produce proper Unicode. Instead, they produce a pseudo-Unicode encoding (Zawgyi) that is almost like Unicode, except that the Myanmar codepoints have non-standard semantics. Web browsers transmit such strings to web servers as structurally valid Unicode text, typically in UTF-8 encoding. On computer systems that implement Unicode according to spec (i.e., almost all computer systems outside Myanmar), Zawgyi-encoded text appears as illegible garbage. This is a notorious problem for all major web services, and sadly it won’t go away anytime soon (complicated story). The Unicode Myanmar FAQ recommends catching Zawgyi as early as possible and storing all text in proper Unicode in the backend database. To my knowledge (which might be wrong), the Burmese Wikipedia has a group of users who manually look for Zawgyi and correct it in Wikipedia articles.

For Wikidata, my proposal would be to detect and discourage Zawgyi in the Wikidata user interface. When a user enters a string, Wikidata would run a Zawgyi detector. If the text has a high Zawgyi likelihood, the UI would then display a warning symbol, perhaps similar to constraint violations. Unfortunately, Zawgyi can’t be 100% reliably detected, especially not on very short strings. So users should still be able to enter a string that gets flagged. But a warning would help; users can then switch to a proper Unicode keyboard.
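To make the proposed behavior concrete, here is a minimal sketch of the "warn but never block" policy. Everything in it is an assumption made for illustration: the names `ValidationResult` and `validate_string`, the injected `detect` callable (any function that returns a Zawgyi probability between 0 and 1), and the 0.95 threshold. It is not how Wikidata validates input today.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ValidationResult:
    accepted: bool        # always True: a Zawgyi warning never blocks saving
    zawgyi_warning: bool  # True if the UI should show a warning symbol
    probability: float    # detector score, kept so the UI can explain the flag


def validate_string(
    text: str,
    detect: Callable[[str], float],  # hypothetical detector: text -> probability
    threshold: float = 0.95,         # hypothetical cutoff, would need tuning
) -> ValidationResult:
    """Flag likely Zawgyi input without rejecting it."""
    probability = detect(text)
    return ValidationResult(
        accepted=True,
        zawgyi_warning=probability >= threshold,
        probability=probability,
    )
```

Passing the detector in as a parameter keeps the policy independent of which detection library is used and of whether it runs on the client or on the server.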

To detect Zawgyi, you could use Google’s open-source Zawgyi detector, which is available in multiple programming languages. If requested by Wikimedia, I could imagine that Google might port this detector to additional programming languages. (However, I don’t work at Google anymore, so I can’t actually promise this.) One option would be to run the detector on the web client, using the JavaScript library; another would be to run the detector on the server, returning the Zawgyi detection status like any other validation error. The detection library does not transmit data to Google; you can verify this in the source code.

Zawgyi detection would only need to be called on strings that contain characters from the Myanmar Unicode block (U+1000 to U+109F). This is very fast to check, so the latency impact would be effectively zero for most users.
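The pre-check described above could look roughly like the following sketch. The function name `needs_zawgyi_check` is hypothetical, and the sketch assumes that testing only the main Myanmar block (U+1000 to U+109F) is sufficient; the Myanmar Extended blocks are deliberately left out.

```python
import re

# Main Myanmar block, U+1000..U+109F.
# Assumption: the Myanmar Extended-A/B blocks are not checked here.
_MYANMAR_BLOCK = re.compile("[\u1000-\u109F]")


def needs_zawgyi_check(text: str) -> bool:
    """Return True if the text contains at least one Myanmar-block character."""
    return _MYANMAR_BLOCK.search(text) is not None
```

Strings without any Myanmar characters, which is the common case for most Wikidata users, would skip the detector entirely.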