Normalize homoglyphs in mixed-script tokens when possible
Open, Normal, Public

Description

Oзон and Озон look the same, but the first one starts with a Latin O rather than a Cyrillic О. Searching for either will not find the other. These errors are not common, but they do occur on many wikis.
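To see the difference the eye cannot, it helps to print the code points. The following is a minimal Python sketch; the helper name `describe` is made up for illustration, and the two strings are written with explicit escapes so the Latin and Cyrillic letters cannot be confused.

```python
import unicodedata

def describe(token):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in token]

mixed = "\u004F\u0437\u043E\u043D"     # Latin O followed by Cyrillic зон
cyrillic = "\u041E\u0437\u043E\u043D"  # all-Cyrillic Озон

for label, token in (("mixed", mixed), ("cyrillic", cyrillic)):
    print(label)
    for ch, code_point, name in describe(token):
        print("  ", ch, code_point, name)
```

Only the first character differs: LATIN CAPITAL LETTER O in one string, CYRILLIC CAPITAL LETTER O in the other.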

We can attempt to map homoglyphs (characters that look the same, like O and О) in mixed-script tokens and additionally index any single-script variants we can generate.


Original Title: Russian characters not normalized to same form in search

Original Description:
These look the same, or at least render the same, but only one of them returns results:

a: Oзон
b: Озон

a: no results
https://ru.wikipedia.org/w/index.php?search=~O%D0%B7%D0%BE%D0%BD&ns0=1
b: has results
https://ru.wikipedia.org/w/index.php?search=~%D0%9E%D0%B7%D0%BE%D0%BD&ns0=1
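The two URLs already show the difference in their percent-encoding: in (a) the first letter is a bare ASCII `O`, while in (b) it is the two-byte UTF-8 sequence `%D0%9E`. A small Python sketch that decodes both queries (the helper name `search_term` is hypothetical):

```python
from urllib.parse import urlsplit, parse_qs

def search_term(url):
    """Extract and percent-decode the 'search' query parameter."""
    return parse_qs(urlsplit(url).query)["search"][0]

url_a = "https://ru.wikipedia.org/w/index.php?search=~O%D0%B7%D0%BE%D0%BD&ns0=1"
url_b = "https://ru.wikipedia.org/w/index.php?search=~%D0%9E%D0%B7%D0%BE%D0%BD&ns0=1"

for url in (url_a, url_b):
    term = search_term(url)
    # Print each character's code point so the Latin O stands out.
    print(term, [f"U+{ord(ch):04X}" for ch in term])
```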

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 6 2019, 11:26 PM

The first string includes two characters that are Latin rather than Cyrillic.
The Cyrillic characters in contemporary Russian that have the same or similar pronunciation as their Latin look-alikes would be: АаЕеКМОоТ

debt added a subscriber: debt.

@TJones this might be an interesting thing to look at! :)

I'm seeing that only the first O in (a) is Latin, but the effect is the same: it gets no results. (Search for Latin "O" on the page and the first character of (a) will be highlighted.)

It's not really a normalization problem because the characters are not in the same character set, and we wouldn't want to generally normalize across character sets.

I have a 10% project planned to work on a plugin that looks for tokens with mixed character sets, "projects" them from Latin to Cyrillic and from Cyrillic to Latin, and then keeps any projected tokens that come out as only one character set. In this case, the projection to Latin wouldn't work out because lowercase "н" doesn't correspond to anything in Latin (there is actually a Latin small capital ʜ, but it is rarely used). The projection of (a) to Cyrillic works, though, and gives an all-Cyrillic token, so I'd index (a) both as (a), just in case it was intentional, and as (b). I haven't gotten around to it yet. I can re-purpose this ticket for that plugin.
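The projection idea could be sketched roughly as follows. This is a minimal Python illustration, not the planned plugin: the homoglyph table is a small hand-picked subset of Cyrillic/Latin look-alikes, script detection leans on Unicode character names as a stand-in for a real script property, and every function name here is hypothetical.

```python
import unicodedata

# Assumed subset of look-alike pairs; position i in one string matches
# position i in the other.
CYRILLIC = ("\u0430\u0410\u0441\u0421\u0435\u0415\u043E\u041E\u0440\u0420"
            "\u0445\u0425\u0443\u0423\u041A\u041C\u041D\u0422\u0412")  # аАсСеЕоОрРхХуУКМНТВ
LATIN = "aAcCeEoOpPxXyYKMHTB"

TO_LATIN = str.maketrans(CYRILLIC, LATIN)
TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)

def script_of(ch):
    """Approximate a character's script from the first word of its name."""
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else "UNKNOWN"

def scripts(token):
    """Set of scripts used by the letters in a token."""
    return {script_of(ch) for ch in token if ch.isalpha()}

def project(token, table, target):
    """Map homoglyphs; keep the result only if it is entirely `target`."""
    projected = token.translate(table)
    return projected if scripts(projected) == {target} else None

def index_variants(token):
    """The original token plus any single-script projections of it."""
    variants = [token]
    if scripts(token) == {"LATIN", "CYRILLIC"}:
        for table, target in ((TO_CYRILLIC, "CYRILLIC"), (TO_LATIN, "LATIN")):
            projected = project(token, table, target)
            if projected is not None and projected not in variants:
                variants.append(projected)
    return variants

# Latin O + Cyrillic зон: only the Cyrillic projection survives.
print(index_variants("\u004F\u0437\u043E\u043D"))
```

The original token is always kept, in case the mixed spelling was intentional. A projection that still contains a letter from the other script is simply dropped.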

Here's the list of Cyrillic characters that map to Latin characters that I've run into (I search for homoglyphs and correct them sometimes on my volunteer account for "fun"). I include "к" because people use it, even though it doesn't look exactly like "k"; some, especially "Ԛ" and "Ԝ", are only convincing as normal Latin characters if you have the right fonts (which I do not on Phabricator). The set "а́е́і́о́у́" is composed (a plain character such as "а" + a combining acute accent), but I do see them used in place of the precomposed Latin analogs.

аАӑӐӓӒӕӔВсСҫҪеЕѐЀёЁӗӖәӘНіІїЇјЈкКМоОӧӦрРԚѕЅТԜхХуУӯӱа́е́і́о́у́ћз
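The composed characters show why Unicode normalization alone does not help here: NFC has nothing to compose a Cyrillic "а" plus combining acute into, whereas the Latin pair composes to "á". One possible way to handle them, sketched in Python under the assumption of a per-character Cyrillic-to-Latin table (the function name and the tiny table are illustrative only), is to decompose, map the base letters, and recompose:

```python
import unicodedata

def to_latin_precomposed(token, cyr_to_lat):
    """Decompose, map base letters to Latin, then recompose with NFC."""
    decomposed = unicodedata.normalize("NFD", token)
    mapped = "".join(cyr_to_lat.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFC", mapped)

# Assumed two-entry table: Cyrillic а and е to Latin a and e.
table = {"\u0430": "a", "\u0435": "e"}

cyr_a_acute = "\u0430\u0301"  # Cyrillic а + combining acute accent

# NFC leaves the Cyrillic sequence as two code points...
print(len(unicodedata.normalize("NFC", cyr_a_acute)))
# ...but after mapping the base letter it composes to Latin á (U+00E1).
print(to_latin_precomposed(cyr_a_acute, table))
```

Because the mapping runs on the decomposed form, precomposed Cyrillic letters such as "ё" are covered by the entry for their base letter.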

I'm also planning to work on Latin/Greek, Greek/Cyrillic, and other pairs of scripts with homoglyphs. I shudder to think whether there are any words with three or more character sets used, but at this point I wouldn't be surprised.

And, of course there is one on English Wikipedia (until someone fixes it): Kиïвсьκa—mostly Cyrillic, but the K and ï are Latin, and the κ is Greek.

I found a few more candidates on Russian Wikipedia, too, including Валерiϊвна—mostly Cyrillic, but i is Latin and ϊ is Greek. The tail on the Greek iota can be missing, depending on the font. I see it here, but not on Russian Wikipedia.
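Tokens like these can be found by grouping a token's characters by script. The Python sketch below is a rough approximation that takes the first word of each character's Unicode name as its script (the standard library has no script property), and the sample token is rebuilt from the description above with explicit escapes, on the assumption that the final "а" is Cyrillic. The function name is hypothetical.

```python
import unicodedata

def script_breakdown(token):
    """Group a token's characters by the first word of their Unicode name."""
    breakdown = {}
    for ch in token:
        name = unicodedata.name(ch, "")
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        breakdown.setdefault(script, []).append(ch)
    return breakdown

# Latin K and ï, Greek κ, everything else Cyrillic (assumed reconstruction).
token = "K\u0438\u00EF\u0432\u0441\u044C\u03BA\u0430"

for script, chars in script_breakdown(token).items():
    print(script, "".join(chars))
```

Any token whose breakdown has more than one key is a candidate for homoglyph review.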

TJones renamed this task from Russian characters not normalized to same form in search to Normalize homoglyphs in mixed-script tokens when possible. · May 18 2019, 8:30 AM
TJones triaged this task as Normal priority.
TJones claimed this task.
TJones updated the task description.

I'll start with Latin/Cyrillic for the hackathon, and then try to add Greek (covering Latin/Greek, Greek/Cyrillic, and maybe all three at once), and then look into other potential homoglyph script pairs.

I made a little progress. I struggled with Java, and although I was the underdog, I got somewhere. I'm shifting this to a 10% project now, so I'll work on it in fits and starts over the coming months.