Normalize homoglyphs in mixed-script tokens when possible
Open, MediumPublic

Description

Oзон and Озон look the same, but the first one starts with a Latin O rather than a Cyrillic О. Searching for either will not find the other. These errors are not common, but they do occur on many wikis.

We can attempt to map homoglyphs (characters that look the same, like O and О) in mixed-script tokens and additionally index any single-script variants we can generate.


Original Title: Russian characters not normalized to same form in search

Original Description:
These look the same, or at least render the same, but only one of them returns results:

a: Oзон
b: Озон

a: no results
https://ru.wikipedia.org/w/index.php?search=~O%D0%B7%D0%BE%D0%BD&ns0=1
b: has results
https://ru.wikipedia.org/w/index.php?search=~%D0%9E%D0%B7%D0%BE%D0%BD&ns0=1

Event Timeline

Restricted Application added a subscriber: Aklapper. May 6 2019, 11:26 PM

The first string includes two characters that are Latin rather than Cyrillic.
A list of Cyrillic characters in contemporary Russian that have the same or similar pronunciation as in Latin would be: АаЕеКМОоТ

debt added a subscriber: debt.

@TJones this might be an interesting thing to look at! :)

I'm seeing that only the first O in (a) is Latin, but the effect is the same: it gets no results. (Search for Latin "O" on the page and the first character of (a) will be highlighted.)
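The same check can be done programmatically. Here is a minimal sketch in Python (chosen purely for illustration; the `describe` helper is hypothetical) that lists the Unicode name of each character in the two strings, built from the code points visible in the URLs above:

```python
import unicodedata

def describe(token):
    """Return (character, Unicode name) pairs for each character in token."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in token]

a = "\u004f\u0437\u043e\u043d"  # (a): Latin O followed by Cyrillic letters
b = "\u041e\u0437\u043e\u043d"  # (b): all Cyrillic

for label, token in (("a", a), ("b", b)):
    for ch, name in describe(token):
        print(f"{label}: U+{ord(ch):04X} {name}")
```

The first character of (a) is reported as LATIN CAPITAL LETTER O, while the first character of (b) is CYRILLIC CAPITAL LETTER O; the remaining three characters are identical.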

It's not really a normalization problem because the characters are not in the same character set, and we wouldn't want to generally normalize across character sets.

I have a 10% project planned to work on a plugin that looks for tokens with mixed character sets, "projects" them from Latin to Cyrillic and from Cyrillic to Latin, and then keeps any projected tokens that come out as a single character set. In this case, the projection to Latin wouldn't work out because lowercase "н" doesn't correspond to anything in Latin (there is actually a Latin small capital ʜ, but it is rarely used). The projection of (a) to Cyrillic works, though, and gives an all-Cyrillic token, so I'd index (a) both as (a), just in case it was intentional, and as (b). I haven't gotten around to it yet. I can re-purpose this ticket for that plugin.
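The projection idea can be sketched roughly as follows. This is only an illustration in Python, not the plugin itself: the mapping table is a small assumed subset, the helper names (`script_of`, `project`, `index_forms`) are hypothetical, and script detection is approximated by Unicode character names.

```python
import unicodedata

# Illustrative subset of Latin -> Cyrillic look-alikes (an assumption for
# this sketch, not the real table).
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "E": "\u0415", "K": "\u041a", "M": "\u041c",
    "O": "\u041e", "T": "\u0422",
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}

def script_of(ch):
    """Rough script guess based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return "Other"

def scripts(token):
    return {script_of(ch) for ch in token}

def project(token, mapping):
    """Replace every character that has a look-alike in the mapping."""
    return "".join(mapping.get(ch, ch) for ch in token)

def index_forms(token):
    """Return the original token plus any single-script projections."""
    forms = [token]
    if scripts(token) == {"Latin", "Cyrillic"}:
        for mapping in (LATIN_TO_CYRILLIC, CYRILLIC_TO_LATIN):
            candidate = project(token, mapping)
            if len(scripts(candidate)) == 1 and candidate not in forms:
                forms.append(candidate)
    return forms

a = "O\u0437\u043e\u043d"       # (a): Latin O + Cyrillic letters
b = "\u041e\u0437\u043e\u043d"  # (b): all Cyrillic
print(index_forms(a) == [a, b])  # True
```

For (a), the Latin projection is discarded because it is still mixed-script, and the Cyrillic projection is kept alongside the original. Tokens containing digits or a third script are left alone in this sketch.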

Here's the list of Cyrillic characters that map to Latin characters that I've run into (I search for homoglyphs and correct them sometimes on my volunteer account for "fun"). I include "к" because people use it, even though it doesn't look exactly like "k". Some, especially "Ԛ" and "Ԝ", are only convincing as normal Latin characters if you have the right fonts (which I do not on Phabricator). The characters in "а́е́і́о́у́" are composed (a plain character such as "а" plus a combining acute accent), but I do see them used in place of the precomposed Latin analogs.

аАӑӐӓӒӕӔВсСҫҪеЕѐЀёЁӗӖәӘНіІїЇјЈкКМоОӧӦрРԚѕЅТԜхХуУӯӱа́е́і́о́у́ћз
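The composed-versus-precomposed distinction matters for any mapping. As a small sketch (assuming the combining mark is U+0301 COMBINING ACUTE ACCENT, and using a hypothetical one-entry `base_map`), NFC normalization leaves the Cyrillic sequence as two code points, but composes it into a single Latin character once the base letter has been projected:

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT (assumed to be the mark in question)

cyrillic_accented = "\u0430" + ACUTE  # Cyrillic a + combining acute
latin_accented = "a" + ACUTE          # Latin a + combining acute

# NFC has no precomposed Cyrillic form to use, but it does have a Latin one.
print(len(unicodedata.normalize("NFC", cyrillic_accented)))  # 2
print(len(unicodedata.normalize("NFC", latin_accented)))     # 1

def project_and_compose(text, base_map):
    """Map base letters to their look-alikes, then compose with NFC."""
    projected = "".join(base_map.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", projected)

base_map = {"\u0430": "a"}  # hypothetical single-entry mapping
print(project_and_compose(cyrillic_accented, base_map) == "\u00e1")  # True
```

Whether a real implementation maps the two-code-point sequence directly or projects and then composes is a design choice; the sketch shows only the second option.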

I'm also planning to work on Latin/Greek, Greek/Cyrillic, and other pairs of scripts with homoglyphs. I shudder to think whether there are any words with three or more character sets used, but at this point I wouldn't be surprised.

And, of course there is one on English Wikipedia (until someone fixes it): Kиïвсьκa—mostly Cyrillic, but the K and ï are Latin, and the κ is Greek.

I found a few more candidates on Russian Wikipedia, too, including Валерiϊвна—mostly Cyrillic, but i is Latin and ϊ is Greek. The tail on the Greek iota can be missing, depending on the font. I see it here, but not on Russian Wikipedia.

TJones renamed this task from Russian characters not normalized to same form in search to Normalize homoglyphs in mixed-script tokens when possible. May 18 2019, 8:30 AM
TJones claimed this task.
TJones triaged this task as Medium priority.
TJones updated the task description.

I'll start with Latin/Cyrillic for the hackathon, and then try to add Greek (covering Latin/Greek, Greek/Cyrillic, and maybe all three at once), and then look into other potential homoglyph script pairs.

I struggled with Java, and although I was the underdog, I made a little progress. I'm shifting this to a 10% project now, so I'll work on it in fits and starts over the coming months.

Mstyles claimed this task. Feb 10 2020, 6:43 PM
Mstyles added a subscriber: Mstyles.

Change 571616 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[search/extra@master] Add homoglyph plugin

https://gerrit.wikimedia.org/r/571616

Change 571616 merged by Gehel:
[search/extra@master] Add homoglyph plugin

https://gerrit.wikimedia.org/r/571616

Mstyles added a comment. Edited Apr 13 2020, 10:44 PM

From an analysis comparing the analysis chain with and without the homoglyph token filter, on a sample of 10,000 random articles for each language:

Russian was the most impacted language during testing, with 1,064 new tokens added by the plugin from a sample of 2,911,553 tokens (0.037%)
Serbian had 154 new tokens generated out of a sample of 1,396,669 tokens (0.011%)
Polish had 32 new tokens generated out of a sample of 1,559,745 tokens (0.002%)
English had 30 new tokens generated out of a sample of 3,165,891 tokens (0.001%)
French had 7 new tokens generated out of a sample of 2,711,550 tokens (less than 0.001%)
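
The percentages follow directly from the counts. As a quick check, a few lines of Python (the `percent` helper is just for illustration) reproduce them:

```python
# (new tokens, total tokens in sample) per language, copied from the figures above
samples = {
    "Russian": (1064, 2911553),
    "Serbian": (154, 1396669),
    "Polish": (32, 1559745),
    "English": (30, 3165891),
    "French": (7, 2711550),
}

def percent(new_tokens, total_tokens):
    """Share of new tokens in the sample, as a percentage."""
    return 100 * new_tokens / total_tokens

for language, (new_tokens, total_tokens) in samples.items():
    # Three decimal places; the French figure is small enough to round to 0.000
    print(f"{language}: {percent(new_tokens, total_tokens):.3f}%")
```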

I also took a look at the comparisons that @Mstyles generated, focusing on the new tokens created, and the new collisions (i.e., words that are newly grouped with other words).

For English, all the new tokens are either all Cyrillic or all Latin, so that's good. There are only 8 new collisions in this sample, which are all like Frӧbel/Fröbel and Алeксандрович/Александрович, which is exactly what we want.

For French, we have a couple of odd mixed Cyrillic/Greek tokens being generated from a mixed Latin/Cyrillic/Greek token, which is unusual but fine. There are no new collisions in this sample, so the impact on the full Wikipedia will be small, but it should be positive.

For Polish, we have one mixed Cyrillic/number token, generated from a mixed Latin/Cyrillic/number token, which is good. There are only 12 new collisions, and like the English sample, they are all the kind we'd want: Kozerodа/Kozeroda and комiтет/Комітет.

For Russian, we have a comparatively large number of mixed Latin/Cyrillic/number tokens that generate Latin/number or Cyrillic/number tokens, but that's fine. Russian has a lot more collisions—347—but they are all of the expected type: Сhristopher/Christopher and Беларусi/Беларусі.

For Serbian, we have about 30 unexpected mixed-script tokens! Some are homoglyphs and some are not. Because Serbian has both Cyrillic and Latin alphabets, and both are used on the wiki (with automatic transliteration between them available), we convert all Cyrillic text into Latin text as part of the stemming process, because the actual stemming only works on Latin text.

Some of the source tokens for these are non-Serbian, like Belarusian "Блакiтная", which uses Cyrillic і instead of и. Serbian uses Cyrillic и and Latin i, so it's often easier for Serbian writers to type the Latin variant, and thus we get a mixed-script input like Блакiтная. However, Serbian doesn't have я, so when Блакiтная is converted to Latin, we get blakitnaя. When the homoglyph filter converts the Latin i to a Cyrillic і, giving Блакітная, we get the transliterated blakіtnaя, with two Cyrillic characters. This is actually okay, because now both Блакiтная and Блакітная will have an underlying token (blakіtnaя) in common at search time and will be able to find each other, even if their internal representation is a bit weird. That was the goal all along.
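To make the Serbian case concrete, here is a rough Python sketch. It assumes the homoglyph step runs before transliteration, covers only a subset of one-to-one Serbian Cyrillic letters, and handles only the single Latin i to Cyrillic і swap; `to_latin` and `indexed_forms` are hypothetical names, not the actual analysis chain.

```python
# Subset of Serbian Cyrillic -> Latin (one-to-one letters only; an assumption
# for this sketch). Letters outside Serbian Cyrillic pass through untouched.
CYR = ("\u0430\u0431\u0432\u0433\u0434\u0435\u0437\u0438\u043a"
       "\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443")
LAT = "abvgdeziklmnoprstu"
SERBIAN_TO_LATIN = dict(zip(CYR, LAT))

def to_latin(token):
    """Transliterate known Serbian Cyrillic letters; keep everything else."""
    return "".join(SERBIAN_TO_LATIN.get(ch, ch) for ch in token)

def indexed_forms(token):
    """Transliterate the token and, if different, its homoglyph-fixed variant."""
    fixed = token.replace("i", "\u0456")  # Latin i -> Cyrillic i (U+0456)
    return {to_latin(token), to_latin(fixed)}

# Lowercased forms of the example word
mixed = "\u0431\u043b\u0430\u043a" + "i" + "\u0442\u043d\u0430\u044f"
all_cyrillic = "\u0431\u043b\u0430\u043a\u0456\u0442\u043d\u0430\u044f"

print(indexed_forms(mixed) & indexed_forms(all_cyrillic))
```

The intersection contains the one shared token, which still has two Cyrillic characters in it (the і and the final я), and that shared token is what lets the two spellings match.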

The Serbian sample had 43 new collisions and, despite the weird tokens, they are all of the desirable type: Соw/Cow and Беларусi/Беларусі.

In general, multi-script languages that use the two scripts that we are testing for homoglyphs may sometimes generate these kinds of weird tokens, but they aren't any worse than existing multi-script tokens, and they are relatively small in number, at least in the Serbian sample.

Added Maryum's and my blurbs to my Notes pages for future reference.

Change 593833 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[operations/software/elasticsearch/plugins@master] increment extra plugin to 6.5.4-wmf-9

https://gerrit.wikimedia.org/r/593833

jayantanth removed a subscriber: jayantanth.

Change 593833 merged by Ryan Kemper:
[operations/software/elasticsearch/plugins@master] increment extra plugin to 6.5.4-wmf-9

https://gerrit.wikimedia.org/r/593833

Change 604221 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/CirrusSearch@master] Add homoglpyh plugin to French

https://gerrit.wikimedia.org/r/604221

Change 604221 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add homoglpyh plugin to French

https://gerrit.wikimedia.org/r/604221