Normalize homoglyphs in mixed-script tokens when possible
Open, Medium · Public

Description

Oзон and Озон look the same, but the first one starts with a Latin O rather than a Cyrillic О. Searching for either will not find the other. These errors are not common, but they do occur on many wikis.

We can attempt to map homoglyphs (characters that look the same, like O and О) in mixed-script tokens and additionally index any single-script variants we can generate.


Original Title: Russian characters not normalized to same form in search

Original Description:
These look the same, or at least render the same, but only one of them returns results:

a: Oзон
b: Озон

a: no results
https://ru.wikipedia.org/w/index.php?search=~O%D0%B7%D0%BE%D0%BD&ns0=1
b: has results
https://ru.wikipedia.org/w/index.php?search=~%D0%9E%D0%B7%D0%BE%D0%BD&ns0=1

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 6 2019, 11:26 PM

The first string includes two characters that are Latin, not Cyrillic.
The Cyrillic characters in contemporary Russian that have the same or similar pronunciation as in Latin would be: АаЕеКМОоТ

debt added a subscriber: debt.

@TJones this might be an interesting thing to look at! :)

I'm showing only the first O is Latin in (a), but the effect is the same—it gets no results. (Search for Latin "O" on the page and the first character of (a) will be highlighted.)

It's not really a normalization problem because the characters are not in the same character set, and we wouldn't want to generally normalize across character sets.

I have a 10% project planned to work on a plugin to look for tokens with mixed character sets, "project" them from Latin to Cyrillic, and Cyrillic to Latin, and then keep any projected tokens that come up as only one character set. So in this case, the projection to Latin wouldn't work out because lowercase "н" doesn't correspond to anything in Latin (there is actually a small Latin Capital ʜ, but it is rarely used). The projection to Cyrillic of (a) works, though, and gives an all-Cyrillic token, so I'd index (a) as both (a)—just in case it was intentional—and as (b). I haven't gotten around to it. I can re-purpose this ticket for that plugin.
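The projection idea described above can be sketched in a few lines. This is only an illustration in Python, not the plugin's code: the homoglyph table is a deliberately tiny, hypothetical subset, and the function names are made up for the example. Script detection via Unicode character names is also a simplification.

```python
import unicodedata

# Hypothetical, deliberately incomplete homoglyph table (Cyrillic -> Latin).
CYR_TO_LAT = {
    "\u0410": "A", "\u0430": "a",  # А а
    "\u0415": "E", "\u0435": "e",  # Е е
    "\u041e": "O", "\u043e": "o",  # О о
    "\u0421": "C", "\u0441": "c",  # С с
    "\u0420": "P", "\u0440": "p",  # Р р
    "\u0443": "y", "\u0445": "x",  # у х
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def script_of(ch):
    """Return 'Latin', 'Cyrillic', or None, based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    if name.startswith("LATIN"):
        return "Latin"
    return None


def scripts(token):
    """Set of scripts used in a token, ignoring digits and punctuation."""
    return {s for s in map(script_of, token) if s}


def single_script_variants(token):
    """Project a mixed Latin/Cyrillic token both ways; keep single-script results."""
    if scripts(token) != {"Latin", "Cyrillic"}:
        return []
    variants = []
    for table, target in ((LAT_TO_CYR, "Cyrillic"), (CYR_TO_LAT, "Latin")):
        projected = "".join(table.get(ch, ch) for ch in token)
        if scripts(projected) == {target}:
            variants.append(projected)
    return variants


# (a) from the task description: Latin O followed by Cyrillic з, о, н.
mixed = "O\u0437\u043e\u043d"
print(single_script_variants(mixed))
```

With this table, the projection of (a) to Cyrillic yields the all-Cyrillic (b), while the projection to Latin is discarded because "з" and "н" have no Latin counterpart in the table. The original mixed-script token would still be indexed alongside any variant.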

Here's the list of Cyrillic characters that map to Latin characters that I've run into (I search for homoglyphs and correct them sometimes on my volunteer account for "fun"). I include "к" because people use it, even though it doesn't look exactly like "k", and some, especially "Ԛ" and "Ԝ" are only convincing as normal Latin characters if you have the right fonts (which I do not on Phabricator). The set "а́е́і́о́у́" are composed (plain character "а" + combining   ́ ), but I do see them used for the precomposed Latin analogs.

аАӑӐӓӒӕӔВсСҫҪеЕѐЀёЁӗӖәӘНіІїЇјЈкКМоОӧӦрРԚѕЅТԜхХуУӯӱа́е́і́о́у́ћз

I'm also planning to work on Latin/Greek, Greek/Cyrillic, and other pairs of scripts with homoglyphs. I shudder to think whether there are any words with three or more character sets used, but at this point I wouldn't be surprised.

And, of course there is one on English Wikipedia (until someone fixes it): Kиïвсьκa—mostly Cyrillic, but the K and ï are Latin, and the κ is Greek.

I found a few more candidates on Russian Wikipedia, too, including Валерiϊвна—mostly Cyrillic, but i is Latin and ϊ is Greek. The tail on the Greek iota can be missing, depending on the font. I see it here, but not on Russian Wikipedia.

TJones renamed this task from Russian characters not normalized to same form in search to Normalize homoglyphs in mixed-script tokens when possible. May 18 2019, 8:30 AM
TJones claimed this task.
TJones triaged this task as Medium priority.
TJones updated the task description.

I'll start with Latin/Cyrillic for the hackathon, and then try to add Greek (covering Latin/Greek, Greek/Cyrillic, and maybe all three at once), and then look into other potential homoglyph script pairs.

I struggled with Java and, although I was the underdog, I made a little progress. Shifting this to a 10% project now, so I'll work on it in fits and starts in the coming months.

Mstyles claimed this task. Feb 10 2020, 6:43 PM
Mstyles added a subscriber: Mstyles.

Change 571616 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[search/extra@master] Add homoglyph plugin

https://gerrit.wikimedia.org/r/571616

Change 571616 merged by Gehel:
[search/extra@master] Add homoglyph plugin

https://gerrit.wikimedia.org/r/571616

Mstyles added a comment. Edited Apr 13 2020, 10:44 PM

From an analysis comparing the analysis chain with and without the homoglyph token filter, on a sample of 10,000 random articles for each language:

Russian was the most impacted language during testing, with 1,064 new tokens added by the plugin from a sample of 2,911,553 tokens (0.037%)
Serbian had 154 new tokens generated out of a sample of 1,396,669 tokens (0.011%)
Polish had 32 new tokens generated out of a sample of 1,559,745 tokens (0.002%)
English had 30 new tokens generated out of a sample of 3,165,891 tokens (0.001%)
French had 7 new tokens generated out of a sample of 2,711,550 tokens (0.000%)
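For anyone re-checking the percentages above, they are simply new tokens divided by sampled tokens. The short Python sketch below recomputes them from the counts quoted in this comment; the variable and function names are arbitrary.

```python
# (new tokens, sampled tokens) as quoted in the comment above.
samples = {
    "Russian": (1064, 2911553),
    "Serbian": (154, 1396669),
    "Polish": (32, 1559745),
    "English": (30, 3165891),
    "French": (7, 2711550),
}


def percent_new(new_tokens, total_tokens):
    """Share of newly generated tokens, as a percentage of the sample."""
    return 100.0 * new_tokens / total_tokens


for lang, (new, total) in samples.items():
    pct = percent_new(new, total)
    # Show the three-decimal figure used above plus a more precise value.
    print(f"{lang}: {pct:.3f}% (unrounded: {pct:.5f}%)")
```

The French figure of 0.000% is a rounding effect: 7 out of 2,711,550 is roughly 0.0003%, small but not zero.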

I also took a look at the comparisons that @Mstyles generated, focusing on the new tokens created, and the new collisions (i.e., words that are newly grouped with other words).

For English, all the new tokens are either all Cyrillic or all Latin, so that's good. There are only 8 new collisions in this sample, which are all like Frӧbel/Fröbel and Алeксандрович/Александрович, which is exactly what we want.

For French, we have a couple of weird mixed Cyrillic/Greek tokens being generated from a mixed Latin/Cyrillic/Greek token, which is weird but fine. There are no new collisions in this sample, so the impact on the full Wikipedia will be small, but it should be positive.

For Polish, we have one mixed Cyrillic/number token, generated from a mixed Latin/Cyrillic/number token, which is good. There are only 12 new collisions, and like the English sample, they are all the kind we'd want: Kozerodа/Kozeroda and комiтет/Комітет.

For Russian, we have a comparatively large number of mixed Latin/Cyrillic/number tokens that generate Latin/number or Cyrillic/number tokens, but that's fine. Russian has a lot more collisions—347—but they are all of the expected type: Сhristopher/Christopher and Беларусi/Беларусі.

For Serbian, we have about 30 unexpected mixed-script tokens! Some are homoglyphs and some are not. Because Serbian has both Cyrillic and Latin alphabets, and both are used on the wiki (with automatic transliteration between them available), we convert all Cyrillic text into Latin text as part of the stemming process, because the actual stemming only works on Latin text.

Some of the source tokens for these are non-Serbian, like Belarusian "Блакiтная", which uses Cyrillic і instead of и. Serbian uses Cyrillic и and Latin i, so it's often easier for Serbian writers to type the Latin variant, and thus we get a mixed-script input like Блакiтная. However, Serbian doesn't have я, so when Блакiтная is converted to Latin, we get blakitnaя. When we convert to a Cyrillic i in Блакітная, we get the transliterated blakіtnaя, with two Cyrillic characters. This is actually okay, because now both Блакiтная and blakіtnaя will have an underlying token in common at search time and will be able to find each other, even if their internal representation is a bit weird. That was the goal all along.

The Serbian sample had 43 new collisions and, despite the weird tokens, they are all of the desirable type: Соw/Cow and Беларусi/Беларусі.

In general, multi-script languages that use the two scripts that we are testing for homoglyphs may sometimes generate these kinds of weird tokens, but they aren't any worse than existing multi-script tokens, and they are relatively small in number, at least in the Serbian sample.

Added Maryum's and my blurbs to my Notes pages for future reference.

Change 593833 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[operations/software/elasticsearch/plugins@master] increment extra plugin to 6.5.4-wmf-9

https://gerrit.wikimedia.org/r/593833

jayantanth removed a subscriber: jayantanth.

Change 593833 merged by Ryan Kemper:
[operations/software/elasticsearch/plugins@master] increment extra plugin to 6.5.4-wmf-9

https://gerrit.wikimedia.org/r/593833

Change 604221 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/CirrusSearch@master] Add homoglpyh plugin to French

https://gerrit.wikimedia.org/r/604221

Change 604221 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add homoglpyh plugin to French

https://gerrit.wikimedia.org/r/604221

Change 614858 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/CirrusSearch@master] add homoglyph plugin for all languages

https://gerrit.wikimedia.org/r/614858

Change 614858 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] add homoglyph plugin for all languages

https://gerrit.wikimedia.org/r/614858

After the reindex started, Trey discovered during testing that the homoglyph plugin was not working in Afrikaans. Erik noted that both tokens were being output: https://phabricator.wikimedia.org/P13023. Cyrillic and Latin are only in text and not text_search, so they are not getting indexed at query time. We need to ensure that text_search gets a correct copy of the config. After that, there will need to be another reindex.

Change 635057 had a related patch set uploaded (by Mstyles; owner: Mstyles):
[mediawiki/extensions/CirrusSearch@master] homoglyph plugin fix

https://gerrit.wikimedia.org/r/635057

Change 635057 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] homoglyph plugin fix

https://gerrit.wikimedia.org/r/635057

Mentioned in SAL (#wikimedia-operations) [2020-11-11T00:46:00Z] <ryankemper> T222669 [Elasticsearch reindex] Began long-running reindex of cirrus elasticsearch for codfw, eqiad, and cloudelastic. 3 tmux sessions on ryankemper@mwmaint1002: reindex_eqiad, reindex_codfw, reindex_cloudelastic

RKemper added a subscriber: RKemper.

Waiting on long-running elasticsearch reindex

It looks like it didn't take for some of the wikis. You can check the Cirrus settings dump (e.g., for enwiki) and search the page for "text" (with quotes). The first item in the "filter" array should be "homoglyph_norm", but isn't.
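The manual check described above can also be scripted. The Python sketch below assumes you have already extracted the "analysis" section of one index from the settings dump as a dictionary; where that section sits inside the dump is not shown here, and the function name and example filter lists are made up for illustration. Only the `analyzer` → name → `filter` layout is standard Elasticsearch analysis configuration.

```python
def first_filter_is_homoglyph_norm(analysis, analyzer_name="text"):
    """Return True if the named analyzer's filter list starts with homoglyph_norm."""
    analyzer = analysis.get("analyzer", {}).get(analyzer_name, {})
    filters = analyzer.get("filter", [])
    return bool(filters) and filters[0] == "homoglyph_norm"


# Made-up example configurations, not copied from any real wiki.
reindexed = {"analyzer": {"text": {"type": "custom",
                                   "filter": ["homoglyph_norm", "lowercase"]}}}
not_reindexed = {"analyzer": {"text": {"type": "custom",
                                       "filter": ["lowercase"]}}}

print(first_filter_is_homoglyph_norm(reindexed))
print(first_filter_is_homoglyph_norm(not_reindexed))
```

Running a check like this over each wiki's dump would give a list of the wikis where the new configuration did not take.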

Most of the English wikis don't have it: enwiki, enwiktionary, enwikibooks, enwikivoyage, enwikiquote, enwikisource, and enwikiversity. However, enwikinews does. Similarly for Italian: itwiki, itwiktionary, itwikivoyage, itwikiquote, itwikisource, itwikiversity all do not have it, but itwikibooks and itwikinews do have it.

Other wikis with "monolithic" analyzers, like German don't have it because custom filters cannot be added to them. (Their "type" is "german" or something else other than "custom".) That is as expected.

... [thinking] ...

Hmmm.. I took a peek at Ryan's reindexing logs, and it looks like 108 wikis failed (across 3 data centers—though not everything failed in all three) with the text Reindex task was not successfull [sic], including the English and Italian ones above. It's complaining about token offsets overlapping, which is weird.

I pulled out a couple of page IDs from the enwiki log, and this page and this page (among others) seem to have contributed to the failure. I'll extract the text from them and see if I can see an obvious problem. They do have some mixed Latin/Cyrillic tokens, but nothing that looks too weird.

I have a hypothesis: for English, the interaction between homoglyph_norm and aggressive_splitting is causing the problem, creating an invalid token graph.

Starting with the token Tolstoу's (with Cyrillic у), we get:

tokenizer:      { "token" : "Tolstoу's", "start_offset" : 0, "end_offset" : 9 }

homoglyph_norm: { "token" : "Tolstoу's", "start_offset" : 0, "end_offset" : 9, // Cyrillic у
                  "token" : "Tolstoy's", "start_offset" : 0, "end_offset" : 9, // Latin y
                }

aggressive_splitting: {
                  "token" : "Tolstoу", "start_offset" : 0, "end_offset" : 7, //Cyrillic у
                  "token" : "s",       "start_offset" : 8, "end_offset" : 9,
                  "token" : "Tolstoy", "start_offset" : 0, "end_offset" : 7, // Latin y
                  "token" : "s",       "start_offset" : 8, "end_offset" : 9,
                }

It seems that going from a token at 8–9 and then back to 0–7 is a problem—though it doesn't seem like this could be the first time that has ever happened.
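A quick way to see what is unusual about this stream is to scan it for a start offset that is smaller than the one before it. The sketch below is in Python with a hypothetical helper name, and it assumes that the "offsets overlapping" complaint is triggered by start offsets going backwards, which this thread suggests but does not confirm.

```python
def first_backward_offset(tokens):
    """Return the index of the first token whose start offset is smaller
    than the previous token's start offset, or None if there is none."""
    last_start = 0
    for i, (_text, start, _end) in enumerate(tokens):
        if start < last_start:
            return i
        last_start = start
    return None


# Output of aggressive_splitting as listed above: (token, start, end).
# "\u0443" is Cyrillic у.
after_splitting = [
    ("Tolsto\u0443", 0, 7),
    ("s", 8, 9),
    ("Tolstoy", 0, 7),
    ("s", 8, 9),
]

print(first_backward_offset(after_splitting))
```

For the stream above, the helper flags the third token (index 2), the Latin "Tolstoy" that restarts at offset 0 after "s" was emitted at offset 8.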

Anyway, aggressive_splitting is of type word_delimiter, which is apparently deprecated in favor of word_delimiter_graph.

Using word_delimiter_graph solves the problem, but in a brute-force kind of way:

tokenizer:      { "token" : "Tolstoу's", "start_offset" : 0, "end_offset" : 9 }

homoglyph_norm: { "token" : "Tolstoу's", "start_offset" : 0, "end_offset" : 9, // Cyrillic у
                  "token" : "Tolstoy's", "start_offset" : 0, "end_offset" : 9, // Latin y
                }

aggressive_splitting: {
                  "token" : "Tolstoу", "start_offset" : 0, "end_offset" : 7, //Cyrillic у
                  "token" : "s",       "start_offset" : 8, "end_offset" : 9,
                  "token" : "Tolstoy", "start_offset" : 8, "end_offset" : 8, // Latin y
                       // TOKEN "LENGTH" == 0        ---^              ---^ 
                  "token" : "s",       "start_offset" : 8, "end_offset" : 9,
                }

So, with this configuration, searching for Tolstoy (Latin y) will find Tolstoу (Cyrillic у), but won't highlight it. See a screenshot below.

Italian also has aggressive_splitting enabled, as does the short_text analyzer everywhere (but short_text doesn't use homoglyph_norm).

We could switch to word_delimiter_graph for aggressive_splitting, or we could reconsider whether aggressive_splitting is a good idea in the text analyzer. @dcausse, I'd be interested in your opinion on this—though we can also talk about it Wednesday.

Anyway, the English and Italian configs explain the majority of the failures, but there are a few others that are not explained by this, including jvwiki/codfw, mrwiki/codfw, nnwiki/codfw, and possibly others—though it's odd that they only failed on codfw. I suppose that could be a particular update that made it to codfw and not eqiad or vice versa—but that's really threading the needle. I'll look at those others tomorrow.

dcausse added a comment. Edited Tue, Nov 24, 9:56 AM

It's hard to tell what is wrong here: is there a single component that does something badly, or is it just the combination of some filters that is not OK?
The problem seems to be related to a mix of a filter that can duplicate words (homoglyph) and a filter that can "re-tokenize" words (splitting).
It's not the first time we have a filter that duplicates words alongside aggressive_splitting (e.g. icu_folding), but that one is put after the splitting filter.
I think word_delimiter_graph is just trying to avoid the problem with offsets by changing them, so it's not fully resolving the problem (issues with highlighting). Ideally it should output the two Tolstoу before emitting the two "s", but for this it would have to look ahead all the time...
Putting homoglyph filter after anything that can split words might fix the problem, but not sure it's the right approach either.
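To make the ordering question concrete, here is a toy Python simulation of the two orders. Both filter functions are simplified stand-ins written for this example (they only handle Cyrillic у and the apostrophe) and do not reproduce the real homoglyph_norm or word_delimiter behaviour; the point is only to show how duplicate-then-split interleaves offsets while split-then-duplicate does not.

```python
def homoglyph(tokens):
    """Toy duplicating filter: emit each token, plus a Latin variant if it
    contains Cyrillic у (U+0443). Offsets are copied from the original."""
    out = []
    for text, start, end in tokens:
        out.append((text, start, end))
        variant = text.replace("\u0443", "y")
        if variant != text:
            out.append((variant, start, end))
    return out


def split_possessive(tokens):
    """Toy splitting filter: split on apostrophes, keeping source offsets."""
    out = []
    for text, start, _end in tokens:
        pos = start
        for part in text.split("'"):
            if part:
                out.append((part, pos, pos + len(part)))
            pos += len(part) + 1  # skip past the apostrophe
    return out


def offsets_go_backwards(tokens):
    """True if any token starts before the token emitted just before it."""
    starts = [start for _text, start, _end in tokens]
    return any(b < a for a, b in zip(starts, starts[1:]))


source = [("Tolsto\u0443's", 0, 9)]
dup_then_split = split_possessive(homoglyph(source))
split_then_dup = homoglyph(split_possessive(source))

print(dup_then_split, offsets_go_backwards(dup_then_split))
print(split_then_dup, offsets_go_backwards(split_then_dup))
```

In this toy model, duplicating first gives start offsets 0, 8, 0, 8, while splitting first gives 0, 0, 8, with both "Tolsto…" variants emitted before "s". Whether reordering is the right fix for the real analysis chain is a separate question, as noted above.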

Differences between eqiad & codfw are concerning indeed, are the failures also related to an offset issue?

TL;DR:

  • Inconsistencies between codfw and eqiad all come down to "failed to process cluster event"
  • There are real inconsistencies between cloudelastic and codfw/eqiad
    • How out of date is cloudelastic? It seems odd that so many would get mixed homoglyph text all at once.
  • All the offset failures are in wikis with either en or it as their language, so we need to fix the English and Italian analysis chains and reindex all wikis with those languages.
  • We need a better way to capture failures when we reindex everything.
  • There are many failures that we didn't capture. We should re-run these (either on the specific clusters or just in general), and watch for errors, especially on codfw, and look for orphaned indexes, too. @RKemper, can you take this? Do we need a new Phab ticket?
    • itwikiquote.codfw, itwikisource.codfw, lvwikibooks.codfw, mrwikibooks.codfw, srwiki.codfw, shwiki.eqiad, jvwiki.codfw, lvwiki.codfw, mrwiki.codfw, nnwiki.codfw, ruwiktionary.eqiad, ruwikinews.eqiad

@dcausse wrote:

Differences between eqiad & codfw are concerning indeed, are the failures also related to an offset issue?

Yeah, there's more going on there.

  • Some of the special wikis didn't fail on cloudelastic because they have no data there: checkuserwiki, collabwiki, legalteamwiki, officewiki, ombudsmenwiki, stewardwiki
  • Some wikis succeeded on cloudelastic but failed elsewhere: betawikiversity, itwikinews, itwikiversity, itwikivoyage, labswiki, outreachwiki, qualitywiki, simplewikibooks, test2wiki, testwiki, testwikidatawiki, usabilitywiki, wikimania2007wiki, wikimania2014wiki
  • Some failed in all three places with offset errors: commonswiki, enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary, incubatorwiki, itwiki, itwiktionary, mediawikiwiki, metawiki, simplewiki, sourceswiki, specieswiki, votewiki, wikidatawiki
  • itwikiquote.codfw, itwikisource.codfw, lvwikibooks.codfw, and mrwikibooks.codfw failed with this kind of error: Creating index...⧼failed to process cluster event (create-index [itwikiquote_content_1605418093], cause [api]) within 30s⧽

Other failures (not related to offsets):

  • 404 errors
    • srwiki.codfw: {"index":"srwiki_general_1605741158","type":"page","id":"3872682","cause":{"type":"index_not_found_exception","reason":"no such index and [action.auto_create_index] contains [-*] which forbids automatic creation of the index", "index_uuid": "_na_", "index": "srwiki_general_1605741158"}, "status": 404}
    • shwiki.eqiad: {"index": "shwiki_content_1605722294", "type": "page", "id": "2359467", "cause": {"type": "index_not_found_exception", "reason": "no such index", "index_uuid": "_na_", "index": "shwiki_content_1605722294"}, "status": 404}
  • 500 errors
    • jvwiki.codfw, lvwiki.codfw, mrwiki.codfw, nnwiki.codfw: {"index": "jvwiki_content_1605432348", "type": "page", "id": "102310", "cause": {"type": "node_not_connected_exception", "reason": "[elastic2059-production-search-codfw][10.192.32.5: 9300] Node not connected"}, "status": 500}
  • "Node not connected" errors
    • ruwiktionary.eqiad: {"shard": -1, "reason": {"type": "node_not_connected_exception", "reason": "[elastic1055-production-search-eqiad][10.64.16.131: 9300] Node not connected"}}
  • "All shards failed" error:
    • ruwikinews.eqiad: there's a stack trace and "Lost connection to elasticsearch cluster. The reindex task e4WY1V5PQoa4lnzwxmM1nQ:2095514169 is still running. The task should be manually canceled, and the index ruwikinews_general_1605705596 should be removed."

There are also errors that seemed to recover on these wikis:

  • eswiki.codfw, iowiktionary.codfw, mywiki.codfw, newwiki.codfw, ruwikinews.codfw, ruwikinews.codfw, ruwikinews.codfw, ruwikinews.codfw, sawikisource.eqiad, sawikisource.eqiad, shwiktionary.codfw, srwiki.cloudelastic, srwiki.cloudelastic, srwiki.codfw, srwiki.codfw, srwiki.codfw, srwiki.codfw, srwiki.eqiad, srwiki.eqiad, srwiktionary.codfw, srwiktionary.codfw, svwiki.codfw, svwiki.codfw, tawikisource.cloudelastic, tawikisource.cloudelastic, zhwiktionary.codfw
  • They have a line like "Error: " with no actual error, and normal task lines before and after.

There may be others. I grepped for fail and error in the logs. It's unfortunate that there is no consistent format, and either no useful return code or we don't check for it in the reindexing scripts.

There are real inconsistencies between cloudelastic and codfw/eqiad

Cloudelastic lost approximately a month's worth of updates due to the job queue growth and how long it took us to address the issue (the queue only holds jobs for 7 days, but we took most of a month to fix it). Loading cloudelastic from scratch last time didn't go so well, so we decided to let the saneitizer fix it over time. Graphs show the fixing rate on cloudelastic going down, but it peaks in the 20-40 pages fixed/sec range, vs. fix rates < 0.1/sec on eqiad/codfw.