
CRH Transliteration pattern matching fixes
Closed, Resolved · Public

Description

There are two known problems with the current pattern matching:

  1. I implemented part of the Crimean Tatar (crh) Language Converter transliteration incorrectly, with the large list of exceptions treated as whole words to be matched, rather than as patterns to match against partial words. The partial matching is necessary to deal with the large number of inflections in the language. (See T186727#3998090.)
  2. There are some very complex regexes that I converted from assuming they were running on the full text to running on individual tokens/words. They don't work quite right. (See T186727#3998090 again.)

One option for (1) is to change the implementation to be more like the JavaScript implementation, which runs all the regexes over the full text to be transliterated. However, there are a lot of regexes (3500+), and they would run on multiple strings to render a page.
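For illustration, a minimal sketch of that full-text approach (hypothetical variable names and an illustrative pattern, not the actual converter code):

```
<?php
// Sketch of the full-text approach: apply every exception regex, in
// order, to the whole text. With 3500+ patterns, that's 3500+ passes
// over each string being transliterated.
$patternToReplacement = [
	'/тюш/u' => 'tüş', // illustrative entry, not from the real list
];
$text = 'тюш';
foreach ( $patternToReplacement as $regex => $replacement ) {
	$text = preg_replace( $regex, $replacement, $text );
}
echo $text; // tüş
```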

Another option is to find a more complex but more efficient implementation that still operates on tokens/words.

Choosing between those options for (1) will determine how best to proceed for (2), and there may need to be a significant re-architecture of the transliteration as a result.

Event Timeline

Notes on options for issue (1), matching exceptions:

There are 1174 main exception mappings, which get expanded by a factor of three to account for case variation (lowercase, Capitalized, and UPPERCASE), plus a handful of other mappings. In the PHP/Language Converter implementation, text is currently tokenized into words for easier lookup and matching against this list of 3522 exceptions. The analogous JavaScript implementation uses regexes for all of these exceptions, but runs them against the entire text at once.
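For illustration, a minimal sketch of that factor-of-three case expansion (hypothetical helper, with an illustrative mapping that may not match the real list):

```
<?php
// Sketch: expand one lowercase exception into lowercase, Capitalized,
// and UPPERCASE variants, mirroring the expansion described above.
function expandCaseVariants( $from, $to ) {
	return [
		$from => $to,
		mb_convert_case( $from, MB_CASE_TITLE, 'UTF-8' ) =>
			mb_convert_case( $to, MB_CASE_TITLE, 'UTF-8' ),
		mb_strtoupper( $from, 'UTF-8' ) => mb_strtoupper( $to, 'UTF-8' ),
	];
}

print_r( expandCaseVariants( 'тюш', 'tüş' ) );
// тюш => tüş, Тюш => Tüş, ТЮШ => TÜŞ
```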

The PHP implementation needs to be re-architected to treat the exceptions as patterns. It would be extremely inefficient to run 3500 regexes against every word.

One obvious option is to run all 3500 regexes against the full text being transliterated. However, this may still be very inefficient as several bits of text are transliterated separately on the page (for example the title seems to be run more than once, and the article text gets run once), so we're looking at running 10,000 or more regexes to render a transliterated page.

Another option is to take an idea from the insource: regex search in CirrusSearch, which indexes pages by trigrams and looks for trigrams in the regex to limit the number of pages to be scanned with the regex. All but a handful of patterns are at least 3 letters long, so we could group them by the first three letters of the pattern and only run them against words that contain those initial trigrams; if a word doesn't contain the first three letters of a pattern, it can't contain the whole pattern.
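A minimal sketch of that indexing idea (hypothetical function names, and it assumes each pattern begins with three literal letters, as nearly all of them do, even though the real exceptions are regexes rather than literal strings):

```
<?php
// Sketch of the trigram-indexing idea: group patterns by their first
// three letters, then for each word only consider the patterns whose
// initial trigram actually occurs somewhere in that word.
function buildTrigramIndex( array $patterns ) {
	$index = [];
	foreach ( $patterns as $pattern => $replacement ) {
		$trigram = mb_substr( $pattern, 0, 3, 'UTF-8' );
		$index[$trigram][$pattern] = $replacement;
	}
	return $index;
}

function candidatePatterns( $word, array $index ) {
	$candidates = [];
	$len = mb_strlen( $word, 'UTF-8' );
	// A word of length n contains n - 2 trigrams.
	for ( $i = 0; $i <= $len - 3; $i++ ) {
		$trigram = mb_substr( $word, $i, 3, 'UTF-8' );
		if ( isset( $index[$trigram] ) ) {
			$candidates += $index[$trigram]; // array union
		}
	}
	// Only these patterns need to be run against $word.
	return $candidates;
}
```

The candidate patterns would still need to be run in a sensible order (longest first), which is part of the extra complexity weighed below.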

Quick analysis: The 1174 lowercase patterns have 429 unique initial Cyrillic trigrams, and 452 unique initial Latin trigrams. On average, that's two or three patterns to match per trigram. (For the Cyrillic patterns, the median is actually 1 pattern, with 95% having 10 or fewer patterns, and the max of 48 patterns for one trigram—тюш.)

A sample of 1070 words (8827 characters) from a Crimean Tatar Wikipedia article has 5232 trigrams to look up. (A word of length n contains n − 2 trigrams, so the median word length of 6 gives 4 trigrams to look up; 5232/1070 gives a mean of 4.8, so 4 or 5 on average.) Of the 1295 unique trigrams (out of the 5232 total), only 156 match a pattern's initial trigram. So, on average, each trigram occurs about 4 or 5 times, 156 will match, and each will have 1 or 2 patterns to run, giving a maximum of roughly 1560 patterns (156 trigrams × ~5 occurrences × 2 patterns) to try for the 1070-word document. I think it's okay to extrapolate from the lowercase-only numbers because the lowercase, Capitalized, and UPPERCASE patterns don't overlap, since they all differ in their first three letters.

These estimates will not survive encountering Zipf's law, but they probably give us the right order of magnitude. If I pursue this approach, I'll actually run some example texts, including this specific text, and see how good these back-of-the-envelope estimates really are.

In this case, for the main article text, we can match about 3500 regexes, each against an 8800-character string (and hope the regex engine is smart enough to do it quickly), or we can match fewer than about 1500 regexes, each against an average token of fewer than 10 characters, plus all the overhead of indexing the patterns, pulling out the trigrams, deciding in what order to run the regexes (for example, we should run longer patterns first, regardless of which trigram they come from), etc.

It may also be possible to do a separate analysis of the patterns and remove any redundancies—there may be shorter patterns that do the job of longer patterns just as well. This could remove only a handful of patterns (we know there are some), or a significant number of patterns. I'm not sure.

I spoke to some other developers on the search team, including folks who are more knowledgeable about PHP and how our servers are configured. It may be possible to use a giant regex (which would be parsed for maximum efficiency and cached by PHP) and preg_replace_callback() to implement the equivalent of preg_replace_callback_array() (which is available in PHP 7, so not yet available for MediaWiki).
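A minimal sketch of that idea, with illustrative mappings rather than the real 3500-entry list (the real patterns are also more than literal strings):

```
<?php
// Sketch: combine many exception patterns into one giant alternation,
// and let a single preg_replace_callback() look up the replacement for
// whichever alternative matched, approximating PHP 7's
// preg_replace_callback_array() on earlier PHP versions.
$exceptions = [
	'тюш' => 'tüş', // illustrative entries, not from the real list
	'юрт' => 'yurt',
];

// Longest patterns first, so they take precedence in the alternation.
uksort( $exceptions, function ( $a, $b ) {
	return mb_strlen( $b, 'UTF-8' ) - mb_strlen( $a, 'UTF-8' );
} );

$alternation = implode( '|', array_map( function ( $p ) {
	return preg_quote( $p, '/' );
}, array_keys( $exceptions ) ) );

echo preg_replace_callback( "/$alternation/u", function ( $m ) use ( $exceptions ) {
	return $exceptions[$m[0]];
}, 'тюш юрт' ); // tüş yurt
```

As it turns out below, compiling a single alternation over the full exception list can exceed PCRE's compiled-pattern size limit, so this sketch only scales so far.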

When I get back to this, I'll run some tests on the various options and see how quickly I can find a happy medium between reasonable performance and difficulty of implementation. (It's also possible I'm worrying a bit too much about using so many regexes, but I don't want to build something super inefficient.)

Ugh. I got a decent version of (1) here and of T189512 working yesterday. I was still struggling with Roman numerals (the anchoring of some particular regular expressions was misbehaving), and some of the other regexes still needed to be re-integrated, but it was working. Then I ran the unit tests and got an unexpected failure: "Compilation failed: regular expression is too large at offset 55600".

Apparently, the giant regex I used with preg_replace_callback() is big enough to exceed PCRE's compiled-pattern size limit. This is particularly concerning because it seemed to work fine on my laptop running a small local wiki, but failed when I ran the unit tests. That makes it seem unreliable and prone to unexpected failure in different environments.

So I guess I'm back to either running all the regexes (which is easy to implement and easy to order, and can work on the whole text at once, but may be too slow) or adding the complexity of indexing the patterns (which requires an order of magnitude fewer regexes to run, but is generally complex, requires tokenization, and makes pattern ordering difficult). Ugh, indeed.

Change 424738 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] CRH Transliteration Pattern Matching Fixes

https://gerrit.wikimedia.org/r/424738

@DonAlessandro, I'd appreciate it if you could review the patch above for linguistic improvements. If you have any questions or suggestions, let me know!

I will review it by tomorrow afternoon (UTC+3).
Is it OK that all the exceptions are listed as 'кириллица' => 'latin', even those that are actually used in the Latin => Cyrillic transliteration?

> I will review it by tomorrow afternoon (UTC+3).

Thanks!

> Is it OK that all the exceptions are listed as 'кириллица' => 'latin', even those that are actually used in the Latin => Cyrillic transliteration?

It should be. Because of the misunderstanding I originally had about the substitutions being for whole words, it made sense to map them in both directions since there was a lot of overlap. I refactored and de-duplicated everything and merged both directions into one list. If it doesn't make sense in both directions as patterns, then I can refactor it again.

I'm hoping that won't be necessary, but if it is, I think we should open another ticket and try to get this out as an improvement. Almost everything on your sample page now transliterates correctly. The last few remaining differences seem to be individual words or details of patterns, not whether the patterns in general are being applied correctly.

(Again, I'm sorry this has been such a complicated process and that it has taken so long. We are getting there!)

Sorry, it will be Wednesday afternoon. I haven't got enough time today.

Okay... a plausibly final version of the patch is up. Just waiting for code review. Thanks to @DonAlessandro for all the help getting everything just right.

Change 424738 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] CRH Transliteration Pattern Matching Fixes

https://gerrit.wikimedia.org/r/424738

Change 424738 merged by jenkins-bot:
[mediawiki/core@master] CRH Transliteration Pattern Matching Fixes

https://gerrit.wikimedia.org/r/424738

Confirmed working on the live CRH Wikipedia.