Investigate Unpacking Ukrainian Analyzer
Closed, Resolved | Public | 8 Estimated Story Points

Description

See parent task for details.

Note that the Ukrainian analyzer is third-party, and so may have limitations on unpacking, depending on what's available in the plugin.

(Ukrainian was prioritized because it aligns with OKRs.)

Event Timeline

TJones set the point value for this task to 5. (Sep 26 2022, 4:00 PM)
TJones changed the point value for this task from 5 to 8. (Oct 25 2022, 7:18 PM)

So... the components of the analyzer are all defined together in one object, and the elements are all clear in the code: standard tokenizer, lowercase, stopwords, and stemmer, along with a pre-tokenization char filter on line 50. The stopwords are available as a plaintext file, and the dictionary used for the stemmer has been extracted out into its own separate artifact.

I did a quick analysis of a small set of Ukrainian Wikipedia articles (just 500), and there are definitely some small fixes to be made: Cyrillic-Latin homoglyphs (especially Ukrainian і vs. Latin i) and bidi markers on Arabic and Latin tokens, plus the usual suspects for ICU folding.

David and I talked about it and reviewed the code and other resources earlier today, and it looks like the best thing to do in the short term is fork the project for the analyzer and strip it down to just the stemmer. The other non-standard components—stopwords and char filter—can be recreated in Cirrus, like we have for other analyzers. And of course we already have the infrastructure for supporting our own plugins in general.
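For illustration only, here is a rough sketch of what an unpacked chain could look like as Elasticsearch analysis settings, written as a Python dict. The built-in pieces (`standard` tokenizer, `lowercase`, a `stop` filter, a `mapping` char filter) are standard Elasticsearch components; the names `ukrainian_charfilter`, `ukrainian_stop`, and `ukrainian_stemmer` are hypothetical, the stopword list is a placeholder, and the char filter shows only the ґ/г mapping, so this is not the actual Cirrus config.

```python
import json

# Sketch of an unpacked Ukrainian analysis chain (assumed names throughout).
ukrainian_analysis = {
    "char_filter": {
        # Hypothetical name; recreates part of the pre-tokenization char filter.
        "ukrainian_charfilter": {
            "type": "mapping",
            "mappings": ["\u0491=>\u0433", "\u0490=>\u0413"],  # ґ=>г, Ґ=>Г
        }
    },
    "filter": {
        # Hypothetical name; placeholder words, not the real stopword list.
        "ukrainian_stop": {
            "type": "stop",
            "stopwords": ["\u0456", "\u0432", "\u043d\u0430"],
        }
    },
    "analyzer": {
        "text": {
            "type": "custom",
            "char_filter": ["ukrainian_charfilter"],
            "tokenizer": "standard",
            # "ukrainian_stemmer" stands in for whatever filter name the
            # stripped-down stemmer plugin would expose (an assumption).
            "filter": ["lowercase", "ukrainian_stop", "ukrainian_stemmer"],
        }
    },
}

settings_json = json.dumps({"analysis": ukrainian_analysis}, ensure_ascii=False, indent=2)
```

Once the pieces are separate like this, other filters can be slotted in between them, which is the point of unpacking.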

Since the stemming dictionary is a separate component, our new stemmer plugin would get most of the benefit of any likely future updates from there. I'm not too concerned about the stopword list—while it could be refined a bit in the future, the main list of common stopwords isn't going to change.

Longer term, we could make some upstream changes—the most obvious of which would be exposing all the components so anyone could customize, like we want to be able to do—but that's not in scope for this task and such changes wouldn't solve our immediate problem, since we are on ES 7.10 and not continuing with later versions of Elastic.

The other option is to skip Ukrainian, but David and I are both averse to that, since it would leave it out of any future generic analysis improvements.

I'm upping the estimate from 5 to 8 to reflect the broader scope of the task now.

Change 851086 had a related patch set uploaded (by Tjones; author: Tjones):

[search/extra@master] Build Ukrainian Stemmer Plugin

https://gerrit.wikimedia.org/r/851086

Change 851086 merged by jenkins-bot:

[search/extra@master] Build Ukrainian Analysis Plugin

https://gerrit.wikimedia.org/r/851086

Full write-up on MediaWiki:

Lowercasing and Multiple tokens

  • A small number of input tokens generate multiple stemmer output tokens
  • Stemmer output is sometimes capitalized
  • Sometimes, the multiple output tokens differ only by capitalization
  • Re-lowercasing and deduplicating removes about 5% of tokens!
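A minimal sketch of the re-lowercase-and-deduplicate step, assuming the list holds all stemmer outputs for a single input token. The function name and the example forms are made up for illustration; they are not real stemmer output.

```python
def relowercase_and_dedupe(stemmer_outputs):
    """Lowercase each stemmer output and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for token in stemmer_outputs:
        lowered = token.lower()
        if lowered not in seen:
            seen.add(lowered)
            result.append(lowered)
    return result


# Invented example: two outputs differ only by capitalization.
outputs = ["Київ", "київ", "києв"]
deduped = relowercase_and_dedupe(outputs)
```

Here the capitalized and lowercase variants collapse into one token, so three outputs become two.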

Homoglyphs

  • There are lots of mixed-script tokens with homoglyphs in them (Ukrainian і and ї are hard to type), and the homoglyph filter groups them with their fully Cyrillic counterparts!
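To make the idea concrete, here is a simplified sketch of homoglyph normalization: if a token mixes Cyrillic and Latin letters and every Latin letter has a Cyrillic look-alike, produce the all-Cyrillic form. The mapping table is a small illustrative subset and the function names are hypothetical; this is not the production homoglyph filter.

```python
import unicodedata

# Small illustrative subset of Latin -> Cyrillic look-alikes (lowercase only).
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "i": "\u0456",
    "o": "\u043e", "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "\u00ef": "\u0457",  # Latin i with diaeresis -> Cyrillic yi
}


def script_of(ch):
    """Classify a character as 'cyrillic', 'latin', or 'other' by its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def to_cyrillic(token):
    """Return the all-Cyrillic form of a mixed-script token, or None if not applicable."""
    scripts = {script_of(ch) for ch in token}
    if not {"cyrillic", "latin"} <= scripts:
        return None  # not mixed-script, nothing to do
    converted = []
    for ch in token:
        if script_of(ch) == "latin":
            if ch not in HOMOGLYPHS:
                return None  # a Latin letter with no look-alike: leave token alone
            converted.append(HOMOGLYPHS[ch])
        else:
            converted.append(ch)
    return "".join(converted)


# Cyrillic "в" and "к" with Latin "i" typed in place of Ukrainian "і".
mixed = "\u0432i\u043ai"
fixed = to_cyrillic(mixed)
```

Indexing the converted form alongside or instead of the original is what lets the mixed-script token match its fully Cyrillic counterpart.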

ICU Folding

  • No exceptions enabled
    • й/и and ї/і are all in the Ukrainian alphabet, but folding them causes very few mergers, and most are obviously good (typos, or inconsistently transliterated names)
    • ґ/г is already folded by the Ukrainian analyzer, so I didn't mess with it. (ґ was only added back to the alphabet in 1990!)
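As a rough stand-in for the folding described above (real ICU folding covers far more than this), the sketch below strips combining marks after Unicode NFD decomposition. That handles й/и and ї/і because both decompose into a base letter plus a combining mark, whereas ґ has no decomposition and needs an explicit mapping. The function names are hypothetical.

```python
import unicodedata

# ґ/Ґ have no Unicode decomposition, so they need an explicit mapping to г/Г.
EXPLICIT_FOLDS = {0x0491: 0x0433, 0x0490: 0x0413}


def strip_diacritics(text):
    """Decompose to NFD and drop combining marks (й -> и, ї -> і)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text):
    """Apply the explicit ґ -> г mapping, then strip diacritics."""
    return strip_diacritics(text.translate(EXPLICIT_FOLDS))


folded_city = fold("Київ")    # ї loses its diaeresis
folded_ghe = fold("ґанок")    # ґ is mapped explicitly
```

With no exceptions enabled, all of these letters are folded; adding an exception would mean skipping the fold for a chosen letter such as й.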

Patch with config updates and tests coming soon.

Change 858373 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack and Upgrade Ukrainian Analysis Chain

https://gerrit.wikimedia.org/r/858373

Change 858373 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack and Upgrade Ukrainian Analysis Chain

https://gerrit.wikimedia.org/r/858373