Unpack Hindi, Irish, Norwegian Elasticsearch Analyzers
Closed, Resolved | Public | 5 Estimated Story Points

Description

See parent task for details.

(These were chosen next more or less at random.)

Event Timeline

TJones triaged this task as High priority. (Aug 24 2021, 6:58 PM)
TJones created this task.
TJones set the point value for this task to 5.

Change 732108 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Hindi, Irish, and Norwegian Analyzers

https://gerrit.wikimedia.org/r/732108

Irish was a bit eventful:

  • It didn't have the usual dotted İ regression because we keep the Irish-specific lowercasing and add ICU normalization (rather than having it replace lowercasing).
  • I ran into a few instances of older orthography, using dotted characters like ḃ, which are written as bh in modern orthography. I added a character filter to do the mapping (ḃ => bh, etc.), and it has a small but positive impact.
  • Because of the way my script counts things, when small groups and large groups merged in Irish (which happened a lot), the large group was often counted as merging into the small group, inflating the merger impact numbers.

Hindi had a ton of bidirectional invisibles, and very little impact from ICU normalization or folding, because they don't do much to Devanagari script characters.
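
To illustrate why bidirectional invisibles matter for search, here is a small Python sketch. The comment above does not say which characters were involved or how they were handled, so the character set (the Unicode Bidi_Control code points) and the helper name strip_bidi are assumptions for illustration only.

```python
# Assumption: "bidirectional invisibles" means the Unicode Bidi_Control
# characters (ALM, LRM, RLM, the embedding/override controls, and the
# isolate controls). The source does not list them.
BIDI_CONTROLS = {
    0x061C,                   # ARABIC LETTER MARK
    0x200E, 0x200F,           # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    *range(0x202A, 0x202F),   # LRE, RLE, PDF, LRO, RLO
    *range(0x2066, 0x206A),   # LRI, RLI, FSI, PDI
}

# str.translate deletes any code point that maps to None.
_STRIP_TABLE = {cp: None for cp in BIDI_CONTROLS}


def strip_bidi(text):
    """Remove bidirectional control characters (hypothetical helper)."""
    return text.translate(_STRIP_TABLE)


word = "हिन्दी"
marked = "\u200e" + word + "\u200f"  # looks identical when rendered

# The invisible marks make the strings unequal, so an exact token
# match would fail until they are stripped.
print(marked == word)              # False
print(strip_bidi(marked) == word)  # True
```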

Norwegian was much more typical all around.

Full write-up on MediaWiki.

I also did some refactoring of the AnalysisConfigBuilder code to normalize the handling of additional somewhat atypical filters in Hindi and (retroactively) German.

Given that ICU normalization and folding didn't have much effect on Devanagari script characters, does it make sense to lower the priority of unpacking other languages that use the same script? Is this true for other non-Latin scripts as well?

Good question! There aren't any other languages on the unpacking list that use Devanagari, and the situation is definitely different for other non-Latin scripts.

In general, the purpose of unpacking is to pay down some technical debt and enable generic "upgrades" that we've implemented. Examples are homoglyphs (implemented globally) and word_break_helper, which breaks tokens on underscores, periods, and parens; it is currently implemented piecemeal but should be helpful everywhere.
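
As a rough Python sketch of what word_break_helper does: the real one is an Elasticsearch character filter in CirrusSearch, and its exact mapping is not shown here. The sketch assumes it simply turns underscores, periods, and parens into spaces before tokenizing; the tokenize function is a stand-in for the real tokenizer.

```python
# Assumption: each listed character is replaced by a space, so that a
# whitespace-based tokenizer splits on it.
_WORD_BREAK_TABLE = str.maketrans({"_": " ", ".": " ", "(": " ", ")": " "})


def word_break_helper(text):
    """Replace underscores, periods, and parens with spaces (sketch)."""
    return text.translate(_WORD_BREAK_TABLE)


def tokenize(text):
    """Stand-in tokenizer: apply the character mapping, split on whitespace."""
    return word_break_helper(text).split()
```

For example, tokenize("foo_bar.baz(qux)") gives ["foo", "bar", "baz", "qux"] instead of one long token.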

ICU normalization (less aggressive) is generally useful, and ICU folding (more aggressive) is useful if customized when necessary. ICU folding has the biggest impact when unpacking, but it's a bonus improvement over the main goal of unpacking.

ICU folding is also most helpful on a given wiki for foreign words, because it reduces "foreign" characters to more basic forms, which you are more likely to be able to type—and because the characters are foreign and thus uncommon, the conflation is unlikely to lead to too many false positives.

So, it's still useful on Hindi Wikipedia. It seems that names are generally given in transliteration, so the page on François Englert (the first François I found) is titled फ्रांसोवा आंगलेया (a rough phonetic transliteration). His name is given in Latin script in the article as François Englert, so searching for that will find him. However, if you can't type that ç—and search for Francois Englert—you won't find him. ICU folding will fix that!

Finally, the analysis so far is based on the content of Hindi Wikipedia, not on queries. I would not be surprised if Latin script is more common in queries than in content (and I also wouldn't be surprised if it is not). So there could still be a useful improvement in zero-results rate after this is deployed if people search for things like Francois Englert, Gerhard Schroder, and Nguyen Xuan Phuc (all zero-results) instead of François Englert, Gerhard Schröder, and Nguyễn Xuân Phúc (all get hits).
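
The effect on those three names can be approximated in Python. This is not ICU folding itself, which covers far more cases; it is a simplified stand-in that decomposes characters, drops combining marks, and case-folds. The name approx_fold is hypothetical.

```python
import unicodedata


def approx_fold(text):
    """Rough approximation of accent folding (not the real ICU folding).

    1. NFKD splits characters like ç into a base letter plus combining marks.
    2. Combining marks (non-zero combining class) are dropped.
    3. casefold() removes case distinctions.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


pairs = [
    ("François Englert", "Francois Englert"),
    ("Gerhard Schröder", "Gerhard Schroder"),
    ("Nguyễn Xuân Phúc", "Nguyen Xuan Phuc"),
]

for indexed, typed in pairs:
    # Both spellings reduce to the same folded form, so they would match.
    print(approx_fold(indexed) == approx_fold(typed))  # True for each pair
```

Because the indexed text and the query are folded the same way, the unaccented query matches the accented name in the article.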

Change 732108 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Hindi, Irish, and Norwegian Analyzers

https://gerrit.wikimedia.org/r/732108