See parent task for details.
(These were chosen next more or less at random.)
Project | Branch | Lines +/- | Subject |
---|---|---|---|
mediawiki/extensions/CirrusSearch | master | +809 -45 | Unpack Hindi, Irish, and Norwegian Analyzers |
Status | Subtype | Assigned | Task |
---|---|---|---|
Open | | None | T219550 [EPIC] Harmonize language analysis across languages |
Open | | None | T272606 [EPIC] Unpack all Elasticsearch analyzers |
Resolved | | TJones | T289612 Unpack Hindi, Irish, Norwegian Elasticsearch Analyzers |
Resolved | | TJones | T294257 Reindex Hindi, Irish, Norwegian wikis to enable unpacked versions |
Change 732108 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Unpack Hindi, Irish, and Norwegian Analyzers
Irish was a bit eventful—
Hindi had a ton of bidirectional invisibles, and very little impact from ICU normalization or folding, because they don't do much to Devanagari script characters.
Norwegian was much more typical all around.
Full write-up on MediaWiki.
I also did some refactoring of the AnalysisConfigBuilder code to normalize the handling of additional somewhat atypical filters in Hindi and (retroactively) German.
Given that ICU normalization and folding didn't have much effect on Devanagari script characters, does it make sense to lower the priority of unpacking other languages that use the same script? Is this true for other non-Latin scripts as well?
Good question! There aren't any other languages on the unpacking list that use Devanagari, and the situation is definitely different for other non-Latin scripts.
In general, the purpose of unpacking is to pay down some technical debt and enable generic "upgrades" that we've implemented (like homoglyphs—implemented globally—and word_break_helper, which breaks tokens on underscores, periods, and parens, and is currently implemented piecemeal but should be helpful everywhere).
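As a rough illustration of what word_break_helper does, here is a minimal Python sketch. It assumes the filter simply maps underscores, periods, and parens to spaces before tokenizing; the function names and the whitespace tokenizer are hypothetical stand-ins, not the actual CirrusSearch or Elasticsearch configuration.

```python
# Hypothetical stand-in for a word_break_helper-style character mapping.
# Assumption: the characters below are replaced by spaces before tokenizing.
WORD_BREAK_CHARS = "_.()"
_BREAK_TABLE = str.maketrans({ch: " " for ch in WORD_BREAK_CHARS})


def word_break_helper(text: str) -> str:
    """Replace underscores, periods, and parens with spaces."""
    return text.translate(_BREAK_TABLE)


def tokenize(text: str) -> list[str]:
    """Toy whitespace tokenizer applied after the character mapping."""
    return word_break_helper(text).split()


if __name__ == "__main__":
    print(tokenize("word_break_helper"))
    print(tokenize("AnalysisConfigBuilder.php"))
```

Without the mapping, a whitespace tokenizer would keep each of these strings as a single token, so a search for one of the component words would not match it.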
ICU normalization (less aggressive) is generally useful, and ICU folding (more aggressive) is useful as long as it is customized where necessary. ICU folding has the biggest impact when unpacking, but it's a bonus improvement over the main goal of unpacking.
ICU folding is also most helpful on a given wiki for foreign words, because it reduces "foreign" characters to more basic forms, which you are more likely to be able to type—and because the characters are foreign and thus uncommon, the conflation is unlikely to lead to too many false positives.
So, it's still useful on Hindi Wikipedia. It seems that names are generally given in transliteration, so the page on François Englert (the first François I found) is titled फ्रांसोवा आंगलेया (a rough phonetic transliteration). His name is given in Latin script in the article as François Englert, so searching for that will find him. However, if you can't type that ç—and search for Francois Englert—you won't find him. ICU folding will fix that!
Finally, the analysis so far is based on the content of Hindi Wikipedia. I would not be surprised if Latin script queries are more common (and I also wouldn't be surprised if they are not). So there could still be a useful improvement in zero-results rate after this is deployed if people search for things like Francois Englert, Gerhard Schroder, and Nguyen Xuan Phuc (all zero-results) instead of François Englert, Gerhard Schröder, and Nguyễn Xuân Phúc (all get hits).
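The effect on the example names can be approximated with Python's standard library. This is only a sketch and not the real ICU folding filter: it assumes that decomposing characters (NFKD), dropping marks from the Combining Diacritical Marks block (U+0300 to U+036F), and case-folding is close enough for these Latin-script names. The function name `approx_fold` is hypothetical. Restricting the stripped range to that block is a simplification chosen so that Devanagari signs pass through untouched.

```python
import unicodedata


def approx_fold(text: str) -> str:
    """Rough approximation of diacritic folding for Latin-script text.

    Assumption: NFKD decomposition, removal of combining marks in
    U+0300..U+036F, and case folding. Real ICU folding covers far more.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")
    return stripped.casefold()


if __name__ == "__main__":
    pairs = [
        ("Francois Englert", "François Englert"),
        ("Gerhard Schroder", "Gerhard Schröder"),
        ("Nguyen Xuan Phuc", "Nguyễn Xuân Phúc"),
    ]
    for query, in_article in pairs:
        # After folding, the typed query and the article text should match.
        print(query, "matches:", approx_fold(query) == approx_fold(in_article))
```

Under these assumptions, the unaccented query and the accented form in the article reduce to the same string, which is the conflation that lets a query like Francois Englert find the article.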
Change 732108 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Unpack Hindi, Irish, and Norwegian Analyzers