Unpack Hindi, Irish, Norwegian Elasticsearch Analyzers
Closed, Resolved | Public | 5 Estimated Story Points
Authored By

	TJones
	Aug 24 2021, 6:58 PM

Description

See parent task for details.

(These were chosen next more or less at random.)

Details

	Subject	Repo	Branch	Lines +/-
	Unpack Hindi, Irish, and Norwegian Analyzers	mediawiki/extensions/CirrusSearch	master	+809 -45

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T289612 Unpack Hindi, Irish, Norwegian Elasticsearch Analyzers
Resolved	TJones	T294257 Reindex Hindi, Irish, Norwegian wikis to enable unpacked versions

Event Timeline

TJones triaged this task as High priority. Aug 24 2021, 6:58 PM

TJones created this task.

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers.

TJones set the point value for this task to 5.

TJones moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Aug 24 2021, 7:00 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Aug 24 2021, 8:50 PM

Change 732108 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Hindi, Irish, and Norwegian Analyzers

https://gerrit.wikimedia.org/r/732108

gerritbot added a project: Patch-For-Review. Oct 20 2021, 2:08 AM

Irish was a bit eventful—

It didn't have the usual dotted İ regression because we keep the Irish-specific lowercasing and add ICU normalization (rather than having it replace lowercasing).
I ran into a few instances of older orthography, using dotted characters like ḃ, which are written as bh in modern orthography. I added a character filter to do the mapping (ḃ => bh, etc.), and it has a small but positive impact.
Because of the way my script counts things, when small groups and large groups merged in Irish (which happened a lot), the large group was often counted as merging into the small group, inflating the merger impact numbers.
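
As a rough illustration of the older-orthography mapping mentioned above (ḃ => bh, etc.), here is a minimal Python sketch. It is not the actual CirrusSearch character filter: the set of nine dotted consonants, the handling of uppercase, and the function name modernize_irish are all assumptions made for the example.

```python
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"

def build_dotted_consonant_table():
    """Map dotted consonants to 'consonant + h' (assumed set: b c d f g m p s t)."""
    table = {}
    for base in "bcdfgmpst":
        # Compose base letter + combining dot into the single precomposed character.
        lower = unicodedata.normalize("NFC", base + COMBINING_DOT_ABOVE)
        upper = unicodedata.normalize("NFC", base.upper() + COMBINING_DOT_ABOVE)
        table[ord(lower)] = base + "h"
        table[ord(upper)] = base.upper() + "h"
    return table

DOTTED_TABLE = build_dotted_consonant_table()

def modernize_irish(text):
    """Hypothetical helper: rewrite dotted consonants in modern orthography."""
    # Normalize first so 'b' + combining dot is treated the same as precomposed ḃ.
    return unicodedata.normalize("NFC", text).translate(DOTTED_TABLE)

if __name__ == "__main__":
    print(modernize_irish("an \u1e03ean"))  # dotted b followed by "ean"
```

Applying the mapping before tokenization means the older and modern spellings of a word end up as the same token, which is the kind of small positive impact described above.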

Hindi had a ton of bidirectional invisibles, and very little impact from ICU Normalization or folding, because they don't do much to Devanagari script characters.
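
To make "bidirectional invisibles" concrete, the sketch below strips the common Unicode bidi control characters from a string in Python. Which characters actually showed up in the Hindi data, and which part of the analysis chain removes them, is not stated here, so the code point list and the name strip_bidi_invisibles are assumptions for illustration only.

```python
# Assumed set of invisible bidirectional control characters.
BIDI_INVISIBLES = {
    0x200E, 0x200F,                          # LRM, RLM
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # LRE, RLE, PDF, LRO, RLO
    0x2066, 0x2067, 0x2068, 0x2069,          # LRI, RLI, FSI, PDI
}

# Mapping a code point to None tells str.translate to delete it.
STRIP_TABLE = dict.fromkeys(BIDI_INVISIBLES)

def strip_bidi_invisibles(text):
    """Hypothetical helper: drop invisible bidi controls so tokens compare equal."""
    return text.translate(STRIP_TABLE)

if __name__ == "__main__":
    with_marks = "\u200e\u0939\u093f\u0928\u094d\u0926\u0940\u200f"
    print(len(with_marks), len(strip_bidi_invisibles(with_marks)))
```

Without this kind of cleanup, a word with a stray left-to-right mark attached is a different token from the same word without it, even though the two look identical on screen.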

Norwegian was much more typical all around.

Full write-up on MediaWiki.

I also did some refactoring of the AnalysisConfigBuilder code to normalize the handling of additional somewhat atypical filters in Hindi and (retroactively) German.

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Oct 20 2021, 9:43 PM

Given that ICU normalization and folding didn't have much effect on Devanagari script characters, does it make sense to lower the priority of unfolding other languages that use the same script? Is this true for other non-Latin scripts as well?

In T289612#7448960, @MPhamWMF wrote:

Given that ICU normalization and folding didn't have much effect on Devanagari script characters, does it make sense to lower the priority of unfolding other languages that use the same script? Is this true for other non-Latin scripts as well?

Good question! There aren't any other languages on the unpacking list that use Devanagari, and the situation is definitely different for other non-Latin scripts.

In general, the purpose of unpacking is to pay down some technical debt and enable generic "upgrades" that we've implemented (like homoglyphs—implemented globally—and word_break_helper, which breaks tokens on underscores, periods, and parens, and is currently implemented piecemeal but should be helpful everywhere).
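
For readers unfamiliar with word_break_helper, the behavior described above (breaking tokens on underscores, periods, and parens) can be approximated in a few lines of Python. This is only a sketch: the real filter lives in the Elasticsearch analysis config, and the whitespace tokenizer used here is a stand-in assumption, not the tokenizer any wiki actually uses.

```python
# Characters that should act as token boundaries, per the description above.
WORD_BREAK_TABLE = str.maketrans({"_": " ", ".": " ", "(": " ", ")": " "})

def word_break_helper(text):
    """Replace underscores, periods, and parens with spaces before tokenizing."""
    return text.translate(WORD_BREAK_TABLE)

def tokenize(text):
    """Stand-in tokenizer (assumption): split on whitespace after the char filter."""
    return word_break_helper(text).split()

if __name__ == "__main__":
    print(tokenize("AnalysisConfigBuilder.php (hindi_analyzer)"))
```

Because the replacement happens before tokenization, a search for one part of an identifier such as hindi_analyzer can match without the user typing the whole string.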

ICU normalization (less aggressive) is generally useful, and ICU folding (more aggressive) is useful if customized when necessary. ICU folding has the biggest impact when unpacking, but it's a bonus improvement over the main goal of unpacking.

ICU folding is also most helpful on a given wiki for foreign words, because it reduces "foreign" characters to more basic forms, which you are more likely to be able to type—and because the characters are foreign and thus uncommon, the conflation is unlikely to lead to too many false positives.

So, it's still useful on Hindi Wikipedia. It seems that names are generally given in transliteration, so the page on François Englert (the first François I found) is titled फ्रांसोवा आंगलेया (a rough phonetic transliteration). His name is given in Latin script in the article as François Englert, so searching for that will find him. However, if you can't type that ç—and search for Francois Englert—you won't find him. ICU folding will fix that!

Finally, the analysis so far is based on the content of Hindi Wikipedia. I would not be surprised if Latin script is more common in queries than in the content (and I also wouldn't be surprised if it is not). So there could still be a useful improvement in zero-results rate after this is deployed if people search for things like Francois Englert, Gerhard Schroder, and Nguyen Xuan Phuc (all zero-results) instead of François Englert, Gerhard Schröder, and Nguyễn Xuân Phúc (all get hits).
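
The effect on those three names can be approximated with the Python standard library. This is not ICU folding itself, which covers far more than accents; it is a simplified stand-in that decomposes characters, drops marks from the Combining Diacritical Marks block (U+0300 to U+036F), and case-folds. The function names approx_fold and would_match are hypothetical.

```python
import unicodedata

def approx_fold(text):
    """Rough stand-in for ICU folding (assumption): strip Latin-style diacritics."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Only drop marks in U+0300..U+036F, so Devanagari signs such as the
    # virama are left alone.
    kept = "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))
    return kept.casefold()

def would_match(query, indexed_text):
    """Compare query and indexed text after applying the same folding to both."""
    return approx_fold(query) == approx_fold(indexed_text)

if __name__ == "__main__":
    pairs = [
        ("Francois Englert", "François Englert"),
        ("Gerhard Schroder", "Gerhard Schröder"),
        ("Nguyen Xuan Phuc", "Nguyễn Xuân Phúc"),
    ]
    for query, indexed in pairs:
        print(query, "->", would_match(query, indexed))
```

The key point is that the same folding is applied at index time and at query time, so the unaccented query and the accented article text reduce to the same form.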