
[EPIC] Unpack all Elasticsearch analyzers
Open, HighPublic

Description

User Story: As a search engineer, I want to have general improvements apply to analysis chains for all languages, and to be able to customize and improve individual analysis chains, so that search can incrementally improve for all users.

The intent is to unpack the analyzers (i.e., converting language-specific monolithic analyzers to their constituent parts, based on their breakdown from Elastic) and then apply general improvements (upgrading lowercasing to ICU norm and adding homoglyph norm, for example), test the results, deploy the changes, and re-index.
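As a rough sketch of what "unpacking" means, consider Elastic's documented breakdown of the monolithic portuguese analyzer, rewritten as an equivalent custom analyzer. This is illustrative only (expressed here as a Python settings dict, not the actual CirrusSearch output):

```python
# Sketch of unpacking the monolithic "portuguese" analyzer into its
# documented constituent parts (standard tokenizer + lowercase + stop +
# stemmer). Once written out like this, each part can be swapped or extended.
unpacked_portuguese = {
    "analysis": {
        "filter": {
            "portuguese_stop": {"type": "stop", "stopwords": "_portuguese_"},
            "portuguese_stemmer": {"type": "stemmer", "language": "light_portuguese"},
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "portuguese_stop", "portuguese_stemmer"],
            }
        },
    }
}
```

With the chain written out explicitly, upgrading lowercase to ICU normalization or inserting a new filter becomes a one-line change, which is impossible with the monolithic analyzer.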

There are 27 analyzers to unpack, and there seem to be two more analyzers to enable (which will require more testing, since we’d be enabling new stemmers). Re-indexing takes time, too. Because of the number of analyzers to work on, this task is probably "Epic-ish".

Details: There are 25 Wikipedias using the default Elasticsearch analyzers. All of these need to be unpacked: Arabic, Armenian, Basque, Bulgarian, Catalan, Czech, CJK for Japanese, Danish, Dutch, Finnish, Galician, German, Hindi, Hungarian, Irish, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Sorani, Spanish, Thai, Turkish. There is also a "Brazilian" analyzer from Elastic, for Brazilian Portuguese, which is used by br.wikimedia.org.

There is one non-Elastic monolithic analyzer plugin that we have deployed that should be unpackable: Ukrainian.

Two other analyzers are available but aren't in use on their respective Wikipedias: Bengali and Estonian. These should be enabled (as unpacked analyzers) and tested, though that will be more involved, since both include stemmers.

Re-indexing after unpacking is not too much actual work if it goes smoothly, but does take time for larger wikis (Spanish, German, Portuguese, Dutch, Japanese, and Ukrainian all have >1M articles) and may be delayed to sync with other SRE activities.

As we unpack more analyzers, we may also need to do some refactoring of AnalysisConfigBuilder. It may also be worthwhile to do some simple bug fixes after unpacking, to minimize re-indexing, but we'll have to see how quickly we proceed and how "simple" the bugs turn out to be. ;)

Done / In Progress—See tickets for deployment status and T147505 for reindexing status

Lots more languages—in alphabetical order

  • Arabic
  • Armenian
  • Bulgarian
  • Hindi
  • Hungarian
  • Irish
  • Latvian
  • Lithuanian
  • Norwegian
  • Persian
  • Romanian
  • Sorani
  • Thai
  • Turkish

Unusual cases

  • Brazilian (only used on br.wikimedia.org)
  • CJK for Japanese (may be worth looking into Kuromoji again)
  • Ukrainian (not from Elasticsearch, may be more complicated)

Not yet in use for some reason—so install and unpack

  • Bengali
  • Estonian

Acceptance Criteria (per analyzer):

  • Unpacked analyzers perform the same as their monolithic counterparts (without general upgrades).
  • Upgraded analyzers either have no unexpected impact (we know what to expect from ICU norm and homoglyph norm, for example), or the impact is reviewed by a speaker of the language.
  • Analysis changes are deployed; re-indexing sub-tasks are created here and linked in T147505.

Related Objects

Event Timeline

CBogen triaged this task as High priority. Jan 25 2021, 4:25 PM
CBogen moved this task from needs triage to Language Stuff on the Discovery-Search board.

Change 672567 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Refactor Analysis Config Builder

https://gerrit.wikimedia.org/r/672567

@TJones, would it be possible to put in a little more detail here about how we're measuring "search can incrementally improve for all users" here: i.e. what test results we're hoping to see? It'd be great for each unpacked analyzer we can also list the test results/improvement for that language

TJones renamed this task from Unpack all Elasticsearch analyzers to [EPIC] Unpack all Elasticsearch analyzers. Mar 17 2021, 7:37 PM
TJones updated the task description.

@TJones, would it be possible to put in a little more detail here about how we're measuring "search can incrementally improve for all users" here: i.e. what test results we're hoping to see? It'd be great for each unpacked analyzer we can also list the test results/improvement for that language

First, note that you can't customize monolithic analyzers; additional configuration is just ignored.

So, there are two aspects to improving search. The first is that we have some analysis chain "upgrades" that we apply automatically to unpacked analysis chains, and some that are available, but added manually. These include

  • Switching from simple lowercasing to ICU normalization (automatic). This enables conversion of many rare characters, atypical forms of characters, and invisible characters, all of which are often hard to detect (not just the invisible ones), depending on fonts. Full details here.
  • Normalizing homoglyphs (automatic). This detects many otherwise invisible-to-the-eye homoglyphs (currently Latin/Cyrillic) and indexes both the original form and the converted form. More info and examples here. Simple example: mixed aрaсe (Cyrillic consonants and Latin vowels) would be indexed along with all-Latin apace and all-Cyrillic арасе—which are subtly different here in italics.
  • Enabling word_break_helper (manual). This breaks up words on underscores (word_break), periods (www.wikipedia.org), and paren(thetical)s.
  • Enabling the Hiragana-to-Katakana map (manual, currently English only). This merges Japanese Hiragana and Katakana so they can find each other; it was a feature request for English, and comments made it seem desirable for other wikis, too (but not Japanese). See more here.
  • Enabling ICU folding (manual). This is a much more aggressive version of ICU normalization; it folds many characters with diacritics, though we generally exclude characters that are in the alphabet of the language (e.g., in English we fold ö to o, but in Swedish we don't).
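Putting the upgrades above together, a hypothetical upgraded chain might look like the following (Spanish as the example; this is a sketch, not the real CirrusSearch config — icu_normalizer and icu_folding come from the analysis-icu plugin, and "homoglyph_norm" is assumed to be the name of the Wikimedia homoglyph filter):

```python
# Hypothetical unpacked-and-upgraded Spanish chain: icu_normalizer replaces
# plain lowercase, homoglyph_norm is added, and icu_folding is configured to
# exclude letters in the Spanish alphabet (here, ñ).
upgraded_spanish = {
    "analysis": {
        "filter": {
            "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
            "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
            "es_icu_folding": {
                "type": "icu_folding",
                # keep ñ/Ñ distinct, since they are part of the Spanish alphabet
                "unicode_set_filter": "[^ñÑ]",
            },
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "homoglyph_norm",  # assumed filter name, for illustration
                    "icu_normalizer",  # upgrade from plain lowercase
                    "spanish_stop",
                    "spanish_stemmer",
                    "es_icu_folding",
                ],
            }
        },
    }
}
```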

Unpacking also allows us to further customize the analysis chains. In the past, we unpacked analyzers specifically so we could make language-specific fixes or improvements (and the other improvements were incidental). Some tickets this would enable fixing include:

That last one might actually be fixed with ICU normalization enabled, so it could be a freebie!

In terms of measurable improvements, a lot of these are individually difficult to detect, since they are subtle improvements in recall. Enwiki users wouldn't necessarily notice that searching for resume failed to find articles that only had the word résumé in them. (German users did notice in the case of Bedusz/Będusz, T226812 above, because there are none of the former to hide the lack of the latter.)
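For intuition about the resume/résumé and Bedusz/Będusz cases, here is a crude stdlib approximation of this kind of diacritic folding (the real ICU folding in the plugin does much more than this):

```python
import unicodedata

def strip_marks(s):
    # Decompose characters (NFKD), then drop the combining marks, so that
    # e.g. é becomes e. A rough stand-in for ICU folding, for illustration.
    return "".join(
        c for c in unicodedata.normalize("NFKD", s)
        if not unicodedata.combining(c)
    )

print(strip_marks("résumé"))  # resume
print(strip_marks("Będusz"))  # Bedusz
```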

It would be possible to list some examples for each unpacked analyzer—like saying, Bedusz and Będusz now index as the same. The lists for each would be similar, though the specific examples would vary depending on what was in the test sample. Would that be helpful?

@TJones, would it be possible to put in a little more detail here

Clearly, that's a "no" on the "little" part. ;)

I would have written a shorter note, but I did not have the time.

Change 672567 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Refactor Analysis Config Builder

https://gerrit.wikimedia.org/r/672567

Is it possible to know, based on a sampling, what percentage of queries in language L are affected by an improvement to L's analyzer? E.g., after unpacking the English analyzer, 10% of queries were affected, including ones that included resume/résumé.

Ideally we would know the scale of the impact in addition to its degree and direction. I recognize that measuring the latter two is not well defined, but I think it would be helpful to know how many queries we're potentially affecting at all. Not that I think we should try to optimize that number specifically, but it's useful to know, just as it's useful to know how many searchers use title matching vs. full-text search (where we are also not currently trying to push that number in either direction).

[ I have tried to be brief... but it's a struggle because it goes against my nature! ]

The problem is that changes in either the query or the index could affect results. So, resume in the query could stay the same, but still get more results because résumé in the index now matches. Or vice versa.

If we want to use this for assessing impact, rather than making a decision before implementation, a reasonable compromise might be a somewhat time-delayed before-and-after comparison: run half the comparison on the live site before the change, and the other half after the change is pushed out to production. Note that the natural variation from edits will be more pronounced the longer the gap between the two halves.

This approach would be less work than building a full modified index in RelForge, and much more accurate than looking at just changes in either a sample of the index or a sample of queries.

The other wrinkle is the query sample. We have some heuristics we typically use to filter bots, power users, and other queries that don't come from "normal humans", so the query sample may be somewhat less representative of all users, but also more representative of typical users (in theory).

If this sounds reasonable, I can try it for Spanish and we can see what happens.

If we want to use this for assessing impact, rather than making a decision before implementation...

Yeah, I was thinking about using this to start tracking our work right now, rather than to make decisions.

We have some heuristics we typically use to filter bots...

I don't mind refining what we're looking at further, but right now I don't think it's even clear on any level, so I was just trying to start broad to get some baselines.

running half the comparison on the live site before the change and the other half after the change is pushed out to production

If I understand this correctly, this seems like a reasonable compromise to try, starting with Spanish. What does "half the comparison" refer to exactly though?

running half the comparison on the live site before the change and the other half after the change is pushed out to production

If I understand this correctly, this seems like a reasonable compromise to try, starting with Spanish. What does "half the comparison" refer to exactly though?

I meant gathering half the comparison data. Ideally, we'd run the comparison on two static snapshots of the wiki, with one configured as "before" and one configured as "after", but that's a lot of work—especially to do it a couple dozen times. Instead, we can run a sample on the live wiki before the change (one half), and then re-run the sample after the change (the other half) and compare. There will be random changes because the wiki is always being updated, but hopefully we can see the improvements.
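The comparison itself can be sketched simply: record hit counts for the query sample before the change, record them again after, and report the fraction that changed. The function names and canned numbers below are illustrative assumptions; the real sampling and bot-filtering heuristics are separate:

```python
def result_counts(run_search, queries):
    """Map each sampled query to its total hit count, using whatever
    search function is provided (e.g., one that calls the live wiki)."""
    return {q: run_search(q) for q in queries}

def impact(before, after):
    """Fraction of sampled queries whose hit counts changed between runs.
    Includes noise from ordinary wiki edits, not just the analyzer change."""
    if not before:
        return 0.0
    changed = [q for q in before if before[q] != after.get(q)]
    return len(changed) / len(before)

# Toy illustration with canned counts standing in for the two live runs:
before = {"resume": 100, "zebra": 5}
after = {"resume": 120, "zebra": 5}  # "resume" now also matches "résumé"
print(impact(before, after))  # 0.5
```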

TJones updated the task description.