
[EPIC] Unpack all Elasticsearch analyzers
Open, HighPublic

Description

User Story: As a search engineer, I want to have general improvements apply to analysis chains for all languages, and to be able to customize and improve individual analysis chains, so that search can incrementally improve for all users.

The intent is to unpack the analyzers (i.e., converting language-specific monolithic analyzers to their constituent parts, based on their breakdown from Elastic) and then apply general improvements (upgrading lowercasing to ICU norm and adding homoglyph norm, for example), test the results, deploy the changes, and re-index.
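As a rough sketch of what "unpacking" means, consider Elastic's documented breakdown of the monolithic portuguese analyzer, rewritten as an equivalent custom analyzer. This is illustrative only (expressed here as a Python settings dict, not the actual CirrusSearch output):

```python
# Sketch of unpacking the monolithic "portuguese" analyzer into its
# documented constituent parts (standard tokenizer + lowercase + stop +
# stemmer). Once written out like this, each part can be swapped or extended.
unpacked_portuguese = {
    "analysis": {
        "filter": {
            "portuguese_stop": {"type": "stop", "stopwords": "_portuguese_"},
            "portuguese_stemmer": {"type": "stemmer", "language": "light_portuguese"},
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "portuguese_stop", "portuguese_stemmer"],
            }
        },
    }
}
```

With the chain written out explicitly, upgrading lowercase to ICU normalization or inserting a new filter becomes a one-line change, which is impossible with the monolithic analyzer.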

There are 27 analyzers to unpack, and there seem to be two more analyzers to enable (which will require more testing, since we’d be enabling new stemmers). Re-indexing takes time, too. Because of the number of analyzers to work on, this task is probably "Epic-ish".

Details: There are 25 Wikipedias using the default Elasticsearch analyzers. All of these need to be unpacked: Arabic, Armenian, Basque, Bulgarian, Catalan, Czech, CJK for Japanese, Danish, Dutch, Finnish, Galician, German, Hindi, Hungarian, Irish, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Sorani, Spanish, Thai, Turkish. There is also a "Brazilian" analyzer from Elastic, for Brazilian Portuguese, which is used by br.wikimedia.org.

There is one non-Elastic monolithic analyzer plugin that we have deployed that should be unpackable: Ukrainian.

Two other analyzers are available but aren't in use on their respective Wikipedias: Bengali and Estonian. These should be enabled (as unpacked analyzers) and tested, though that will be more involved, since both include stemmers.

Re-indexing after unpacking is not too much actual work if it goes smoothly, but does take time for larger wikis (Spanish, German, Portuguese, Dutch, Japanese, and Ukrainian all have >1M articles) and may be delayed to sync with other SRE activities.

As we unpack more analyzers, we may also need to do some refactoring of AnalysisConfigBuilder. It may also be worthwhile to do some simple bug fixes after unpacking, to minimize re-indexing, but we'll have to see how quickly we proceed and how "simple" the bugs turn out to be. ;)

Done / In Progress—See tickets for deployment status and T147505 for reindexing status

Lots more languages—in alphabetical order

  • Arabic
  • Armenian
  • Bulgarian
  • Hindi
  • Hungarian
  • Irish
  • Latvian
  • Lithuanian
  • Norwegian
  • Persian
  • Romanian
  • Sorani
  • Thai
  • Turkish

Unusual cases

  • Brazilian (only used on br.wikimedia.org)
  • CJK for Japanese (may be worth looking into Kuromoji again)
  • Ukrainian (not from Elasticsearch, may be more complicated)

Not yet in use for some reason—so install and unpack

  • Bengali
  • Estonian

Acceptance Criteria (per analyzer):

  • Unpacked analyzers perform the same as their monolithic counterparts (without general upgrades).
  • Upgraded analyzers either have no unexpected impact (we know what to expect from ICU norm and homoglyph norm, for example), or the impact is reviewed by a speaker of the language.
  • Analysis changes are deployed; re-indexing sub-tasks are created here and linked in T147505.

Related Objects

Event Timeline

CBogen triaged this task as High priority. Jan 25 2021, 4:25 PM
CBogen moved this task from needs triage to Language Stuff on the Discovery-Search board.

Change 672567 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Refactor Analysis Config Builder

https://gerrit.wikimedia.org/r/672567

@TJones, would it be possible to put in a little more detail here about how we're measuring "search can incrementally improve for all users" here: i.e. what test results we're hoping to see? It'd be great for each unpacked analyzer we can also list the test results/improvement for that language

TJones renamed this task from Unpack all Elasticsearch analyzers to [EPIC] Unpack all Elasticsearch analyzers. Mar 17 2021, 7:37 PM
TJones updated the task description.

@TJones, would it be possible to put in a little more detail here about how we're measuring "search can incrementally improve for all users" here: i.e. what test results we're hoping to see? It'd be great for each unpacked analyzer we can also list the test results/improvement for that language

First, note that you can't customize monolithic analyzers; additional configuration is just ignored.

So, there are two aspects to improving search. The first is that we have some analysis chain "upgrades" that we apply automatically to unpacked analysis chains, and some that are available, but added manually. These include

  • Switching from simple lowercasing to ICU normalization (automatic). This enables conversion of many rare characters, atypical forms of characters, and invisible characters, all of which are often hard to detect (not just the invisible ones), depending on fonts. Full details here.
  • Normalizing homoglyphs (automatic). This detects many otherwise invisible-to-the-eye homoglyphs (currently Latin/Cyrillic) and indexes both the original form and the converted form. More info and examples here. Simple example: mixed aрaсe (Cyrillic consonants and Latin vowels) would be indexed along with all-Latin apace and all-Cyrillic арасе—which are subtly different here in italics.
  • Enabling word_break_helper (manual). This breaks up words on underscores (word_break), periods (www.wikipedia.org), and paren(thetical)s.
  • Enabling the Hiragana-to-Katakana map (manual, currently English only). This merges Japanese Hiragana and Katakana so they can find each other; it was a feature request for English, and comments made it seem desirable for other wikis, too (but not Japanese). See more here.
  • Enabling ICU folding (manual). This is a much more aggressive version of ICU normalization; it folds many characters with diacritics, though we generally exclude characters that are in the alphabet of the language (e.g., in English we fold ö to o, but in Swedish we don't).
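Putting the upgrades above together, a hypothetical upgraded chain might look like the following (Spanish as the example; this is a sketch, not the real CirrusSearch config — icu_normalizer and icu_folding come from the analysis-icu plugin, and "homoglyph_norm" is assumed to be the name of the Wikimedia homoglyph filter):

```python
# Hypothetical unpacked-and-upgraded Spanish chain: icu_normalizer replaces
# plain lowercase, homoglyph_norm is added, and icu_folding is configured to
# exclude letters in the Spanish alphabet (here, ñ).
upgraded_spanish = {
    "analysis": {
        "filter": {
            "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
            "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
            "es_icu_folding": {
                "type": "icu_folding",
                # keep ñ/Ñ distinct, since they are part of the Spanish alphabet
                "unicode_set_filter": "[^ñÑ]",
            },
        },
        "analyzer": {
            "text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "homoglyph_norm",  # assumed filter name, for illustration
                    "icu_normalizer",  # upgrade from plain lowercase
                    "spanish_stop",
                    "spanish_stemmer",
                    "es_icu_folding",
                ],
            }
        },
    }
}
```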

Unpacking also allows us to further customize the analysis chains. In the past, we unpacked analyzers specifically so we could make language-specific fixes or improvements (and the other improvements were incidental). Some tickets this would enable fixing include:

That last one might actually be fixed with ICU normalization enabled, so it could be a freebie!

In terms of measurable improvements, a lot of these are individually difficult to detect, since they are subtle improvements in recall. Enwiki users wouldn't necessarily notice that searching for resume failed to find articles that only had the word résumé in them. (German users did notice in the case of Bedusz/Będusz, T226812 above, because there are none of the former to hide the lack of the latter.)
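For intuition about the resume/résumé and Bedusz/Będusz cases, here is a crude stdlib approximation of this kind of diacritic folding (the real ICU folding in the plugin does much more than this):

```python
import unicodedata

def strip_marks(s):
    # Decompose characters (NFKD), then drop the combining marks, so that
    # e.g. é becomes e. A rough stand-in for ICU folding, for illustration.
    return "".join(
        c for c in unicodedata.normalize("NFKD", s)
        if not unicodedata.combining(c)
    )

print(strip_marks("résumé"))  # resume
print(strip_marks("Będusz"))  # Bedusz
```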

It would be possible to list some examples for each unpacked analyzer—like saying, Bedusz and Będusz now index as the same. The lists for each would be similar, though the specific examples would vary depending on what was in the test sample. Would that be helpful?

@TJones, would it be possible to put in a little more detail here

Clearly, that's a "no" on the "little" part. ;)

I would have written a shorter note, but I did not have the time.

Change 672567 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Refactor Analysis Config Builder

https://gerrit.wikimedia.org/r/672567

Is it possible to know, based on a sampling, what percentage of queries in language L are affected by an improvement to L's analyzer? E.g., after unpacking the English analyzer, 10% of queries were affected, including ones that included resume/résumé.

Ideally we would know the scale of the impact in addition to its degree and direction. I recognize that measuring the latter two is not well defined, but I think it would be helpful to know how many queries we're potentially affecting at all. Not that I think we should try to optimize that number specifically, but it's useful to know, just as it's useful to know how many searchers use title matching vs. full-text search (where we are also not currently trying to push that number in either direction).

[ I have tried to be brief... but it's a struggle because it goes against my nature! ]

The problem is that changes in either the query or the index could affect results. So, resume in the query could stay the same, but still get more results because résumé in the index now matches. Or vice versa.

If we want to use this for assessing impact, rather than making a decision before implementation, a reasonable compromise might be a somewhat time-delayed before-and-after comparison: run half the comparison on the live site before the change, and the other half after the change is pushed out to production. Note that the natural variation from edits will be more pronounced the longer the gap between the two halves.

This approach would be less work than building a full modified index in RelForge, and much more accurate than looking at just changes in either a sample of the index or a sample of queries.

The other wrinkle is the query sample. We have some heuristics we typically use to filter bots, power users, and other queries that don't come from "normal humans", so the query sample may be somewhat less representative of all users, but also more representative of typical users (in theory).

If this sounds reasonable, I can try it for Spanish and we can see what happens.

If we want to use this for assessing impact, rather than making a decision before implementation...

Yeah, I was thinking about using this to start tracking our work right now, rather than to make decisions.

We have some heuristics we typically use to filter bots...

I don't mind refining what we're looking at further, but right now I don't think it's even clear on any level, so I was just trying to start broad to get some baselines.

running half the comparison on the live site before the change and the other half after the change is pushed out to production

If I understand this correctly, this seems like a reasonable compromise to try, starting with Spanish. What does "half the comparison" refer to exactly though?

running half the comparison on the live site before the change and the other half after the change is pushed out to production

If I understand this correctly, this seems like a reasonable compromise to try, starting with Spanish. What does "half the comparison" refer to exactly though?

I meant gathering half the comparison data. Ideally, we'd run the comparison on two static snapshots of the wiki, with one configured as "before" and one configured as "after", but that's a lot of work—especially to do it a couple dozen times. Instead, we can run a sample on the live wiki before the change (one half), and then re-run the sample after the change (the other half) and compare. There will be random changes because the wiki is always being updated, but hopefully we can see the improvements.
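The comparison itself can be sketched simply: record hit counts for the query sample before the change, record them again after, and report the fraction that changed. The function names and canned numbers below are illustrative assumptions; the real sampling and bot-filtering heuristics are separate:

```python
def result_counts(run_search, queries):
    """Map each sampled query to its total hit count, using whatever
    search function is provided (e.g., one that calls the live wiki)."""
    return {q: run_search(q) for q in queries}

def impact(before, after):
    """Fraction of sampled queries whose hit counts changed between runs.
    Includes noise from ordinary wiki edits, not just the analyzer change."""
    if not before:
        return 0.0
    changed = [q for q in before if before[q] != after.get(q)]
    return len(changed) / len(before)

# Toy illustration with canned counts standing in for the two live runs:
before = {"resume": 100, "zebra": 5}
after = {"resume": 120, "zebra": 5}  # "resume" now also matches "résumé"
print(impact(before, after))  # 0.5
```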

TJones updated the task description.