
Unpack Spanish Elasticsearch Analyzer
Closed, Resolved · Public · 5 Estimated Story Points


See parent task for details.

Event Timeline

Change 683106 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Spanish Analyzer

Spanish Notes

  • Usual 10K sample each from Wikipedia and Wiktionary
  • Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
    • Note that "word_break_helper" is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a few examples in each sample
  • Enabled ICU normalization and saw the usual normalization
    • Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
    • The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
    • Potential concerns:
      • 1ª and 1º are frequently used ordinals that get normalized as 1a and 1o. Not too bad.
      • However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
      • nº gets mapped to no, which is a stop word. pº gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
  • Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
  • Updated test fixtures for Spanish and multi-language tests.
  • Refactored the building of mapping character filters, since so many of them exist just to handle dotted I after unpacking.
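As a rough illustration of what "unpacking" a monolithic analyzer produces, here is a sketch of the resulting analysis settings in Python-dict form. The component names, parameter spellings, and filter ordering are assumptions for illustration only, not the actual CirrusSearch output:

```python
# Hypothetical unpacked Spanish analyzer settings (Elasticsearch
# analysis config expressed as a Python dict). Names like
# "spanish_charfilter" and "icu_folding_es" are made up for this sketch.
spanish_analysis = {
    "char_filter": {
        # The dotted-I fix described above: map İ before lowercasing
        # so ICU normalization doesn't produce i + combining dot.
        "spanish_charfilter": {
            "type": "mapping",
            "mappings": ["İ => I"],
        },
    },
    "filter": {
        "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
        "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
        # ICU folding with an exception so ñ is not folded to n.
        "icu_folding_es": {
            "type": "icu_folding",
            "unicode_set_filter": "[^ñ]",
        },
    },
    "analyzer": {
        "text": {
            "type": "custom",
            "char_filter": ["spanish_charfilter"],
            "tokenizer": "standard",
            "filter": [
                "icu_normalizer",
                "spanish_stop",
                "spanish_stemmer",
                "icu_folding_es",
            ],
        },
    },
}
```

Unpacking into explicit components like these is what makes it possible to insert the homoglyph, ICU normalization, and ICU folding upgrades one at a time.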

Tokenization/Indexing Impacts

  • Wikipedia (eswiki)
    • There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like nº, ª, and º, which normalize to no, a, and o, which are stop words (but captured by the plain field).
    • About 1.2% of tokens merged with other tokens; tokens in queries are likely to merge at a roughly similar rate.
  • Wiktionary (eswikt)
    • There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these are ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
    • Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.

Change 683106 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Spanish Analyzer