Description
See parent task for details.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Unpack Spanish Analyzer | mediawiki/extensions/CirrusSearch | master | +134 -51
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T219550 [EPIC] Harmonize language analysis across languages
Resolved | | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved | | TJones | T277699 Unpack Spanish Elasticsearch Analyzer
Resolved | | TJones | T282808 Reindex Spanish-language wikis to enable unpacked version of Spanish analysis
Event Timeline
Change 683106 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Unpack Spanish Analyzer
Spanish Notes
- Usual 10K sample each from Wikipedia and Wiktionary
- Unpacking was uneventful (with the homoglyph and ICU normalization upgrades initially disabled); a sketch of the unpacked configuration follows this list.
- Note that "word_break_helper" is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
- Enabled homoglyphs and found a few examples in each sample
- Enabled ICU normalization and saw the usual normalization
  - Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
  - The expected regression: dotted I (İ) is lowercased as i̇ (i plus combining dot above); fixed with a char_filter mapping.
- Potential concerns:
  - 1ª and 1º are frequently used ordinals that get normalized to 1a and 1o (see the normalization example after this list). Not too bad.
  - However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
  - nº gets mapped to no, which is a stop word. pº gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
- Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
- Updated test fixtures for Spanish and multi-language tests.
- Refactored the building of mapping character filters, since after unpacking so many of them exist just to deal with dotted I.
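Roughly, the unpacked config has the shape below, written as the kind of PHP array CirrusSearch's AnalysisConfigBuilder produces. The component names (dotted_I_fix, homoglyph_norm, spanish_icu_folding) and the exact filter ordering are illustrative assumptions, not copied from the patch; the point is that the monolithic spanish analyzer is replaced by a custom analyzer whose char filters and token filters can be adjusted individually.

```php
<?php
// Sketch only: names and ordering are assumptions, not the patch's exact output.
$analysisConfig = [
	'char_filter' => [
		// Map dotted capital I to plain I before lowercasing, so İ does not
		// become i + combining dot above (the expected regression noted above).
		'dotted_I_fix' => [
			'type' => 'mapping',
			'mappings' => [ 'İ=>I' ],
		],
	],
	'filter' => [
		'spanish_stop' => [
			'type' => 'stop',
			'stopwords' => '_spanish_',
		],
		'spanish_stemmer' => [
			'type' => 'stemmer',
			'language' => 'light_spanish',
		],
		// ICU folding with an exception so ñ/Ñ are not folded to n/N.
		// (Parameter is unicodeSetFilter in older Elasticsearch versions.)
		'spanish_icu_folding' => [
			'type' => 'icu_folding',
			'unicode_set_filter' => '[^ñÑ]',
		],
	],
	'analyzer' => [
		'text' => [
			'type' => 'custom',
			'char_filter' => [ 'dotted_I_fix' ],
			'tokenizer' => 'standard',
			// icu_normalizer (token filter) takes the place of plain lowercase;
			// homoglyph_norm stands in for the homoglyph-handling filter.
			'filter' => [
				'homoglyph_norm',
				'icu_normalizer',
				'spanish_stop',
				'spanish_stemmer',
				'spanish_icu_folding',
			],
		],
	],
];
```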
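For reference, the ordinal-indicator behavior above is just Unicode compatibility normalization (NFKC), which is part of what ICU normalization applies, along with case folding. The snippet below uses PHP's intl Normalizer purely as an illustration; it is not what CirrusSearch itself calls.

```php
<?php
// NFKC maps the ordinal indicators ª (U+00AA) and º (U+00BA) to plain a and o.
// Requires the intl extension.
echo Normalizer::normalize( '1ª 1º Nº 07º45', Normalizer::FORM_KC );
// prints: 1a 1o No 07o45
// After case folding (which icu_normalizer also does), Nº becomes the stop word "no".
```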
Tokenization/Indexing Impacts
- Wikipedia (eswiki)
  - There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like 'nº, ª, and º, which normalize to no, a, and o. Those are stop words (but they are captured by the plain field).
  - About 1.2% of tokens merged with other tokens. The tokens in queries are likely to be somewhat similar.
- Wiktionary (eswikt)
  - There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these is ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
  - Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.
Change 683106 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Unpack Spanish Analyzer