Unpack Spanish Elasticsearch Analyzer
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

Authored By

	TJones
	Mar 17 2021, 8:02 PM

Description

See parent task for details.

	Subject	Repo	Branch	Lines +/-
	Unpack Spanish Analyzer	mediawiki/extensions/CirrusSearch	master	+134 -51

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T277699 Unpack Spanish Elasticsearch Analyzer
Resolved	TJones	T282808 Reindex Spanish-language wikis to enable unpacked version of Spanish analysis

• MPhamWMF set the point value for this task to 5. (Mar 22 2021, 3:30 PM)

Change 683106 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Spanish Analyzer

Spanish Notes

Usual 10K sample each from Wikipedia and Wiktionary
Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
- Note that "word_break_helper" is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
Enabled homoglyphs and found a few examples in each sample
Enabled ICU normalization and saw the usual normalization
- Lots more long-s's (ſ) in Wiktionary than expected (e.g., confeſſion), but that's not bad.
- The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
- Potential concerns:
  - 1ª and 1º are frequently used ordinals that get normalized as 1a and 1o. Not too bad.
  - However, º is often used as a degree symbol: 07º45'23 → 07o45'23, which still isn't terrible.
  - nº gets mapped to no, which is a stop word. pº gets mapped to po. This isn't great, but it is already happening in the plain field, so it also isn't terrible. (The plain field also rescues nº.)
Enabled ICU folding (with an exception for ñ) and saw the usual foldings. No concerns.
Updated test fixtures for Spanish and multi-language tests.
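
The normalization effects listed above can be approximated outside Elasticsearch. The sketch below uses Python's standard-library NFKC normalization plus case folding as a rough stand-in for ICU normalization; it is not the ICU plugin itself, so treat the results as illustrative. The names `DOTTED_I_FIX` and `approx_icu_normalize` are hypothetical and only mirror the idea of the char_filter map.

```python
import unicodedata

# Hypothetical stand-in for the char_filter map: rewrite dotted capital I
# (U+0130) to plain I before any lowercasing or case folding happens.
DOTTED_I_FIX = {"\u0130": "I"}


def approx_icu_normalize(text: str, fix_dotted_i: bool = True) -> str:
    """Rough approximation of ICU normalization: NFKC, then case folding."""
    if fix_dotted_i:
        text = text.translate(str.maketrans(DOTTED_I_FIX))
    return unicodedata.normalize("NFKC", text).casefold()


# Without the fix, dotted I turns into "i" plus a combining dot above (U+0307).
assert approx_icu_normalize("\u0130", fix_dotted_i=False) == "i\u0307"
assert approx_icu_normalize("\u0130") == "i"

# Ordinal indicators and long s lose their compatibility formatting.
assert approx_icu_normalize("1\u00aa") == "1a"          # 1ª
assert approx_icu_normalize("n\u00ba") == "no"          # nº
assert approx_icu_normalize("confe\u017f\u017fion") == "confession"
```

Mapping the character before tokenization and lowercasing is what keeps the stray combining dot from ever being produced.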

Refactored the building of mapping character filters. After unpacking, many of them exist only to handle dotted I.
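
As a rough illustration of an unpacked analyzer with a shared builder for mapping character filters, here is a sketch of index settings assembled as a Python dict. The stop and stemmer pieces follow the custom-analyzer equivalent of the built-in `spanish` analyzer in the Elasticsearch documentation. The names `mapping_char_filter`, `dotted_I_fix`, and `spanish_icu_folding`, the choice to put ICU normalization in place of lowercasing, and the filter order are all assumptions for illustration, not the actual CirrusSearch configuration. Homoglyph handling is left out.

```python
def mapping_char_filter(mappings: dict) -> dict:
    """Build a `mapping` char filter definition from a source->target dict.

    Hypothetical helper: one place to build every mapping char filter.
    """
    return {
        "type": "mapping",
        "mappings": [f"{src}=>{dst}" for src, dst in mappings.items()],
    }


def unpacked_spanish_settings() -> dict:
    """Sketch of analysis settings for an unpacked Spanish analyzer."""
    return {
        "analysis": {
            "char_filter": {
                # Map dotted capital I (U+0130) to plain I before lowercasing.
                "dotted_I_fix": mapping_char_filter({"\u0130": "I"}),
            },
            "filter": {
                "spanish_stop": {"type": "stop", "stopwords": "_spanish_"},
                "spanish_stemmer": {"type": "stemmer", "language": "light_spanish"},
                # ICU folding with an exception for ñ (assumed filter syntax).
                "spanish_icu_folding": {
                    "type": "icu_folding",
                    "unicode_set_filter": "[^ñÑ]",
                },
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "char_filter": ["dotted_I_fix"],
                    "tokenizer": "standard",
                    "filter": [
                        "icu_normalizer",  # assumed replacement for lowercase
                        "spanish_stop",
                        "spanish_stemmer",
                        "spanish_icu_folding",
                    ],
                }
            },
        }
    }


settings = unpacked_spanish_settings()
```

The `icu_normalizer` and `icu_folding` filters require the ICU analysis plugin to be installed on the cluster.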

Tokenization/Indexing Impacts

Wikipedia (eswiki)
- There's a very small impact on token counts (-0.03% out of ~2.8M); these are mostly tokens like nº, ª, and º, which normalize to no, a, and o, which are stop words (but captured by the plain field).
- About 1.2% of tokens merged with other tokens. Tokens in queries are likely to be affected at a roughly similar rate.
Wiktionary (eswikt)
- There's a much bigger impact on token counts (-2.1% out of ~100K); the biggest group of these are ª in phrases like 1.ª and 2.ª ("first person", "second person", etc.), so not really something that will be reflected in queries.
- Only about 0.2% of tokens merge with other tokens, so not a big impact on Wiktionary.
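
The token-count drops come from normalization turning ordinal markers into stop words. A minimal Python sketch of that mechanism, using standard-library NFKC normalization as a stand-in for ICU normalization and a three-word stop list taken from the notes above (the token list and function name are made up for illustration):

```python
import unicodedata

# Assumed tiny subset of Spanish stop words: only the ones the notes mention.
STOP_WORDS = {"no", "a", "o"}


def surviving_tokens(tokens, normalize):
    """Lowercase tokens, optionally NFKC-normalize them, then drop stop words."""
    kept = []
    for token in tokens:
        if normalize:
            token = unicodedata.normalize("NFKC", token)
        token = token.lower()
        if token not in STOP_WORDS:
            kept.append(token)
    return kept


# Hypothetical token stream containing nº and a bare ª.
tokens = ["n\u00ba", "1", "\u00aa", "persona"]

before = surviving_tokens(tokens, normalize=False)
after = surviving_tokens(tokens, normalize=True)

assert before == ["n\u00ba", "1", "\u00aa", "persona"]  # nothing is a stop word yet
assert after == ["1", "persona"]                        # "no" and "a" were dropped
```

For scale, -0.03% of ~2.8M is roughly 840 tokens on the Wikipedia sample, and -2.1% of ~100K is roughly 2,100 tokens on the Wiktionary sample.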