Unpack Basque, Catalan, Danish Elasticsearch Analyzers
Closed, Resolved · Public · 5 Estimated Story Points
Assigned To

Authored By

	TJones
	May 21 2021, 3:27 PM

Description

See parent task for details.

(These three were chosen next because they use the Latin alphabet, aren't too complex, and are alphabetically next.)

Details

	Subject	Repo	Branch	Lines +/-
	Unpack Basque, Catalan, Danish Elasticsearch Analyzers	mediawiki/extensions/CirrusSearch	master	+1 K -271

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T283366 Unpack Basque, Catalan, Danish Elasticsearch Analyzers
Resolved	TJones	T284691 Reindex Basque, Catalan, Danish wikis to enable unpacked versions

Event Timeline

TJones created this task. May 21 2021, 3:27 PM

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers.

TJones renamed this task from Unpack Basque, Catalan, Danish to Unpack Basque, Catalan, Danish Elasticsearch Analyzers. May 21 2021, 3:30 PM

TJones edited projects, added Discovery-Search; removed Discovery-Search (Current work).

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones set the point value for this task to 5.

TJones moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones claimed this task. May 25 2021, 5:24 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Change 698600 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Basque, Catalan, Danish Elasticsearch Analyzers

https://gerrit.wikimedia.org/r/698600

gerritbot added a project: Patch-For-Review. Jun 7 2021, 6:45 PM

Basque, Catalan, and Danish Notes

Usual 10K sample over a 1–4 week period from Wikipedia and Wiktionary for each language.
Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.
Stemming observations:
- Catalan Wikipedia had up to 180(!) distinct tokens in stemming groups.
- Basque Wikipedia had up to 200(!!) distinct tokens in stemming groups.
- Danish Wikipedia had a mere 30 distinct tokens in its largest stemming group.
Unpacking was uneventful (with the homoglyph and ICU normalization upgrades disabled).
- Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
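For readers unfamiliar with the term, "unpacking" means replacing a monolithic built-in language analyzer with an equivalent custom analyzer assembled from its components, so that extra filters can be slotted in later. Below is a minimal sketch for Catalan, written as a Python dict that mirrors the Elasticsearch settings JSON. The component list follows the Elasticsearch documentation for rebuilding the `catalan` analyzer (the optional keyword-marker filter is left out). The analyzer name `text` and the function name are illustrative assumptions, not CirrusSearch's actual configuration.

```python
import json


def unpacked_catalan_settings():
    """Return index settings for a custom analyzer equivalent to `catalan`.

    Sketch only: the names here are illustrative and may differ from
    what CirrusSearch actually generates.
    """
    return {
        "analysis": {
            "filter": {
                # Strips elided articles such as l' and d' from token starts.
                "catalan_elision": {
                    "type": "elision",
                    "articles": ["d", "l", "m", "n", "s", "t"],
                    "articles_case": True,
                },
                "catalan_stop": {"type": "stop", "stopwords": "_catalan_"},
                "catalan_stemmer": {"type": "stemmer", "language": "catalan"},
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Order matters: elision, then lowercase, stop, stem.
                    "filter": [
                        "catalan_elision",
                        "lowercase",
                        "catalan_stop",
                        "catalan_stemmer",
                    ],
                }
            },
        }
    }


print(json.dumps(unpacked_catalan_settings(), indent=2))
```

Once the analyzer is expressed this way, upgrades such as homoglyph handling or ICU normalization can be added as further entries in the filter chain.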
Enabled homoglyphs and found a handful of examples in all six samples.
- Catalan Wikipedia had two mixed–Cyrillic/Greek/Latin tokens!
- Found Greek/Latin examples in all three Wikipedias and Danish Wiktionary, and Greek/Cyrillic in Catalan Wikipedia.
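As a rough illustration of what a mixed-script token looks like, the sketch below flags tokens whose letters come from more than one of Latin, Cyrillic, and Greek. It infers the script from the Unicode character name, which is an approximation chosen for brevity. It only detects candidates and does not reproduce the homoglyph normalization itself. The function names are hypothetical.

```python
import unicodedata

SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")


def scripts_in(token):
    """Return the subset of SCRIPTS used by the letters in token."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue
        # Character names start with the script, e.g. "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found


def is_mixed_script(token):
    """True if the token mixes at least two of Latin, Cyrillic, Greek."""
    return len(scripts_in(token)) > 1


# "chat" with a Cyrillic а (U+0430) in place of the Latin a.
print(is_mixed_script("ch\u0430t"))
# Plain Latin "chat".
print(is_mixed_script("chat"))
```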
Enabled ICU normalization and saw the usual normalizations.
- The expected regression: Dotted I (İ) is lowercased as i̇ (i plus combining dot above, U+0307). Fixed with a char_filter map.
- Most common normalizations: lots of ß and invisibles (soft-hyphen, bidi marks, etc.) all around; 1ª, 1º for Basque and Catalan Wikipedias, and some full-width characters for Catalan Wikipedia.
- Catalan Wikipedia also loses a lot (12K+ out of 4.1M) of "E⎵" and "O⎵" tokens, where ⎵ represents a "zero-width no-break space" (U+FEFF). "e" and "o" are stop words—"o" means "or", but "e" just seems to refer to the letter; weird. The versions with U+FEFF seem to be used exclusively in coordinates ("E" stands for "est", which is "east"; "O" stands for "oest", which is "west"). Since the coords are very exact (e.g., "42.176388888889°N,3.0416666666667°E"), I don't think many people are searching for them specifically, and if they are, the plain field will help them out.
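The two effects above can be reproduced outside Elasticsearch. The sketch below uses plain Python string operations as a stand-in for the analysis chain. The mapping İ => I is an assumption about what the char_filter map contains, the stop list is a two-word sample, and U+FEFF is stripped explicitly to mimic what ICU normalization does to it.

```python
# Assumed content of the char_filter map, applied before lowercasing.
DOTTED_I_FIX = {"\u0130": "I"}  # İ => I

# Tiny illustrative subset of the Catalan stop word list.
STOP_SAMPLE = {"e", "o"}


def apply_char_map(text, mapping):
    """Replace single characters per mapping, like a mapping char_filter."""
    return "".join(mapping.get(ch, ch) for ch in text)


def lower_with_fix(text):
    """Lowercase after mapping İ to I, avoiding the stray U+0307."""
    return apply_char_map(text, DOTTED_I_FIX).lower()


def surviving_tokens(tokens, strip_feff):
    """Lowercase tokens and drop stop words, optionally removing U+FEFF first."""
    out = []
    for tok in tokens:
        if strip_feff:
            tok = tok.replace("\ufeff", "")
        tok = tok.lower()
        if tok not in STOP_SAMPLE:
            out.append(tok)
    return out


# Without the fix, lowercasing İ yields "i" followed by U+0307.
print(ascii("\u0130stanbul".lower()))
print(ascii(lower_with_fix("\u0130stanbul")))

coords = ["E\ufeff", "O\ufeff", "nord"]
# Before: the U+FEFF variants are not recognized as stop words.
print(len(surviving_tokens(coords, strip_feff=False)))
# After: they reduce to "e" and "o" and are dropped as stop words.
print(len(surviving_tokens(coords, strip_feff=True)))
```

This is why the token count drops: once the invisible character is removed, the coordinate suffixes become ordinary stop words.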
Enabled custom ICU folding for each language, saw lots of the usual folding effects.
- Exempted [ñ] for Basque and [æ, ø, å] for Danish. [ç] was unclear for Basque and Catalan, but I let it be folded to c for both for the first pass.
- ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
- Basque: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
- Catalan Wiktionary: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
- Catalan Wikipedia:
  - Lots of high-impact collisions (ten or more distinct words merged into another group—often two largish groups merging). They came in three flavors:
    - The majority are ç → c; most look ok
    - A few ñ → n; these look good; mostly low frequency Spanish cognates merging with Catalan ones
    - Single letters merging with diacritical variants, like [eː, e̞, e͂, ê, ē, Ĕ, ɛ, ẹ, ẽ, ẽː] merging with [È, É, è, é]
  - Surprisingly, lots of Japanese Katakana changes, deleting the prolonged sound mark ー.
- Danish: Also straightened a fair number of curly quotes.
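The per-language exemptions can be pictured with a simplified folding function. The sketch below strips combining marks after NFD decomposition and skips any exempted characters. It is a stand-in for ICU folding, not an implementation of it: real ICU folding also handles characters with no decomposition (for example ø or IPA letters), and in Elasticsearch the exemption is normally expressed through the `unicode_set_filter` option of the `icu_folding` filter. The names `fold`, `BASQUE_EXEMPT`, and `DANISH_EXEMPT` are hypothetical.

```python
import unicodedata

BASQUE_EXEMPT = "ñÑ"
DANISH_EXEMPT = "æøåÆØÅ"


def fold(text, exempt=""):
    """Strip diacritics from text, leaving characters in exempt untouched."""
    text = unicodedata.normalize("NFC", text)
    exempt = unicodedata.normalize("NFC", exempt)
    out = []
    for ch in text:
        if ch in exempt:
            out.append(ch)
            continue
        # Decompose, then drop combining marks (nonzero combining class).
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(fold("plaça"))                         # ç folds to c
print(fold("España", exempt=BASQUE_EXEMPT))  # ñ is preserved
print(fold("Åbenrå façade", exempt=DANISH_EXEMPT))
```

With the exemption in place, words that differ only by the exempted letter stay in separate stemming groups, while everything else merges with its unaccented form.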

Overall Impact

There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters. (But see Catalan Wikipedia.)
ICU folding is the biggest source of changes in all wikis—as expected.
Generally, the merges that resulted from ICU folding were significant, but not extreme (0.5% to 1.5% of tokens being redistributed into 1% to 3% of stemming groups).
- Basque Wiktionary: 649 tokens (1.111% of tokens) were merged into 473 groups (2.330% of groups)
- Basque Wikipedia: 27,620 tokens (1.175% of tokens) were merged into 3,244 groups (1.325% of groups)
- Catalan Wiktionary: 840 tokens (0.520% of tokens) were merged into 400 groups (1.181% of groups)
- Catalan Wikipedia:
  - 12.7K fewer tokens out of 4.1M (see "E⎵" and "O⎵" above)
  - 39,099 tokens (0.943% of tokens) were merged into 2,513 groups (0.967% of groups)
- Danish Wiktionary: 1,515 tokens (1.387% of tokens) were merged into 904 groups (2.788% of groups)
- Danish Wikipedia: 20,778 tokens (0.611% of tokens) were merged into 2,990 groups (1.023% of groups)
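As a quick sanity check on these figures, the sample size implied by each count and percentage can be recovered by division. The sketch below does this for two of the rows. Because the percentages are rounded to three decimals, the results are approximate. The function name is illustrative.

```python
def implied_total(count, percent):
    """Estimate the total from a count and the percentage it represents."""
    return round(count / (percent / 100))


# Catalan Wikipedia: 39,099 tokens are 0.943% of all tokens,
# which should land near the 4.1M quoted above.
print(implied_total(39099, 0.943))

# Basque Wiktionary: 649 tokens are 1.111% of all tokens.
print(implied_total(649, 1.111))
```

The Catalan Wikipedia estimate agrees with the roughly 4.1M tokens mentioned for that sample.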