
Unpack Basque, Catalan, Danish Elasticsearch Analyzers
Closed, Resolved · Public · 5 Estimated Story Points

Description

See parent task for details.

(These three were chosen next because they use the Latin alphabet, aren't too complex, and are alphabetically next.)

Event Timeline

TJones renamed this task from Unpack Basque, Catalan, Danish to Unpack Basque, Catalan, Danish Elasticsearch Analyzers. (May 21 2021, 3:30 PM)
TJones set the point value for this task to 5.

Change 698600 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Basque, Catalan, Danish Elasticsearch Analyzers

https://gerrit.wikimedia.org/r/698600

Basque, Catalan, and Danish Notes

  • Usual 10K sample over a 1–4 week period from Wikipedia and Wiktionary for each language.
  • Usual distribution of tokens—lots of CJK one-character tokens; long tokens are URLs, \u encoded tokens, file names, numbers, etc.
  • Stemming observations:
    • Catalan Wikipedia had up to 180(!) distinct tokens in stemming groups.
    • Basque Wikipedia had up to 200(!!) distinct tokens in stemming groups.
    • Danish Wikipedia had a mere 30 distinct tokens in its largest stemming group.
  • Unpacking was uneventful (with the homoglyph and ICU normalization upgrades disabled).
    • Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
  • Enabled homoglyphs and found a handful of examples in all six samples.
    • Catalan Wikipedia had two mixed–Cyrillic/Greek/Latin tokens!
    • Found Greek/Latin examples in all three Wikipedias and Danish Wiktionary, and Greek/Cyrillic in Catalan Wikipedia.
  • Enabled ICU normalization and saw the usual normalizations.
    • The expected regression: dotted I (İ) is lowercased as i̇ (i plus a combining dot above, U+0307); fixed with a char_filter map.
    • Most common normalizations: lots of ß and invisibles (soft-hyphen, bidi marks, etc.) all around; 1ª, 1º for Basque and Catalan Wikipedias, and some full-width characters for Catalan Wikipedia.
    • Catalan Wikipedia also loses a lot (12K+ out of 4.1M) of "E⎵" and "O⎵" tokens, where ⎵ represents a "zero-width no-break space" (U+FEFF). With U+FEFF normalized away, these become "e" and "o", which are stop words and get dropped. ("o" means "or", but "e" just seems to refer to the letter; weird.) The versions with U+FEFF seem to be used exclusively in coordinates ("E" stands for "est", which is "east"; "O" stands for "oest", which is "west"). Since the coords are very exact (e.g., "42.176388888889°N,3.0416666666667°E"), I don't think many people are searching for them specifically, and if they are, the plain field will help them out.
  • Enabled custom ICU folding for each language, saw lots of the usual folding effects.
    • Exempted [ñ] for Basque and [æ, ø, å] for Danish. [ç] was unclear for Basque and Catalan, but I let it be folded to c for both for the first pass.
    • ˈstressˌmarks, ɪᴘᴀ ɕɦɑʀɐƈʈɛʁʂ, and dìáçrïťɨčãł marks were normalized all around.
    • Basque: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
    • Catalan Wiktionary: ç → c is not 100% clear in all cases, but seems to be overall beneficial.
    • Catalan Wikipedia:
      • Lots of high-impact collisions (ten or more distinct words merged into another group—often two largish groups merging). They came in three flavors:
        • The majority are ç → c; most look ok
        • A few ñ → n; these look good; mostly low frequency Spanish cognates merging with Catalan ones
        • Single letters merging with diacritical variants, like [eː, e̞, e͂, ê, ē, Ĕ, ɛ, ẹ, ẽ, ẽː] merging with [È, É, è, é]
      • Surprisingly, lots of Japanese Katakana changes, deleting the prolonged sound mark ー.
    • Danish: Also straightened a fair number of curly quotes.
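
The dotted-I fix and the folding exemptions in the notes above can be sketched roughly in Python. This is only an illustration, not the CirrusSearch configuration. The function names and the FOLDING_EXEMPTIONS table are hypothetical. Replacing İ with I before lowercasing is an assumed stand-in for the char_filter map. NFD decomposition plus combining-mark removal is much narrower than ICU folding: it leaves ø, æ, ß, and IPA letters alone.

```python
import unicodedata

# Hypothetical exemption table, filled in from the notes above.
FOLDING_EXEMPTIONS = {
    "eu": set("ñ"),    # Basque
    "da": set("æøå"),  # Danish
    "ca": set(),       # Catalan: nothing exempted in the first pass
}


def lowercase_with_dotted_i_fix(text: str) -> str:
    """Map İ (U+0130) to plain I before lowercasing.

    Without the mapping, lowercasing İ yields 'i' followed by
    U+0307 (combining dot above) rather than a plain 'i'.
    """
    return text.replace("\u0130", "I").lower()


def fold_diacritics(token: str, lang: str) -> str:
    """Strip combining marks, leaving exempted letters untouched.

    A rough approximation of diacritic folding, not ICU folding itself.
    """
    keep = FOLDING_EXEMPTIONS.get(lang, set())
    out = []
    # Compose first so exempted letters appear as single code points.
    for ch in unicodedata.normalize("NFC", token):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        # combining() returns 0 for base characters, non-zero for marks.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    print(lowercase_with_dotted_i_fix("\u0130stanbul"))  # istanbul
    print(fold_diacritics("pla\u00e7a", "ca"))           # placa
    print(fold_diacritics("se\u00f1or", "eu"))           # señor (ñ exempted)
    print(fold_diacritics("se\u00f1or", "ca"))           # senor
```

The exemption check runs before decomposition, so an exempted letter such as Danish å is never split into a base letter plus a combining mark.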

Overall Impact
  • There were few token count differences in most cases, mostly from extra homoglyph tokens or fewer solo combining characters. (But see Catalan Wikipedia.)
  • ICU folding is the biggest source of changes in all wikis—as expected.
  • Generally, the merges that resulted from ICU folding were significant, but not extreme (0.5% to 1.5% of tokens being redistributed into 1% to 3% of stemming groups).
    • Basque Wiktionary: 649 tokens (1.111% of tokens) were merged into 473 groups (2.330% of groups)
    • Basque Wikipedia: 27,620 tokens (1.175% of tokens) were merged into 3,244 groups (1.325% of groups)
    • Catalan Wiktionary: 840 tokens (0.520% of tokens) were merged into 400 groups (1.181% of groups)
    • Catalan Wikipedia:
      • 12.7K fewer tokens out of 4.1M (see "E⎵" and "O⎵" above)
      • 39,099 tokens (0.943% of tokens) were merged into 2,513 groups (0.967% of groups)
    • Danish Wiktionary: 1,515 tokens (1.387% of tokens) were merged into 904 groups (2.788% of groups)
    • Danish Wikipedia: 20,778 tokens (0.611% of tokens) were merged into 2,990 groups (1.023% of groups)
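
As a sanity check on the figures above, the merged-token counts and percentages can be inverted to estimate each sample's total token count. The sketch below assumes the percentages are relative to the total number of tokens in the sample; the names `reported` and `implied_total` are made up for illustration.

```python
# (tokens merged, percent of tokens) as reported in the list above.
reported = {
    "Basque Wiktionary": (649, 1.111),
    "Basque Wikipedia": (27_620, 1.175),
    "Catalan Wiktionary": (840, 0.520),
    "Catalan Wikipedia": (39_099, 0.943),
    "Danish Wiktionary": (1_515, 1.387),
    "Danish Wikipedia": (20_778, 0.611),
}


def implied_total(merged: int, percent: float) -> int:
    """Total token count implied by 'merged is percent% of all tokens'."""
    return round(merged / (percent / 100))


implied_totals = {wiki: implied_total(m, p) for wiki, (m, p) in reported.items()}

if __name__ == "__main__":
    for wiki, total in implied_totals.items():
        print(f"{wiki}: ~{total:,} tokens")
```

For Catalan Wikipedia this gives a total between 4.1M and 4.2M tokens, which agrees with the "4.1M" figure quoted for that sample.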

Change 698600 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Basque, Catalan, Danish Elasticsearch Analyzers

https://gerrit.wikimedia.org/r/698600