See parent task for details.
| Gerrit change | Repository | Branch | Delta |
| --- | --- | --- | --- |
| Unpack German, Portuguese, and Dutch Elasticsearch Analyzers | mediawiki/extensions/CirrusSearch | master | +939 -234 |
| Status | Assignee | Task |
| --- | --- | --- |
| Open | None | T219550 [EPIC] Harmonize language analysis across languages |
| Resolved | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers |
| Resolved | TJones | T281379 Unpack German, Portuguese, and Dutch Elasticsearch Analyzers |
| Resolved | TJones | T284185 Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions |
| Resolved | TJones | T226812 de.wikipedia: search for "Bedusz" does not find "Będusz" |
| Resolved | TJones | T104814 Appropriately ignore diacritics for German-language wikis |
Mentioned In:
- T226812: de.wikipedia: search for "Bedusz" does not find "Będusz"
- T104814: Appropriately ignore diacritics for German-language wikis
- T147505: [tracking] CirrusSearch: what is updated during re-indexing
- T284185: Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
- T87136: ~"daß" should not match "dass"
- T272606: [EPIC] Unpack all Elasticsearch analyzers
- Usual 10K sample each from Wikipedia and Wiktionary for each language
- Unpacking was uneventful (the homoglyph and ICU normalization upgrades were disabled at this stage so the unpacked analyzers could be compared like-for-like against the monolithic originals)
- Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
- Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
- Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
- The expected regression: dotted I (İ) is lowercased as i̇ (i plus combining dot above); fixed with a char_filter mapping
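A minimal sketch of that kind of fix, assuming a `mapping` char_filter that runs before ICU normalization so that İ lowercases to a plain i (the filter name is illustrative; the exact CirrusSearch config may differ):

```json
{
  "analysis": {
    "char_filter": {
      "dotted_I_fix": {
        "type": "mapping",
        "mappings": ["İ => I"]
      }
    }
  }
}
```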
- German required customization to maintain ß for stopword processing.
- Enabled custom ICU folding for each language, saw lots of the usual folding effects.
- Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
- Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)
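Custom ICU folding with per-language exceptions is configured via `unicodeSetFilter`, which restricts folding to a Unicode set, typically written as a negated set of the characters the language's stemmer or stopword list needs to see intact. A sketch for German, with an illustrative exception set (the name and exact set are assumptions, not the literal CirrusSearch config):

```json
{
  "analysis": {
    "filter": {
      "german_icu_folding": {
        "type": "icu_folding",
        "unicodeSetFilter": "[^äöüÄÖÜß]"
      }
    }
  }
}
```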
German Notes:
- ICU normalization interacts with German stop words: mußte gets filtered (as musste), while daß does not get filtered (as dass). Fortunately, a few years ago David patched unicodeSetFilter in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding! Unfortunately, we can't use the same set of exception characters for both, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss' rather than lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
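Put together, a sketch of the German setup described above: one mapping char_filter handles both the dotted-I regression and ẞ → ß, and the ICU normalizer gets a `unicodeSetFilter` that exempts only ß, so Ä, Ö, and Ü still get lowercased. Names and sets here are illustrative assumptions, not the literal CirrusSearch config:

```json
{
  "analysis": {
    "char_filter": {
      "german_charfilter": {
        "type": "mapping",
        "mappings": ["İ => I", "ẞ => ß"]
      }
    },
    "filter": {
      "german_icu_normalizer": {
        "type": "icu_normalizer",
        "name": "nfkc_cf",
        "unicodeSetFilter": "[^ß]"
      }
    }
  }
}
```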
- There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
- Most common ICU normalization is removing soft hyphens, which are generally invisible; they are also more common in German because of the prevalence of long words.
- It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from word_break_helper, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
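For reference, word_break_helper is a mapping char_filter that turns word-joining punctuation into spaces before tokenization, which is why a token like the one above would split into words with it enabled. A sketch under the assumption that it maps underscores, periods, and parentheses (the exact mapping list in CirrusSearch may differ; `\\u0020` passes a literal space to the mapping parser):

```json
{
  "analysis": {
    "char_filter": {
      "word_break_helper": {
        "type": "mapping",
        "mappings": ["_ => \\u0020", ". => \\u0020", "( => \\u0020", ") => \\u0020"]
      }
    }
  }
}
```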
- About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.
- Most common ICU normalization is converting long s (ſ) to s (e.g., Auguſt → August), but that's not bad.
- The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
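Those `\uD800…` sequences are UTF-16 surrogate pairs: characters outside the Basic Multilingual Plane, like Gothic (U+10330 and up), are stored as two 16-bit code units, which is how they surface in these token reports. A quick Python sketch of re-pairing them:

```python
# Each \uD800\uDFxx pair is a UTF-16 surrogate pair encoding one Gothic letter
# (e.g., U+10330 GOTHIC LETTER AHSA). Round-tripping through UTF-16 with
# "surrogatepass" recombines the pairs into the real code points.
escaped = (
    "\ud800\udf30\ud800\udf3d\ud800\udf33\ud800\udf30\ud800\udf43"
    "\ud800\udf44\ud800\udf30\ud800\udf3f\ud800\udf39\ud800\udf3d"
)
gothic = escaped.encode("utf-16", "surrogatepass").decode("utf-16")
print(gothic)  # 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽
print(len(escaped), len(gothic))  # 20 10
```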
- About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
Dutch Notes:
- Most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German rather than Dutch, so that's a good thing.
- There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.
- Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
- About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.
- The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
- About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
Portuguese Notes:
- There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
- The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and ﬁ and ﬂ ligatures being expanded to fi and fl.
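These are all standard NFKC-plus-case-folding effects. ICU's nfkc_cf can be roughly approximated in Python's standard library (this is an approximation for illustration, not the CirrusSearch or ICU code):

```python
import unicodedata

def nfkc_cf(s: str) -> str:
    """Rough stand-in for ICU's nfkc_cf: NFKC-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", s).casefold()

print(nfkc_cf("nº"))   # no   (ordinal indicator º decomposes to o)
print(nfkc_cf("ª"))    # a
print(nfkc_cf("ß"))    # ss   (case folding expands ß)
print(nfkc_cf("ﬁm"))   # fim  (the ﬁ ligature expands to fi)
```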
- Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
- About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.
- There's a very small impact on token counts (0.008% out of ~147K), mostly from homoglyph tokens.
- Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓)
- About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.