See parent task for details.
| Gerrit change | Repository | Branch | Delta |
| --- | --- | --- | --- |
| Unpack German, Portuguese, and Dutch Elasticsearch Analyzers | mediawiki/extensions/CirrusSearch | master | +939 -234 |
| Status | Assignee | Task |
| --- | --- | --- |
| Open | None | T219550 [EPIC] Harmonize language analysis across languages |
| Resolved | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers |
| Resolved | TJones | T281379 Unpack German, Portuguese, and Dutch Elasticsearch Analyzers |
| Resolved | TJones | T284185 Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions |
| Resolved | TJones | T226812 de.wikipedia: search for "Bedusz" does not find "Będusz" |
| Resolved | TJones | T104814 Appropriately ignore diacritics for German-language wikis |
Mentioned In:
- T226812: de.wikipedia: search for "Bedusz" does not find "Będusz"
- T104814: Appropriately ignore diacritics for German-language wikis
- T147505: [tracking] CirrusSearch: what is updated during re-indexing
- T284185: Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
- T87136: ~"daß" should not match "dass"
- T272606: [EPIC] Unpack all Elasticsearch analyzers
- Usual 10K sample each from Wikipedia and Wiktionary for each language
- Unpacking was uneventful (the homoglyph and ICU normalization upgrades were disabled at this stage so the unpacked analyzers could be compared like-for-like against the monolithic originals)
- Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
- Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
- Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
- The expected regression: dotted I (İ) is lowercased as i̇ (i plus combining dot above); fixed with a char_filter mapping
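A minimal sketch of that kind of fix, assuming a `mapping` char_filter that runs before ICU normalization so that İ lowercases to a plain i (the filter name is illustrative; the exact CirrusSearch config may differ):

```json
{
  "analysis": {
    "char_filter": {
      "dotted_I_fix": {
        "type": "mapping",
        "mappings": ["İ => I"]
      }
    }
  }
}
```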
- German required customization to maintain ß for stopword processing.
- Enabled custom ICU folding for each language, saw lots of the usual folding effects.
- Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
- Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)
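Custom ICU folding with per-language exceptions is configured via `unicodeSetFilter`, which restricts folding to a Unicode set, typically written as a negated set of the characters the language's stemmer or stopword list needs to see intact. A sketch for German, with an illustrative exception set (the name and exact set are assumptions, not the literal CirrusSearch config):

```json
{
  "analysis": {
    "filter": {
      "german_icu_folding": {
        "type": "icu_folding",
        "unicodeSetFilter": "[^äöüÄÖÜß]"
      }
    }
  }
}
```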
German Notes:
- ICU normalization interacts with German stop words: mußte gets filtered (as musste), while daß does not get filtered (as dass). Fortunately, a few years ago David patched unicodeSetFilter in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding! Unfortunately, we can't use the same set of exception characters for both, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss' rather than lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
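Put together, a sketch of the German setup described above: one mapping char_filter handles both the dotted-I regression and ẞ → ß, and the ICU normalizer gets a `unicodeSetFilter` that exempts only ß, so Ä, Ö, and Ü still get lowercased. Names and sets here are illustrative assumptions, not the literal CirrusSearch config:

```json
{
  "analysis": {
    "char_filter": {
      "german_charfilter": {
        "type": "mapping",
        "mappings": ["İ => I", "ẞ => ß"]
      }
    },
    "filter": {
      "german_icu_normalizer": {
        "type": "icu_normalizer",
        "name": "nfkc_cf",
        "unicodeSetFilter": "[^ß]"
      }
    }
  }
}
```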
- There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
- Most common ICU normalization is removing soft hyphens, which are generally invisible; they are also more common in German because of the prevalence of long words.
- It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from word_break_helper, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
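For reference, word_break_helper is a mapping char_filter that turns word-joining punctuation into spaces before tokenization, which is why a token like the one above would split into words with it enabled. A sketch under the assumption that it maps underscores, periods, and parentheses (the exact mapping list in CirrusSearch may differ; `\\u0020` passes a literal space to the mapping parser):

```json
{
  "analysis": {
    "char_filter": {
      "word_break_helper": {
        "type": "mapping",
        "mappings": ["_ => \\u0020", ". => \\u0020", "( => \\u0020", ") => \\u0020"]
      }
    }
  }
}
```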
- About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.
- Most common ICU normalization is converting long s (ſ) to s (e.g., Auguſt → August), but that's not bad.
- The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
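Those `\uD800…` sequences are UTF-16 surrogate pairs: characters outside the Basic Multilingual Plane, like Gothic (U+10330 and up), are stored as two 16-bit code units, which is how they surface in these token reports. A quick Python sketch of re-pairing them:

```python
# Each \uD800\uDFxx pair is a UTF-16 surrogate pair encoding one Gothic letter
# (e.g., U+10330 GOTHIC LETTER AHSA). Round-tripping through UTF-16 with
# "surrogatepass" recombines the pairs into the real code points.
escaped = (
    "\ud800\udf30\ud800\udf3d\ud800\udf33\ud800\udf30\ud800\udf43"
    "\ud800\udf44\ud800\udf30\ud800\udf3f\ud800\udf39\ud800\udf3d"
)
gothic = escaped.encode("utf-16", "surrogatepass").decode("utf-16")
print(gothic)  # 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽
print(len(escaped), len(gothic))  # 20 10
```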
- About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
Dutch Notes:
- Most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German rather than Dutch, so that's a good thing.
- There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.
- Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
- About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.
- The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
- About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
Portuguese Notes:
- There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
- The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and ﬁ and ﬂ ligatures being expanded to fi and fl.
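These are all standard NFKC-plus-case-folding effects. ICU's nfkc_cf can be roughly approximated in Python's standard library (this is an approximation for illustration, not the CirrusSearch or ICU code):

```python
import unicodedata

def nfkc_cf(s: str) -> str:
    """Rough stand-in for ICU's nfkc_cf: NFKC-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", s).casefold()

print(nfkc_cf("nº"))   # no   (ordinal indicator º decomposes to o)
print(nfkc_cf("ª"))    # a
print(nfkc_cf("ß"))    # ss   (case folding expands ß)
print(nfkc_cf("ﬁm"))   # fim  (the ﬁ ligature expands to fi)
```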
- Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
- About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.
- There's a very small impact on token counts (0.008% out of ~147K), mostly from homoglyph tokens.
- Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓)
- About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.