Description
See parent task for details.
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Unpack German, Portuguese, and Dutch Elasticsearch Analyzers | mediawiki/extensions/CirrusSearch | master | +939 -234 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T219550 [EPIC] Harmonize language analysis across languages |
| Resolved | | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers |
| Resolved | | TJones | T281379 Unpack German, Portuguese, and Dutch Elasticsearch Analyzers |
| Resolved | | TJones | T284185 Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions |
| Resolved | | TJones | T226812 de.wikipedia: search for "Bedusz" does not find "Będusz" |
| Resolved | | TJones | T104814 Appropriately ignore diacritics for German-language wikis |
Event Timeline
I'm going to try to do three at once (well, sequentially, but as one patch). I've upped the points from 3 to 5... we'll see if that's reasonable!
Change 692700 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
General Notes
- Usual 10K sample each from Wikipedia and Wiktionary for each language
- Unpacking was uneventful (disabled homoglyph and ICU normalization upgrades)
- Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
- Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
- Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
- The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
- German required customization to maintain ß for stopword processing.
- Enabled custom ICU folding for each language, saw lots of the usual folding effects.
- Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
- Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)
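To make the unpacking concrete, here is a rough sketch of what an unpacked analyzer configuration looks like, expressed as an Elasticsearch settings dict in Python. The component names (`german_charfilter`, `german_icu_folding`, etc.) are illustrative, not the exact CirrusSearch identifiers, though `unicode_set_filter` is the real parameter name from the analysis-icu plugin:

```python
# Hypothetical sketch of an "unpacked" German analyzer: the monolithic
# "german" analyzer is replaced by its component char filters, tokenizer,
# and token filters, so individual pieces (homoglyph handling, ICU
# normalization, ICU folding) can be inserted or customized.
unpacked_german = {
    "char_filter": {
        "german_charfilter": {
            "type": "mapping",
            # Fix the dotted-I regression and keep ẞ as ß so that
            # stopword processing still sees daß/mußte correctly.
            "mappings": ["İ=>I", "ẞ=>ß"],
        },
    },
    "filter": {
        "german_stop": {"type": "stop", "stopwords": "_german_"},
        "german_stemmer": {"type": "stemmer", "language": "light_german"},
        "german_icu_folding": {
            "type": "icu_folding",
            # Exempt ß from folding so stopwords are not broken.
            "unicode_set_filter": "[^ß]",
        },
    },
    "analyzer": {
        "text": {
            "type": "custom",
            "char_filter": ["german_charfilter"],
            "tokenizer": "standard",
            "filter": ["icu_normalizer", "german_stop", "german_stemmer",
                       "german_icu_folding"],
        },
    },
}
```

With the pieces exposed like this, enabling homoglyphs or adjusting ICU folding is a matter of editing one list rather than replacing the whole analyzer.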
German Notes
General German
- ICU normalization interacts with German stop words: mußte gets filtered (as musste), but daß does not get filtered (as dass). Fortunately, a few years ago David patched unicodeSetFilter in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding! Unfortunately, we can't use the same set of exception characters for both ICU folding and ICU normalization, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss' rather than lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
- There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
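The ß behavior described above can be seen with Python's built-in Unicode case mappings, which follow the same Unicode rules; this is an analogy for what ICU normalization does, not the CirrusSearch code itself:

```python
# Unicode full case folding turns ß into "ss" (similar to what ICU
# normalization does), while simple lowercasing leaves ß alone, which is
# why "mußte" becomes "musste" and stops matching the ß-form stopwords
# unless ß is exempted.
assert "mußte".casefold() == "musste"
assert "mußte".lower() == "mußte"

# Capital ẞ (U+1E9E) lowercases to ß, but full folding goes to "ss",
# hence the char_filter mapping of ẞ to ß instead.
assert "ẞ".lower() == "ß"
assert "ẞ".casefold() == "ss"

# The dotted-I regression: İ (U+0130) lowercases to i plus combining dot
# above (U+0307), so a char filter maps İ to plain I first.
assert "İ".lower() == "i\u0307"
```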
German Wikipedia:
- Most common ICU normalization is removing soft hyphens, which are generally invisible, but also more common in German because of the prevalence of long words.
- It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from word_break_helper, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
- About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.
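For illustration, here is a minimal Python stand-in for what word_break_helper-style breaking would do to such a token (a hypothetical helper, standing in for a char filter that maps underscores to spaces before tokenization):

```python
def break_words(token: str) -> list[str]:
    # Map underscores to spaces, then split on whitespace; this is
    # roughly what a word_break_helper char filter achieves before the
    # standard tokenizer runs.
    return token.replace("_", " ").split()

print(break_words("la_pasion_por_goya_en_zuloaga_y_su_circulo"))
# ['la', 'pasion', 'por', 'goya', 'en', 'zuloaga', 'y', 'su', 'circulo']
```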
German Wiktionary:
- The most common ICU normalization is converting long s (ſ) to s (e.g., Auguſt), which is harmless.
- The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
- About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
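The \uD800\uDF30… sequences above are UTF-16 surrogate pairs; the arithmetic for recovering the underlying code point is standard UTF-16 decoding (this is a generic sketch, not CirrusSearch code):

```python
import unicodedata

def decode_surrogate_pair(high: int, low: int) -> str:
    # Standard UTF-16 decoding: combine a high surrogate (D800-DBFF)
    # and a low surrogate (DC00-DFFF) into one supplementary-plane
    # code point.
    codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(codepoint)

# The first pair of the Gothic token: \uD800\uDF30 is U+10330.
ahsa = decode_surrogate_pair(0xD800, 0xDF30)
assert ahsa == "\U00010330"
assert unicodedata.name(ahsa) == "GOTHIC LETTER AHSA"
```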
Dutch Notes
General Dutch
- The most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The 'ss' versions of the words seem to be mostly German rather than Dutch, so that's a good thing.
- There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.
Dutch Wikipedia:
- Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
- About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.
Dutch Wiktionary:
- The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
- About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
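The syllabified/unsyllabified merging comes from ICU folding dropping the middle dot (U+00B7); a minimal Python sketch of that one effect:

```python
def fold_middle_dots(token: str) -> str:
    # ICU folding removes the middle dot used for syllabification, so
    # the syllabified and plain forms index to the same token.
    return token.replace("\u00b7", "")

assert fold_middle_dots("zes·hon·derd·vier·en·der·tig·jes") == "zeshonderdvierendertigjes"
assert fold_middle_dots("Xe·no·kra·tie") == "Xenokratie"
```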
Portuguese Notes
Portuguese Wikipedia:
- There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, and o, all of which are stop words (but still captured by the plain field).
- The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and the ﬁ and ﬂ ligatures being expanded to fi and fl.
- Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
- About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.
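The ordinal-indicator and ligature normalizations correspond to Unicode compatibility normalization (NFKC), which Python's unicodedata can demonstrate directly; again, this is an analogy for the ICU normalization step, not the production code:

```python
from unicodedata import normalize

# ª (U+00AA) and º (U+00BA) are compatibility equivalents of a and o,
# so "nº" normalizes to "no", a Portuguese stop word.
assert normalize("NFKC", "nº") == "no"
assert normalize("NFKC", "ª") == "a"

# The ﬁ (U+FB01) and ﬂ (U+FB02) ligatures expand to letter sequences.
assert normalize("NFKC", "ﬁlho") == "filho"
assert normalize("NFKC", "ﬂuxo") == "fluxo"
```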
Portuguese Wiktionary:
- There's a very small impact on token counts (0.008% out of ~147K); the additions are mostly homoglyphs.
- The longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓).
- About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.
Change 692700 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers