Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

Authored By

	TJones
	Apr 28 2021, 3:25 PM

Description

See parent task for details.

Details

	Subject	Repo	Branch	Lines +/-
	Unpack German, Portuguese, and Dutch Elasticsearch Analyzers	mediawiki/extensions/CirrusSearch	master	+939 -234


Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T281379 Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
Resolved	TJones	T284185 Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
Resolved	TJones	T226812 de.wikipedia: search for "Bedusz" does not find "Będusz"
Resolved	TJones	T104814 Appropriately ignore diacritics for German-language wikis

Event Timeline

TJones created this task. Apr 28 2021, 3:25 PM

TJones set the point value for this task to 5.

• MPhamWMF triaged this task as High priority. Apr 28 2021, 4:55 PM

• MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. May 3 2021, 3:11 PM

TJones removed the point value for this task. May 3 2021, 3:27 PM

• MPhamWMF set the point value for this task to 3. May 3 2021, 3:55 PM

• MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones renamed this task from Unpack German Elasticsearch Analyzer to Unpack German, Portuguese, and Dutch Elasticsearch Analyzers. May 13 2021, 3:04 PM

TJones claimed this task.

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

TJones changed the point value for this task from 3 to 5.

I'm going to try to do three at once (well, sequentially, but as one patch). I've upped the points from 3 to 5... we'll see if that's reasonable!

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. May 13 2021, 6:16 PM

Change 692700 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers

https://gerrit.wikimedia.org/r/692700

gerritbot added a project: Patch-For-Review. May 18 2021, 8:04 PM

TJones mentioned this in T87136: ~"daß" should not match "dass". May 18 2021, 8:23 PM

General Notes

Usual 10K sample each from Wikipedia and Wiktionary for each language
Unpacking was uneventful (with the homoglyph and ICU normalization upgrades disabled).
Note that word_break_helper is no longer configured. However, it doesn't do anything with a monolithic analyzer, so there is no change in functionality.
Enabled homoglyphs and found a few examples in all three Wiktionary samples and the Portuguese Wikipedia sample.
Enabled ICU normalization and saw the usual normalization in most cases (but see German Notes below)
- The expected regression: Dotted I (İ) is lowercased as i̇ — fixed with a char_filter map
- German required customization to maintain ß for stopword processing.
Enabled custom ICU folding for each language, saw lots of the usual folding effects.
- Most impactful ICU folding for all three Wikipedias (and Portuguese Wiktionary) is converting curly apostrophes to straight apostrophes so that (mostly French and some English) words match either way: d'Europe vs d’Europe or Don’t vs Don't.
- Most common ICU folding for the other two Wiktionaries is removing middle dots from syllabified versions of words: Xe·no·kra·tie vs Xenokratie or qua·dra·fo·ni·scher vs quadrafonischer. (Portuguese uses periods for syllabification, so they remain.)
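
The dotted-I regression noted above can be reproduced outside Elasticsearch. The following is a minimal sketch, assuming Python's str.lower() behaves like the analyzer's lowercasing step for this character; DOTTED_I_MAP is a hypothetical stand-in for the char_filter map, not the actual CirrusSearch configuration.

```python
# Illustration only: str.lower() stands in for the analyzer's lowercasing,
# and DOTTED_I_MAP is a hypothetical stand-in for the char_filter map.
DOTTED_I_MAP = str.maketrans({"\u0130": "I"})  # map İ to plain I first


def lowercase_naive(text: str) -> str:
    """Lowercase without any pre-mapping (shows the regression)."""
    return text.lower()


def lowercase_with_char_map(text: str) -> str:
    """Apply the character map, then lowercase (shows the fix)."""
    return text.translate(DOTTED_I_MAP).lower()


naive = lowercase_naive("\u0130stanbul")
fixed = lowercase_with_char_map("\u0130stanbul")

# The naive result starts with "i" followed by a combining dot above.
print([f"U+{ord(c):04X}" for c in naive[:2]])
print(fixed)
```

Mapping before lowercasing mirrors the order in which a char_filter runs ahead of the token filters, so the lowercasing step never sees İ.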

German Notes

General German

ICU normalization interacts with German stop words because it converts ß to ss: mußte gets filtered (as musste) and daß does not get filtered (as dass). Fortunately, a few years ago, David patched unicodeSetFilter in Elasticsearch so that it can be applied to ICU normalization as well as ICU folding! Unfortunately, we can't use the same set of exception characters for both ICU folding and ICU normalization, because then Ä, Ö, and Ü don't get lowercased, which seems bad. It's further complicated by the fact that capital ẞ gets normalized to 'ss' rather than to lowercase ß, so I mapped ẞ to ß in the same character filter needed to fix the dotted-I regression.
There is almost no impact on token counts—only 2 tokens from dewiki were lost (Japanese prolonged sound marks used in isolation) and none from dewikt.
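
The stop word interaction described above can be sketched in a few lines. This is an illustration under stated assumptions: per-character str.casefold() stands in for ICU normalization, the protected set plays the role of the unicodeSetFilter exceptions, STOP_WORDS is a two-word list inferred from the note rather than the real German stop list, and CHAR_MAP is a hypothetical version of the character filter.

```python
# Hypothetical two-word stop list inferred from the note above.
STOP_WORDS = {"da\u00DF", "musste"}  # "daß", "musste"

# Hypothetical character map: capital ẞ -> ß and İ -> I, applied first.
CHAR_MAP = str.maketrans({"\u1E9E": "\u00DF", "\u0130": "I"})


def normalize(token, protected=frozenset()):
    """Casefold every character except those in the protected set."""
    token = token.translate(CHAR_MAP)
    return "".join(ch if ch in protected else ch.casefold() for ch in token)


def is_filtered(token, protected=frozenset()):
    """True if the normalized token is in the stop list."""
    return normalize(token, protected) in STOP_WORDS


KEEP_ESZETT = frozenset({"\u00DF"})

for word in ("da\u00DF", "mu\u00DFte"):
    print(word, is_filtered(word), is_filtered(word, KEEP_ESZETT))

# Protecting Ä as well would leave it uppercase, which is why the
# exception set for normalization has to be narrower than for folding.
print(normalize("\u00C4rger", KEEP_ESZETT))
print(normalize("\u00C4rger", frozenset({"\u00DF", "\u00C4"})))
```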

German Wikipedia:

Most common ICU normalization is removing soft hyphens, which are generally invisible, but also more common in German because of the prevalence of long words.
It's German, so of course there are tokens like rollstuhlbasketballnationalmannschaft, but among the longer tokens were also some that would benefit from word_break_helper, like la_pasion_por_goya_en_zuloaga_y_su_circulo.
About 0.3% of tokens (0.6% of unique tokens) merged with others in dewiki.
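
To show why a token like la_pasion_por_goya_en_zuloaga_y_su_circulo would benefit from word_break_helper, here is a rough sketch. It assumes that the helper maps underscores to spaces before tokenization and uses whitespace splitting as a stand-in for the real tokenizer; WORD_BREAK_MAP and tokenize are hypothetical names, and the actual filter may map other characters as well.

```python
# Assumption: the helper turns underscores into spaces before tokenizing.
WORD_BREAK_MAP = str.maketrans({"_": " "})


def tokenize(text: str, use_word_break_helper: bool) -> list:
    """Whitespace tokenization as a stand-in for the real tokenizer."""
    if use_word_break_helper:
        text = text.translate(WORD_BREAK_MAP)
    return text.lower().split()


title = "la_pasion_por_goya_en_zuloaga_y_su_circulo"
without_helper = tokenize(title, use_word_break_helper=False)
with_helper = tokenize(title, use_word_break_helper=True)

print(len(without_helper), len(with_helper))
print(with_helper)
```

Without the mapping, the whole string stays a single long token, so a search for goya or zuloaga would not match it.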

German Wiktionary:

The most common ICU normalization is converting long s (ſ) to s (e.g., Auguſt), but that's not bad.
The longest tokens in my German Wiktionary sample are of this sort: \uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D, which is the internal representation of Gothic 𐌰𐌽𐌳𐌰𐍃𐍄𐌰𐌿𐌹𐌽.
About 2.2% of tokens (10.6% of unique tokens) merged with others in dewikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
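
The \u-escaped Gothic token above is a sequence of UTF-16 surrogate pairs. The following sketch shows how such a dump can be turned back into readable characters using the standard surrogate-pair formula; decode_surrogate_escapes is a hypothetical helper name and is not part of any analysis tooling mentioned here.

```python
import re

# A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF),
# both written as literal \uXXXX escapes.
SURROGATE_PAIR = re.compile(
    r"\\u(D[89AB][0-9A-F]{2})\\u(D[C-F][0-9A-F]{2})", re.IGNORECASE
)


def decode_surrogate_escapes(text: str) -> str:
    """Replace each escaped surrogate pair with the character it encodes."""

    def combine(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

    return SURROGATE_PAIR.sub(combine, text)


token = (
    r"\uD800\uDF30\uD800\uDF3D\uD800\uDF33\uD800\uDF30\uD800\uDF43"
    r"\uD800\uDF44\uD800\uDF30\uD800\uDF3F\uD800\uDF39\uD800\uDF3D"
)
decoded = decode_surrogate_escapes(token)
print(decoded, len(decoded))
```

The escaped form is 120 characters long while the decoded word has 10, which is why these tokens show up among the longest in the sample.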

Dutch Notes

General Dutch

The most common ICU normalizations are removing soft hyphens and normalizing ß to 'ss'. The ss versions of words seem to mostly be German, rather than Dutch, so that's a good thing.
There is almost no impact on token counts—only 6 tokens from nlwikt were added (homoglyphs) and none from nlwiki.

Dutch Wikipedia:

Like German, Dutch has its share of long words, like cybercriminaliteitsonderzoek.
About 0.2% of tokens (0.4% of unique tokens) merged with others in nlwiki.

Dutch Wiktionary:

The longest words in Wiktionary are regular long words, with syllable breaks added, like zes·hon·derd·vier·en·der·tig·jes.
About 3.1% of tokens (12.1% of unique tokens) merged with others in nlwikt—this number is very large because of the general pattern of merging syllabified words with their unsyllabified versions.
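
The two percentages above distinguish tokens from unique tokens. One plausible way to count such merges is sketched below; this is not necessarily the method behind the reported figures. The fold function only strips middle dots as a stand-in for ICU folding, and the sample tokens are made up for illustration.

```python
from collections import defaultdict

# Stand-in for ICU folding: only removes the middle dot (U+00B7).
FOLD_MAP = str.maketrans({"\u00B7": None})


def fold(token: str) -> str:
    return token.translate(FOLD_MAP)


def merge_stats(tokens):
    """Return (share of tokens, share of unique tokens) that merge."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[fold(tok)].add(tok)
    # A unique token "merges" if another distinct token folds to the same form.
    merged_types = {t for forms in groups.values() if len(forms) > 1 for t in forms}
    merged_tokens = sum(1 for tok in tokens if tok in merged_types)
    return merged_tokens / len(tokens), len(merged_types) / len(set(tokens))


sample = [
    "zes\u00B7hon\u00B7derd\u00B7vier\u00B7en\u00B7der\u00B7tig\u00B7jes",
    "zeshonderdvierendertigjes",
    "fiets",
    "fiets",
    "huis",
]
token_share, type_share = merge_stats(sample)
print(token_share, type_share)
```

Frequent words such as fiets count several times in the token total but once in the unique total, which is why the unique-token percentage can be much larger than the token percentage.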

Portuguese Notes

Portuguese Wikipedia:

There's a very small impact on token counts (-0.05% out of ~1.9M); these are mostly tokens like nº, nª, ª, º, which normalize to no, na, a, o, which are stop words (but captured by the plain field).
The most common ICU normalizations are ª and º being converted to a and o, ß being converted to ss, and ﬁ and ﬂ ligatures being expanded to fi and fl.
Long tokens are a mix of \u encoded Cuneiform, file names with underscores, and domain names (words separated by periods).
About 0.5% of tokens (0.6% of unique tokens) merged with others in ptwiki.
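
The normalizations listed above can be approximated with the Python standard library. This is a sketch, assuming that NFKC followed by str.casefold() is close enough to ICU normalization for these characters; PT_STOP_WORDS contains only the four stop words named in the note, and approx_normalize is a hypothetical helper.

```python
import unicodedata

# Only the four stop words named in the note above.
PT_STOP_WORDS = {"no", "na", "a", "o"}


def approx_normalize(token: str) -> str:
    """Approximate ICU normalization with NFKC plus case folding."""
    return unicodedata.normalize("NFKC", token).casefold()


examples = [
    "n\u00BA",       # nº
    "n\u00AA",       # nª
    "\u00AA",        # ª
    "\u00BA",        # º
    "\uFB01m",       # word starting with the fi ligature
    "Stra\u00DFe",   # Straße
]

for tok in examples:
    norm = approx_normalize(tok)
    print(repr(tok), "->", repr(norm), "stop word" if norm in PT_STOP_WORDS else "kept")
```

In this sketch the ordinal tokens all normalize to stop words, which would account for the small drop in token count once the stop filter runs.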

Portuguese Wiktionary:

There's a very small impact on token counts (0.008% out of ~147K), which are mostly homoglyphs.
Longest words are a mix of syllabified words, like co.ro.no.gra.fo.po.la.ri.me.tr, and \u-encoded scripts like \uD800\uDF00\uD800\uDF0D\uD800\uDF15\uD800\uDF04\uD800\uDF13 (Old Italic 𐌀𐌍𐌕𐌄𐌓).
About 0.8% of tokens (1.3% of unique tokens) merged with others in ptwikt.