
Properly handle language-specific lowercasing in language analyzers
Closed, Resolved (Public)

Description

While unpacking the Greek analysis chain to add a filter for zero-length tokens (see the parent task, T203117), I found that the Greek "lowercase" filter does more than lowercase. It also converts final sigma (ς) to regular sigma (σ) and, more importantly, removes some very common Greek diacritics (tonos in particular, but also dialytika). ICU normalization, which usually replaces the lowercase filter in our analysis chains, does not remove those diacritics.
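To make the three behaviors concrete, here is a rough Python approximation of what the description above says the Greek "lowercase" filter does. It is only a sketch: the function name `greek_lowercase` is made up for illustration, and the approach (decompose, strip two combining marks, recompose) is a stand-in for the real filter's logic, not a copy of it.

```python
import unicodedata

# Combining marks that the Greek tonos and dialytika decompose to under NFD.
# (Assumption for this sketch: stripping these two marks is close enough
# to the filter's diacritic handling for illustration.)
COMBINING_TONOS = "\u0301"      # combining acute accent
COMBINING_DIALYTIKA = "\u0308"  # combining diaeresis


def greek_lowercase(text: str) -> str:
    """Approximate Greek-specific lowercasing (hypothetical helper)."""
    # 1. Ordinary lowercasing.
    lowered = text.lower()
    # 2. Decompose so accented letters become base letter + combining mark,
    #    then drop tonos and dialytika.
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(
        ch for ch in decomposed
        if ch not in (COMBINING_TONOS, COMBINING_DIALYTIKA)
    )
    # 3. Recompose and fold final sigma into regular sigma.
    return unicodedata.normalize("NFC", stripped).replace("ς", "σ")


print(greek_lowercase("Ελληνικός"))  # ελληνικοσ
print(greek_lowercase("προϊόν"))     # προιον
```

Plain `str.lower()` on its own would leave the tonos and the final sigma in place ("ελληνικός"), which is the kind of gap described above.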

Rather than hack together a kludge that only addresses the Greek "text" analysis chain, I wanted to handle language-specific lowercasing properly (it also occurs for Turkish and Irish). Everywhere we replace "lowercase" with "icu_normalizer", we lose this language-specific normalization. We already created a partial workaround for Greek by enabling ICU Folding for the plain field, but the Greek "lowercase" normalization should happen in many other fields, too.
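As an illustration of why generic lowercasing is not enough for Turkish, here is a minimal Python sketch of the dotted/dotless "i" handling. The function name `turkish_lowercase` is hypothetical and this is not the analyzer's own code; Irish has its own rules, which are not covered here.

```python
import unicodedata


def turkish_lowercase(text: str) -> str:
    """Approximate Turkish-specific lowercasing (hypothetical helper)."""
    # Compose "I" + combining dot above into the single letter "İ" first.
    composed = unicodedata.normalize("NFC", text)
    # Turkish pairs: dotted İ/i and dotless I/ı. Map the capitals by hand,
    # because generic lowercasing turns "I" into "i" rather than "ı".
    return composed.replace("İ", "i").replace("I", "ı").lower()


print(turkish_lowercase("ISPARTA"))   # ısparta
print(turkish_lowercase("İSTANBUL"))  # istanbul
```

Generic lowercasing would give "isparta" for the first example, merging two letters that Turkish keeps distinct.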

I'll report the Turkish and Irish test results here; the Greek results will go in the parent task, along with the rest of the Greek testing.

Event Timeline

Change 494846 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

https://gerrit.wikimedia.org/r/494846

After refactoring the lowercase-to-ICU-normalization upgrade code for Greek (T203117) so that the lowercase filter is kept when it is language-specific, I needed to test the other language-specific cases: Turkish and Irish. The impact is positive but small, because it is limited to the plain field and the other non-text fields. The text field already gets lang-specific lowercasing because its analyzers have not been unpacked. Full details on MediaWiki.
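The "keep it if it is language-specific" rule can be sketched roughly as follows. This is a Python illustration over an Elasticsearch-style analysis config, not the CirrusSearch code (which is PHP). The function name `upgrade_lowercase` and the config shape are assumptions, and so is the choice to still add `icu_normalizer` after a kept lowercase filter. The `language` parameter with values `greek`, `irish`, and `turkish` is how Elasticsearch's lowercase token filter selects language-specific behavior.

```python
import copy

# Languages for which the lowercase filter has special behavior.
LANGUAGE_SPECIFIC = {"greek", "irish", "turkish"}


def upgrade_lowercase(config: dict) -> dict:
    """Return a copy of config with 'lowercase' upgraded to ICU normalization.

    Hypothetical helper: a language-specific lowercase filter is kept and
    ICU normalization runs after it (an assumption of this sketch).
    """
    new_config = copy.deepcopy(config)
    lowercase_def = new_config.get("filter", {}).get("lowercase", {})
    keep_lowercase = lowercase_def.get("language") in LANGUAGE_SPECIFIC

    for analyzer in new_config.get("analyzer", {}).values():
        if "filter" not in analyzer:
            continue
        upgraded = []
        for name in analyzer["filter"]:
            if name == "lowercase":
                if keep_lowercase:
                    upgraded.append("lowercase")
                upgraded.append("icu_normalizer")
            else:
                upgraded.append(name)
        analyzer["filter"] = upgraded
    return new_config


greek = {
    "filter": {"lowercase": {"type": "lowercase", "language": "greek"}},
    "analyzer": {"plain": {"tokenizer": "standard", "filter": ["lowercase"]}},
}
print(upgrade_lowercase(greek)["analyzer"]["plain"]["filter"])
```

With a config whose lowercase filter has no `language` setting, the same call would simply replace "lowercase" with "icu_normalizer".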

Change 494846 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add Greek empty-token filter and keep lang-specific lowercasing

https://gerrit.wikimedia.org/r/494846