Use the icu_folding filter if available instead of asciifolding
Closed, Resolved · Public

Description

This is an issue for non-Latin wikis: the default asciifolding filter we use only folds Latin (ASCII) characters. For some languages, such as Greek, it is useless.
Some language-specific analyzers may provide better folding, but those are not used by autocomplete/nearmatch searches.

We cannot integrate icu_folding as is because it lacks the preserve_original option. This option makes the filter also emit the unmodified token at the same position, which allows more precise searches when the query includes diacritics.
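To illustrate what preserve_original does, here is a minimal Python sketch. It is not the ICU folding algorithm: it approximates folding by stripping Unicode combining marks and lowercasing, and the function names `fold` and `fold_preserving_original` are hypothetical, chosen only for this example.

```python
import unicodedata


def fold(token: str) -> str:
    """Approximate folding: drop combining marks, then lowercase.

    Assumption: this stands in for a real folding filter; ICU folding
    covers many more cases than diacritic removal.
    """
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).lower()


def fold_preserving_original(tokens):
    """Return (position, term) pairs for a token stream.

    Each token yields its folded form; if folding changed the token,
    the original is emitted too, at the same position.
    """
    out = []
    for position, token in enumerate(tokens):
        folded = fold(token)
        out.append((position, folded))
        if folded != token:
            # Same position as the folded term, so phrase matching still works.
            out.append((position, token))
    return out
```

With both terms indexed at the same position, a query without diacritics can match the folded term, while a query that includes diacritics can still match the original form exactly.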

NOTE: ICU folding can be enabled today only for the completion suggester (because preserve_original is not needed by the completion suggester)

(currently enabled on Greek Wikipedia)
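The constraint described above can be sketched as a small selection rule. This is an illustration only, not the CirrusSearch implementation: the function `folding_filter` and its two flags are hypothetical, while `icu_folding`, `asciifolding`, and `preserve_original` are the filter and option names mentioned in this task.

```python
def folding_filter(icu_available: bool, needs_preserve_original: bool) -> dict:
    """Pick a folding token filter definition for an analysis chain.

    Assumption: icu_folding has no preserve_original option, so it is
    only chosen when the caller does not need the original token
    (for example, the completion suggester).
    """
    if icu_available and not needs_preserve_original:
        return {"type": "icu_folding"}
    # Fall back to ASCII folding, which can keep the original token.
    return {"type": "asciifolding", "preserve_original": needs_preserve_original}
```

For instance, `folding_filter(True, False)` selects icu_folding for the completion suggester case, while `folding_filter(True, True)` falls back to asciifolding with preserve_original enabled.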

Event Timeline

dcausse created this task. Jun 14 2016, 5:29 PM
Restricted Application added a project: Discovery. Jun 14 2016, 5:29 PM
Restricted Application added subscribers: Zppix, Aklapper.
dcausse updated the task description. Jun 14 2016, 5:30 PM
debt triaged this task as Low priority. Jun 14 2016, 10:13 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.

Thanks, David, for inviting me to comment.

This is a serious issue: the search and autocomplete functionality pays only lip service to the linguistic wealth of the languages covered by a number of wiktionaries.

For example, with Ancient Greek, if one searches for αιων on https://en.wiktionary.org, neither the autocomplete nor the search results show the Ancient Greek equivalent (the only difference being the diacritics). However, the entry https://en.wiktionary.org/wiki/αἰών exists, and it can only be found by entering the exact diacritics, which is cumbersome and not very user-friendly (it requires installing a polytonic keyboard layout).

And I suspect it is not only Ancient Greek that has this type of issue.

By contrast, the previously used Lucene-based system with MWSearch did a great job out of the box by stripping all diacritics. It seems strange to me that Elastica, despite using Lucene at its core, has trouble replicating the same behaviour.

Kindly refer to this discussion too, which sparked interest in this issue:

https://www.mediawiki.org/wiki/Topic:T5orggml6rprzngy

debt added a subscriber: debt.

moving to backlog board for now

Change 310359 had a related patch set uploaded (by DCausse):
Add support for ICU folding

https://gerrit.wikimedia.org/r/310359

Change 310359 merged by jenkins-bot:
Add support for ICU folding

https://gerrit.wikimedia.org/r/310359

Deskana closed this task as Resolved. Dec 9 2016, 3:24 PM