Use the icu_folding filter if available instead of asciifolding
Closed, Resolved, Public

Assigned To

Authored By

	dcausse
	Jun 14 2016, 5:29 PM

Description

This is an issue for non-Latin wikis: the default asciifolding filter we use only folds Latin characters to their ASCII equivalents. For some languages, such as Greek, it is useless.
Some language-specific analyzers may provide better folding, but those are not used by autocomplete/nearmatch searches.

We cannot integrate icu_folding as is because it lacks the preserve_original option. This option allows the filter to emit the unmodified token at the same position, allowing more precise searches when the query includes diacritics.
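To illustrate what preserve_original buys us, here is a minimal Python sketch. It is not the Lucene or ICU implementation: `fold` is a hypothetical stand-in that approximates Unicode-aware folding by stripping combining marks, and `fold_tokens` is a hypothetical helper that mimics a token filter emitting `(position, term)` pairs. The order of the two terms at a given position is arbitrary here.

```python
import unicodedata


def fold(token):
    # Hypothetical stand-in for a Unicode-aware folding filter:
    # decompose, drop combining marks, recompose, then case-fold.
    decomposed = unicodedata.normalize("NFD", token)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base).casefold()


def fold_tokens(tokens, preserve_original=False):
    # Emit (position, term) pairs. With preserve_original, the unmodified
    # token is also emitted at the same position whenever folding changed it.
    out = []
    for position, token in enumerate(tokens):
        folded = fold(token)
        out.append((position, folded))
        if preserve_original and folded != token:
            out.append((position, token))
    return out


# Both the folded and the original form are indexed at position 0:
# [(0, 'αιων'), (0, 'αἰών')]
print(fold_tokens(["αἰών"], preserve_original=True))
```

With both terms at the same position, a query for αιων matches through the folded term, while a query that includes the diacritics can still match the original term exactly.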

NOTE: ICU folding can be enabled today only for the completion suggester (because the completion suggester does not need preserve_original).

(currently enabled on Greek Wikipedia)
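For reference, the difference between the two filters at the Elasticsearch settings level could look roughly like the sketch below. The `asciifolding` type with its `preserve_original` option and the `icu_folding` type (from the analysis-icu plugin) are real filter types; the filter name `folding`, the analyzer name `plain_folded`, and the helper `analysis_settings` are hypothetical and are not the names CirrusSearch uses.

```python
import json


def analysis_settings(use_icu):
    # Hypothetical helper: choose the folding filter definition.
    if use_icu:
        # icu_folding requires the analysis-icu plugin and has no
        # preserve_original option of its own.
        folding = {"type": "icu_folding"}
    else:
        folding = {"type": "asciifolding", "preserve_original": True}
    return {
        "analysis": {
            "filter": {"folding": folding},
            "analyzer": {
                # Illustrative analyzer; names are assumptions.
                "plain_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "folding"],
                }
            },
        }
    }


print(json.dumps(analysis_settings(use_icu=True), indent=2))
```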

Details

	Subject	Repo	Branch	Lines +/-
	Add support for ICU folding	mediawiki/extensions/CirrusSearch	master	+417 -20


Related Objects

Status	Assigned	Task
Resolved	debt	T132637 Lack of diacritic folding in e.g. Ancient Greek
Resolved	dcausse	T137830 Use the icu_folding filter if available instead of asciifolding
Resolved	dcausse	T138749 Add a generic preserve original token filter to the extra plugin

Event Timeline

dcausse created this task. Jun 14 2016, 5:29 PM

Restricted Application added a project: Discovery-ARCHIVED. Jun 14 2016, 5:29 PM

Restricted Application added subscribers: Zppix, Aklapper.

dcausse updated the task description. Jun 14 2016, 5:30 PM

dcausse added a parent task: T132637: Lack of diacritic folding in e.g. Ancient Greek. Jun 14 2016, 6:07 PM

debt triaged this task as Low priority. Jun 14 2016, 10:13 PM

debt moved this task from needs triage to This Quarter on the Discovery-Search board.

Thanks, David, for inviting me to comment.

This is a serious issue, as the search and autocomplete functionality pays only lip service to the linguistic wealth of the languages covered by a number of wiktionaries.

For example, with Ancient Greek, if one searches for αιων on https://en.wiktionary.org, neither the autocomplete nor the search results display the Ancient Greek equivalent (the only difference being the diacritics). However, the entry https://en.wiktionary.org/wiki/αἰών exists, and it can only be found if one enters the exact diacritics, which is cumbersome and not very user-friendly (it requires installing a polytonic keyboard layout).

And I suspect it is not only Ancient Greek that has this type of issue.

By contrast, the previously used Lucene system with MWSearch did a great job out of the box by stripping all diacritics. Indeed, it seems strange to me that Elastica, despite using Lucene at its core, has trouble replicating the same behaviour.

Kindly refer to this discussion too, which sparked the interest in this issue:

https://www.mediawiki.org/wiki/Topic:T5orggml6rprzngy

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search. Jun 21 2016, 10:13 PM

dcausse created subtask T138749: Add a generic preserve original token filter to the extra plugin. Jun 27 2016, 9:52 AM

dcausse mentioned this in T139575: EPIC: Plan to enable BM25 on fulltext search. Jul 7 2016, 9:37 AM

debt closed subtask T138749: Add a generic preserve original token filter to the extra plugin as Resolved. Jul 21 2016, 4:15 PM