
Completion suggester ignores stress marks
Closed, Resolved, Public

Description

From the feedback page for the Completion suggester:

"In the Greek language, we use a stress mark (tonos) on vowels to show which syllable is stressed. The normal search of Wikipedia returns articles that differ from the entered string only on stress, which is the expected behaviour. For example, if you enter ανεμος the first search result is άνεμος. However, the CompletionSuggester does not suggest άνεμος or any other word that starts with ά. It does, however, return ανεμοδαρμένος, ανεμούριο, ανεμοβλογιά, ανεμοστρόβιλος and several other words that start with an alpha without a stress mark. Rentzepopoulos (talk) 11:22, 8 March 2016 (UTC)"

I can confirm that the completion suggester beta feature is not taking these stress marks into consideration when presenting possible matches.

Potentially related task: T75605

Event Timeline

Deskana subscribed.

Putting this into Discovery-Search (Current work) for investigation, as the user indicates this is potentially a regression in the completion suggester.

Per Nik's analysis in T75605, this is a limitation of the ASCII folding code we use for prefixsearch and the completion suggester.

Folding of ά is handled by the Greek stemmer and is thus supported in full-text search.
We do not use a stemmer for search as you type (both prefixsearch and the new completion suggester).

We could switch to ICU Folding, which seems to handle a wider range of the Unicode space.
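
To illustrate the difference, here is a rough Python sketch. It is not the Lucene code: `ascii_only_fold` only approximates a folding that is limited to Latin base letters, and `strip_marks` is a simple stand-in for a wider Unicode folding (real ICU folding does considerably more). The function names and the `suggest` helper are made up for this example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Stand-in for a wide Unicode folding: drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def ascii_only_fold(text: str) -> str:
    """Approximation of an ASCII-only folding (assumption, not Lucene's table):
    an accented character is folded only when its base letter is ASCII."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        if len(decomposed) > 1 and base.isascii():
            out.append(base)   # e.g. é -> e
        else:
            out.append(ch)     # e.g. ά stays ά, its base α is not ASCII
    return "".join(out)


def suggest(prefix, titles, fold):
    """Hypothetical prefix matcher: compare folded prefix with folded titles."""
    folded_prefix = fold(prefix.lower())
    return [t for t in titles if fold(t.lower()).startswith(folded_prefix)]
```

With the ASCII-only variant, a query for ανεμ cannot match άνεμος because ά is left untouched on the indexed side, which matches the behavior reported in the description. With the mark-stripping variant both sides fold to the same string.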

To sum up, this is not a regression: the old prefixsearch uses the same asciifolding and does not support this behavior either.

It is always difficult to change the analysis chain. I'd propose fixing the completion suggester analysis first and adding an option so we can test on Greek wikis first.
An option would allow us to revert to the old behavior if icu_folding introduces unwanted behavior.
Then, once we are confident that icu_folding is better than asciifolding, we could fix the old prefixsearch analysis config.
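
As a sketch of what such an option could look like, the Python snippet below builds Elasticsearch-style analysis settings and swaps the folding filter based on a flag. The flag name `use_icu_folding`, the analyzer name `suggest_plain`, and the choice of the `keyword` tokenizer are assumptions for illustration, not the actual CirrusSearch configuration. The `icu_folding` filter requires the ICU analysis plugin to be installed.

```python
def build_suggest_analysis(use_icu_folding: bool) -> dict:
    """Return analysis settings for a hypothetical completion analyzer.

    The folding filter is the only part that depends on the flag, so
    turning the flag off restores the previous asciifolding behavior.
    """
    folding_filter = "icu_folding" if use_icu_folding else "asciifolding"
    return {
        "analyzer": {
            # "suggest_plain" is a made-up analyzer name for this example.
            "suggest_plain": {
                "type": "custom",
                "tokenizer": "keyword",  # assumed tokenizer
                "filter": ["lowercase", folding_filter],
            }
        }
    }
```

A per-wiki setting could then pass `True` for Greek wikis only and `False` everywhere else until the new folding has been checked.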

NOTE: fixing the old prefixsearch analysis chain might be more complex, since we share the same asciifolding filter for multiple purposes and icu_folding lacks the preserve_original option that Nik added to asciifolding.

Change 277249 had a related patch set uploaded (by DCausse):
CompletionSuggester: add support for ICU Folding

https://gerrit.wikimedia.org/r/277249

Change 277249 merged by jenkins-bot:
CompletionSuggester: add support for ICU Folding

https://gerrit.wikimedia.org/r/277249

We think this problem should be solved by the above patches, so resolving this task. This should roll out this week, so we'll get quick feedback from users whether it's working or not.

Change 277477 had a related patch set uploaded (by DCausse):
Enable ICU Folding on greek wikipedia

https://gerrit.wikimedia.org/r/277477

Change 277477 merged by jenkins-bot:
Enable ICU Folding on greek wikipedia

https://gerrit.wikimedia.org/r/277477