Ascii folding behaves differently for autocomplete vs. search
Closed, Resolved · Public

Description

Steps to reproduce:

  • Go to pl.wikipedia.org and focus the search box.
  • Start typing "Bedusz".
  • You will see results including "Będusz".
  • Hit <return> to search for the plain Latin term.
  • Your search results will not include local pages which include the non-Latin term.

On en.wikipedia.org for example, where ICU folding is enabled, the autocomplete and actual search return similar results which include the non-Latin matches.

It's likely that the search results will be fixed once we enable ICU folding using UTR#30 rules, but this autocomplete bit was mysterious enough to file as a separate bug.
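The mismatch described above can be illustrated with a small sketch. It assumes a simplified stand-in for accent folding based on Unicode decomposition; this is not the actual `asciifolding` or `icu_folding` filter, and the helper names `fold_accents` and `matches` are hypothetical. Letters with no canonical decomposition (for example "ł") are not folded by this approximation.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose, drop combining marks, lowercase.

    This is only a rough stand-in for an analysis-chain folding filter.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def matches(query: str, title: str, fold: bool) -> bool:
    """Compare a query with a title, with or without accent folding."""
    if fold:
        return fold_accents(query) == fold_accents(title)
    return query.lower() == title.lower()


# Autocomplete-like path: folding is applied, so the plain term matches.
print(matches("Bedusz", "Będusz", fold=True))
# Fulltext-like path without folding: the plain term does not match.
print(matches("Bedusz", "Będusz", fold=False))
```

With folding applied on one code path and not the other, the same input yields a match in autocomplete and no match in search, which is the behaviour reported here.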

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse triaged this task as Medium priority. Jun 21 2019, 1:49 PM
dcausse moved this task from needs triage to elastic / cirrus on the Discovery-Search board.

There is a discrepancy for languages where we do not apply any accent folding. We always apply accent removal (with either asciifolding or icu_folding) on autocomplete and go, but we do not on fulltext.
This is inconsistent; I think we should query all_near_match.asciifolding alongside all_near_match for fulltext queries.
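As a rough sketch of that idea, the query body below targets both fields in a single Elasticsearch `multi_match` clause. This is an illustration only, not the actual CirrusSearch change: the helper `near_match_query` and its `include_folded` parameter are hypothetical, and the real query builder may structure and weight the clause differently.

```python
import json


def near_match_query(term: str, include_folded: bool = True) -> dict:
    """Build a near-match clause over the plain field and, optionally,
    its accent-folded subfield (field names taken from the comment above)."""
    fields = ["all_near_match"]
    if include_folded:
        # Assumption: the folded variant is exposed as a subfield.
        fields.append("all_near_match.asciifolding")
    return {"multi_match": {"query": term, "fields": fields}}


print(json.dumps(near_match_query("Bedusz"), indent=2))
```

Querying both fields keeps exact matches available while letting a plain term such as "Bedusz" also reach titles indexed with diacritics.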

Change 518285 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Query asciifolding near match field to avoid discrepancies

https://gerrit.wikimedia.org/r/518285

Change 518285 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Query asciifolding near match fields to avoid discrepancies

https://gerrit.wikimedia.org/r/518285