
Ascii folding preserve original emits duplicated tokens for non ascii char
Closed, Resolved · Public

Description

I think it's a bug in the Lucene ASCII folding filter: if a token contains a char outside the ASCII range (>= 0x80), the token is emitted twice even when folding leaves it unchanged.
Problems are:

  • frequencies for such terms are doubled
  • we store an extra position in the posting
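
To make the effect on term frequencies concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model and not Lucene's actual code: `fold_to_ascii` and `buggy_folding_preserve` are hypothetical names, and the folding is approximated by stripping combining marks after NFKD normalization.

```python
import unicodedata
from collections import Counter


def fold_to_ascii(token):
    """Rough stand-in for ASCII folding (assumption: NFKD + drop combining marks)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def buggy_folding_preserve(tokens):
    """Model of the reported bug: a token containing any char >= U+0080 is
    emitted twice (folded form, then original), even if folding changed nothing."""
    for tok in tokens:
        yield fold_to_ascii(tok)
        if any(ord(ch) >= 0x80 for ch in tok):
            yield tok  # "preserved original", emitted without comparing to the folded form


tokens = ["café", "привет", "hello"]
emitted = list(buggy_folding_preserve(tokens))
freq = Counter(emitted)
print(emitted)
print(freq["привет"])  # the unchanged Cyrillic token is counted twice
```

In this model "café" legitimately yields two distinct terms, while "привет" yields the same term twice, which is the doubling described in the list above.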

It's really hard to evaluate the impact on scoring and index size, but this problem mostly affects non-Latin wikis, where nearly all the words will be duplicated.
I'll try to fix the issue upstream, but I believe we should also work around it on our side: stop using the preserve_original option on asciifolding and use instead the generic preserve_original filter that was added for ICU folding in the extra plugin.
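
The idea behind the proposed workaround can be sketched as follows. This is a simplified Python model under stated assumptions, not the extra plugin's implementation: it assumes the generic filter records the original token before folding and re-emits it only when folding actually changed it. `fold_to_ascii` and `fold_with_generic_preserve` are hypothetical names.

```python
import unicodedata


def fold_to_ascii(token):
    """Rough stand-in for ASCII folding (assumption: NFKD + drop combining marks)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_with_generic_preserve(tokens):
    """Record the original, fold it, and emit the original only if it differs."""
    for tok in tokens:
        original = tok                # "record" step, before any folding
        folded = fold_to_ascii(tok)   # plain folding, no preserve_original option
        yield folded
        if folded != original:        # assumed comparison that avoids duplicates
            yield original


result = list(fold_with_generic_preserve(["café", "привет", "hello"]))
print(result)
```

With the comparison in place, a token that folding leaves untouched is emitted once, so its frequency is no longer inflated.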

Event Timeline

Restricted Application added a subscriber: Aklapper.

asciifolding_preserve is not enabled by default, so it affects only a few wikis (English, Italian, and frwiki after a reindex).
I think it's still worth adding a workaround, but the effect on index size will be minimal.

Change 313565 had a related patch set uploaded (by DCausse):
Workaround asciifolding issue with preserve_original

https://gerrit.wikimedia.org/r/313565

Change 313565 merged by jenkins-bot:
Workaround asciifolding issue with preserve_original

https://gerrit.wikimedia.org/r/313565

debt triaged this task as Medium priority. Sep 30 2016, 9:53 PM
debt subscribed.

This was released on the train the week of Oct 4 2016 (after the week of no production pushes).