
Ascii folding preserve original emits duplicated tokens for non ascii char
Closed, Resolved, Public

Description

I think it's a bug in the Lucene ASCII folding filter: if a token contains a char above 0x80 (i.e. a non-ASCII char), the token is emitted twice even if folding leaves it unchanged.
Problems are:

  • frequencies for such terms are doubled
  • we store an extra position in the posting
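
To make the symptom concrete, here is a minimal Python sketch that models the behaviour described above. It is not Lucene code: `fold_to_ascii` is a rough, hypothetical stand-in for real ASCII folding (it only strips combining marks), and `buggy_fold_preserve` simply assumes that any token containing a char at or above 0x80 gets its original re-emitted at the same position.

```python
import unicodedata
from collections import Counter


def fold_to_ascii(token: str) -> str:
    """Rough stand-in for ASCII folding (assumption): drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def buggy_fold_preserve(tokens):
    """Model of the reported bug: the original is re-emitted for every
    token with a non-ASCII char, even when folding changed nothing."""
    emitted = []
    for position, token in enumerate(tokens):
        folded = fold_to_ascii(token)
        emitted.append((position, folded))
        if any(ord(ch) >= 0x80 for ch in token):
            # No check that folded != token, hence the duplicate.
            emitted.append((position, token))
    return emitted


# "caf\u00e9" folds to "cafe"; the Cyrillic word has no ASCII folding.
tokens = ["caf\u00e9", "дом", "cat"]
emitted = buggy_fold_preserve(tokens)
freqs = Counter(term for _, term in emitted)
print(emitted)
print(freqs)
```

In this model the Cyrillic token is counted twice at the same position, which is the doubled term frequency and the extra posting entry listed above, while the Latin token with a diacritic legitimately yields two distinct terms.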

It's really hard to evaluate the impact on scoring and index size, but this problem mostly affects non-Latin wikis, where nearly all words will be duplicated.
I'll try to fix the issue upstream, but I believe we should also work around it on our side: instead of using the preserve_original option on asciifolding, use the generic preserve_original filter that was added for ICU folding in the extra plugin.
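
The idea behind a generic preserve-original step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the extra plugin's implementation: the function names are hypothetical, `fold_to_ascii` is a rough stand-in for real folding, and the sketch assumes the generic filter emits the original only when folding actually changed the token.

```python
import unicodedata


def fold_to_ascii(token: str) -> str:
    """Rough stand-in for folding (assumption): drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_with_generic_preserve(tokens):
    """Record the token before folding, then emit the original at the
    same position only if the folded form differs from it."""
    emitted = []
    for position, token in enumerate(tokens):
        recorded = token                 # remember the pre-folding form
        folded = fold_to_ascii(token)    # any folding step could go here
        emitted.append((position, folded))
        if folded != recorded:           # skip unchanged tokens
            emitted.append((position, recorded))
    return emitted


deduplicated = fold_with_generic_preserve(["caf\u00e9", "дом", "cat"])
print(deduplicated)
```

Because the comparison happens outside the folding step, the same wrapper works for any folding filter, which is presumably why a generic filter is attractive here.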

Event Timeline

dcausse created this task. Sep 28 2016, 10:16 AM
Restricted Application added projects: Discovery, Discovery-Search. Sep 28 2016, 10:16 AM
Restricted Application added a subscriber: Aklapper.
dcausse claimed this task.

asciifolding_preserve is not enabled by default, so it affects only a few wikis (English and Italian; frwiki will be affected after a reindex).
I think it's still worth adding a workaround, but the effect on index size will be minimal.

Change 313565 had a related patch set uploaded (by DCausse):
Workaround asciifolding issue with preserve_original

https://gerrit.wikimedia.org/r/313565

Change 313565 merged by jenkins-bot:
Workaround asciifolding issue with preserve_original

https://gerrit.wikimedia.org/r/313565

debt triaged this task as Normal priority. Sep 30 2016, 9:53 PM
debt closed this task as Resolved. Oct 7 2016, 9:13 PM
debt moved this task from Needs review to Done on the Discovery-Search (Current work) board.
debt added a subscriber: debt.

This was released on the train the week of Oct 4 2016 (after the week of no production pushes).