Test effect of adding ascii-folding on French Wikipedia
Closed, Resolved · Public

Description

After the positive results for T142037, David suggested adding ascii-folding for the French Wikipedia. As noted elsewhere, [citation needed], it's common to see queries without diacritics, and enabling ascii-folding would improve matches in those cases.

We can run an analysis similar to the one for T142037: set up the analysis chain as it currently is for French Wikipedia, then modify it to add ascii-folding. We can then count the raw number of new collisions that ascii-folding introduces, and get the same kind of automated estimates of how often similar terms are bucketed together.
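As a rough illustration of what "new collisions" means here, the sketch below compares two toy analyzers and reports the buckets that only exist once folding is added. Everything in it is an assumption made for illustration: `baseline`, `with_folding`, and `new_collisions` are hypothetical names, the real chains run inside Elasticsearch, and stripping combining marks after NFKD is only an approximation of what the actual ascii-folding filter does.

```python
import unicodedata
from collections import defaultdict


def fold(token: str) -> str:
    """Rough stand-in for ascii-folding: strip combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def baseline(token: str) -> str:
    """Hypothetical 'current' chain: lowercasing only."""
    return token.lower()


def with_folding(token: str) -> str:
    """Hypothetical 'new' chain: the baseline plus folding as a final step."""
    return fold(baseline(token))


def buckets(tokens, analyze):
    """Group distinct surface forms by the term the analyzer maps them to."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[analyze(tok)].add(tok)
    return groups


def new_collisions(tokens, old, new):
    """Return new-chain buckets that merge forms the old chain kept apart."""
    old_term = {tok: old(tok) for tok in set(tokens)}
    merged = {}
    for term, forms in buckets(tokens, new).items():
        if len({old_term[form] for form in forms}) > 1:
            merged[term] = forms
    return merged


sample = ["côte", "cote", "Côté", "coté", "élève", "Élève", "eleve", "mur", "mûr"]
collisions = new_collisions(sample, baseline, with_folding)
for term in sorted(collisions):
    print(term, sorted(collisions[term]))
```

A bucket counts as a new collision only if it joins forms that the old chain mapped to different terms, so "élève" and "Élève" on their own would not count, because lowercasing already merges them.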

TJones created this task. Aug 10 2016, 7:07 PM
Restricted Application added a subscriber: Aklapper. Aug 10 2016, 7:07 PM
TJones added a subscriber: debt. Aug 10 2016, 7:07 PM

@debt: David and I think this should be high priority because we're going to be re-indexing everything in a few weeks for BM25. We should do any of these re-indexing–type studies before that. Re-indexing at a later date is possible but may be too much disruption for this smaller change.

debt triaged this task as High priority. Aug 11 2016, 5:10 PM

Highlights:

  • The default French analysis chain unexpectedly does some ascii-folding already, after stemming.
  • Unpacking the default French analysis chain per the Elasticsearch docs leads to different results, but most of the changes are desirable, and the effect size is very small.
  • English and Italian, which have been similarly unpacked to add ascii-folding in the past, include a bit of extra tokenizing help for periods and underscores, which we may want to also do for French—though it does violence to acronyms and may not work with BM25.
  • Ascii-folding itself affects significantly more tokens in French than it does in English (50 times as many, as a percentage, for a 50K-article corpus), which is not entirely a surprise, since many more accented characters are regularly used in French.
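The "percentage of tokens affected" comparison in the last highlight can be sketched as follows. This is a toy measurement under stated assumptions, not the method used in the analysis: `percent_changed` is a hypothetical helper, whitespace splitting stands in for the real tokenizer, and mark-stripping stands in for the real ascii-folding filter (which also handles characters such as æ or ß that this approximation leaves alone).

```python
import unicodedata


def fold(token: str) -> str:
    """Approximate ascii-folding by removing combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def percent_changed(tokens) -> float:
    """Percentage of tokens whose form changes when folded."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    changed = sum(1 for tok in tokens if fold(tok) != tok)
    return 100.0 * changed / len(tokens)


# Two tiny hand-written samples; a real comparison would use a large corpus.
french = "l'élève a traversé la forêt près du château".split()
english = "the student crossed the forest near the castle".split()

print(f"French:  {percent_changed(french):.1f}% of tokens changed")
print(f"English: {percent_changed(english):.1f}% of tokens changed")
```

The two sentences are far too small to say anything about the real ratio; they only show how the per-language percentages would be computed before being compared.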

Full details.

Recommendations: turn on ascii-folding as a final step for French, add an initial custom filter for Turkish İ, and explore disabling word_break_helper for enwiki and itwiki.
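For readers unfamiliar with what such a configuration looks like, here is a minimal sketch of an unpacked French analyzer with folding as the last step, written as a Python dict that serializes to Elasticsearch index settings. It is not the configuration that was committed. The filter chain follows the rebuilt `french` analyzer described in the Elasticsearch documentation (with the optional keyword-marker filter left out); the names `dotted_I_fix` and `french_folded` are hypothetical, and the elision article list and the definition given for `asciifolding_preserve` are assumptions that should be checked against the Elasticsearch version in use.

```python
import json

# Sketch only: the analyzer and char_filter names are illustrative assumptions.
FRENCH_ANALYSIS = {
    "char_filter": {
        # Initial step: map Turkish dotted capital İ (U+0130) to plain I
        # before tokenizing and lowercasing.
        "dotted_I_fix": {"type": "mapping", "mappings": ["\u0130=>I"]},
    },
    "filter": {
        "french_elision": {
            "type": "elision",
            "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                         "jusqu", "quoiqu", "lorsqu", "puisqu"],
        },
        "french_stop": {"type": "stop", "stopwords": "_french_"},
        "french_stemmer": {"type": "stemmer", "language": "light_french"},
        # Folding that also keeps the unfolded token.
        "asciifolding_preserve": {"type": "asciifolding",
                                  "preserve_original": True},
    },
    "analyzer": {
        "french_folded": {
            "type": "custom",
            "char_filter": ["dotted_I_fix"],
            "tokenizer": "standard",
            "filter": [
                "french_elision",
                "lowercase",
                "french_stop",
                "french_stemmer",
                "asciifolding_preserve",  # final step, after stemming
            ],
        },
    },
}

settings_json = json.dumps({"analysis": FRENCH_ANALYSIS},
                           ensure_ascii=False, indent=2)
print(settings_json)
```

Putting the folding filter last means the stemmer still sees the accented forms, and the character filter runs before tokenizing, which is what "initial" and "final step" refer to in the recommendations.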

@TJones : You moved this to "Needs Review", but who should review it?

dcausse added a comment. Edited Aug 31 2016, 9:43 AM

I've reviewed the conclusions of this analysis and I agree.
A few comments:

  • I'm not entirely sure but I think that the problem with the Turkish İ should be resolved as part of T137830. But since it looks like a regression compared to the previous configuration we should add it.
  • Concerning word_break_helper: removing it can cause a regression on T42612 and T64733, so I'd suggest working on a proper fix at the analysis level (some thoughts here: T143541).
  • Concerning "Unexpected Differences in Unpacked Analysis Chain": I'd like to understand why unpacking the analysis chain causes differences; it may be a bug in the Elasticsearch documentation. @TJones, are you sure that these "Unexpected Differences in Unpacked Analysis Chain" were seen before adding ascii-folding? Or did I misunderstand something, and should this section be renamed "Unexpected Differences in Unpacked Analysis Chain with an additional ascii folding filter"?

Should we create a followup task to implement the recommendations?

TJones moved this task from Needs review to Done on the Discovery-Search (Current work) board.

@dcausse and I reviewed the unpacked French analysis chain, and it looks as though I unpacked it correctly. I double-checked the default French vs. Unpacked vs. Unpacked+Ascii-Folding chains on a couple of the characters that showed the effects: ς & ϐ. They were unchanged by the default French analysis, folded to σ and β when French was unpacked, and folded and duplicated when asciifolding_preserve was added, so the analysis is right.

We discovered a semi-bug in the implementation of word_break_helper: it's added to the default custom analysis chain in the code, but that doesn't affect the built-in language analyzers. Thus, it is not currently enabled for French, so we're going to leave it off for now and deal with that as a separate issue. I'll add comments in the appropriate place in the code to reduce confusion until it is addressed directly.

Changes will be committed under T144429.

debt closed this task as Resolved. Sep 1 2016, 8:52 PM

Thanks, @TJones and @dcausse !