
Test effect of adding ascii-folding on French Wikipedia
Closed, Resolved, Public

Description

After the positive results for T142037, David suggested adding ascii-folding for the French Wikipedia. As noted elsewhere, [citation needed], it's common to see queries without diacritics, and enabling ascii-folding would improve matches in those cases.

We can run an analysis similar to the one for T142037: set up the analysis chain as it currently is for French Wikipedia, then modify it to add ascii-folding. From that we can determine the raw number of new collisions caused by introducing ascii-folding, and get the same kind of automated estimates of how often similar terms are bucketed together.
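As a rough illustration of what "new collisions" means here, the sketch below groups distinct tokens by their folded form and reports the groups that only merge because of folding. It is a simplification under stated assumptions: `fold` approximates ascii-folding by stripping combining marks after Unicode decomposition (the real Elasticsearch filter covers more cases, such as œ → oe), the sample tokens are made up, and a real comparison would use the output of the full analysis chain, including stemming.

```python
import unicodedata
from collections import defaultdict


def fold(token: str) -> str:
    """Approximate ascii-folding: decompose, then drop combining marks.

    This is a stand-in, not the Elasticsearch implementation.
    """
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def new_collisions(tokens):
    """Map each folded form to the distinct tokens that now share it.

    Only folded forms shared by more than one distinct token are returned.
    """
    buckets = defaultdict(set)
    for tok in set(tokens):
        buckets[fold(tok)].add(tok)
    return {key: sorted(group) for key, group in buckets.items() if len(group) > 1}


# Hypothetical sample tokens, for illustration only.
sample = ["élève", "eleve", "élevé", "côte", "cote", "côté", "maison"]
collisions = new_collisions(sample)
for folded, originals in sorted(collisions.items()):
    print(folded, "<-", originals)
```

Counting the entries in `collisions` gives the raw number of new buckets; how many of those merges are desirable still needs the kind of review done for T142037.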

Event Timeline

@debt: David and I think this should be high priority because we're going to be re-indexing everything in a few weeks for BM25. We should do any of these re-indexing–type studies before that. Re-indexing at a later date is possible but may be too much disruption for this smaller change.

debt triaged this task as High priority. Aug 11 2016, 5:10 PM

Highlights:

  • The default French analysis chain unexpectedly does some ascii-folding already, after stemming.
  • Unpacking the default French analysis chain per the Elasticsearch docs leads to different results, but most of the changes are desirable, and the effect size is very small.
  • English and Italian, which have been similarly unpacked to add ascii-folding in the past, include a bit of extra tokenizing help for periods and underscores, which we may want to also do for French—though it does violence to acronyms and may not work with BM25.
  • Ascii-folding itself affects significantly more tokens in French than in English (50 times as many, as a percentage, for a 50K article corpus), which is not entirely a surprise, since many more accented characters are regularly used in French.
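For readers unfamiliar with "unpacking", here is a sketch of what an unpacked French analyzer with a final folding step might look like, written as a Python dict of index settings. It follows the rebuilt French analyzer example in the Elasticsearch docs as I recall it, with the optional keyword-marker filter left out. The analyzer name `french_unpacked` is hypothetical, and the configuration actually committed for this task may differ.

```python
# Sketch only: the analyzer name is hypothetical, and the optional
# keyword_marker filter from the Elasticsearch example is omitted.
french_settings = {
    "analysis": {
        "filter": {
            "french_elision": {
                "type": "elision",
                "articles": [
                    "l", "m", "t", "qu", "n", "s", "j", "d", "c",
                    "jusqu", "quoiqu", "lorsqu", "puisqu",
                ],
            },
            "french_stop": {"type": "stop", "stopwords": "_french_"},
            "french_stemmer": {"type": "stemmer", "language": "light_french"},
            # Folding that also keeps the original token, so accented
            # queries can still match the accented form exactly.
            "asciifolding_preserve": {
                "type": "asciifolding",
                "preserve_original": True,
            },
        },
        "analyzer": {
            "french_unpacked": {
                "tokenizer": "standard",
                "filter": [
                    "french_elision",
                    "lowercase",
                    "french_stop",
                    "french_stemmer",
                    "asciifolding_preserve",  # folding as the final step
                ],
            }
        },
    }
}

chain = french_settings["analysis"]["analyzer"]["french_unpacked"]["filter"]
print(" -> ".join(chain))
```

Placing the folding filter last means stop-word removal and stemming see the accented forms, which matches the recommendation to fold as a final step.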

Full details.

Recommendations: turn on ascii-folding as a final step for French, add an initial custom filter for Turkish İ, and explore disabling word_break_helper for enwiki and itwiki.
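The task text does not spell out why Turkish İ needs its own filter; the full write-up has the details. One likely reason, stated here as an assumption, is that generic Unicode lowercasing turns İ (U+0130) into "i" followed by a combining dot above (U+0307), so the token no longer matches a plain "i". The sketch below shows that behaviour in Python and a hypothetical pre-mapping step that avoids it; the function names are illustrative and not taken from the search code.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def naive_lowercase(text: str) -> str:
    """Plain Unicode lowercasing, with no special handling of İ."""
    return text.lower()


def lowercase_with_dotted_i_fix(text: str) -> str:
    """Map İ to a plain I first (hypothetical pre-filter), then lowercase."""
    return text.replace(DOTTED_CAPITAL_I, "I").lower()


word = "\u0130stanbul"  # "İstanbul"
print([hex(ord(ch)) for ch in naive_lowercase(word)[:2]])
print(lowercase_with_dotted_i_fix(word))
```

In an analysis chain, the equivalent would be a character filter that runs before tokenizing and lowercasing, which is what "initial" suggests in the recommendation.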

@TJones : You moved this to "Needs Review", but who should review it?

I've reviewed the conclusions of this analysis and I agree.
A few comments:

  • I'm not entirely sure but I think that the problem with the Turkish İ should be resolved as part of T137830. But since it looks like a regression compared to the previous configuration we should add it.
  • Concerning word_break_helper: removing it can cause a regression on T42612 and T64733. I'd suggest working on a proper fix at the analysis level (some thoughts here: T143541).
  • Concerning "Unexpected Differences in Unpacked Analysis Chain": I'd like to understand why unpacking the analysis chain causes differences; it may be a bug in the Elasticsearch documentation. @TJones are you sure that these "Unexpected Differences in Unpacked Analysis Chain" were seen before adding ascii-folding? Or did I misunderstand something, and should the section be titled "Unexpected Differences in Unpacked Analysis Chain with an additional ascii-folding filter"?

Should we create a followup task to implement the recommendations?

@dcausse and I reviewed the unpacked French analysis chain, and it looks as though I unpacked it correctly. I double checked the default French vs Unpacked vs Unpacked+Ascii-Folding on a couple of the characters that showed the effects: ς & ϐ. They were unchanged by the French analysis, folded to σ and β when French was unpacked, and folded and duplicated when asciifolding_preserve was added, so the analysis is right.

We discovered a semi-bug in the implementation of word_break_helper: it's added to the default custom analysis chain in the code, but that doesn't affect the built-in language analyzers. Thus, it is not currently enabled for French, so we're going to leave it off for now and deal with that as a separate issue. I'll add comments in the appropriate place in the code to reduce confusion until it is addressed directly.
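For context on the trade-off with word_break_helper, here is a minimal sketch of the effect of treating periods and underscores as word breaks. It is an approximation: the function below only mirrors the behaviour described in this task (periods and underscores become spaces), and whitespace splitting stands in for the real tokenizer, which handles these strings somewhat differently.

```python
# Characters treated as word breaks in this sketch (per the task description).
_BREAK_TABLE = str.maketrans({".": " ", "_": " "})


def word_break_helper(text: str) -> str:
    """Replace periods and underscores with spaces before tokenizing."""
    return text.translate(_BREAK_TABLE)


def tokenize(text: str) -> list:
    """Stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


query = "N.A.S.A. file_name.txt"
print(tokenize(query))                     # without the helper
print(tokenize(word_break_helper(query)))  # with the helper
```

With the helper, "file_name.txt" becomes three searchable tokens, but the acronym is split into single letters, which is the damage to acronyms mentioned in the highlights.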

Changes will be committed under T144429.