Test effect of adding ascii-folding on French Wikipedia
Closed, Resolved · Public
Assigned To

Authored By

	TJones
	Aug 10 2016, 7:07 PM

Description

After the positive results for T142037, David suggested adding ascii-folding for the French Wikipedia. As noted elsewhere, [citation needed], it's common to see queries without diacritics, and enabling ascii-folding would improve matches in those cases.

We can run an analysis similar to the one for T142037: set up the analysis chain as it currently is for French Wikipedia, then modify it to add ascii-folding. From that we can determine the raw number of new collisions caused by introducing ascii-folding, and get the same kind of automated estimates of similar terms being bucketed together.
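As a rough illustration of what counting "new collisions" could look like, here is a minimal sketch. It is not the tooling used for T142037: the `fold` function is a stand-in that only strips combining marks (real ascii-folding handles many more characters), and the sample tokens and function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def fold(token):
    """Rough stand-in for ascii-folding: strip combining marks after NFD.

    Assumption: this only approximates the real filter, which also maps
    characters such as ligatures that have no combining-mark decomposition.
    """
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def new_collisions(analyzed_tokens):
    """Group distinct pre-folding tokens by their folded form.

    Returns only the buckets where folding merged two or more tokens
    that were distinct before folding.
    """
    buckets = defaultdict(set)
    for tok in analyzed_tokens:
        buckets[fold(tok)].add(tok)
    return {key: sorted(group) for key, group in buckets.items() if len(group) > 1}


# Hypothetical output of the current (pre-folding) analysis chain.
sample = ["eleve", "élevé", "élève", "cote", "côté", "maison"]
merged = new_collisions(sample)
print(len(merged), "folded forms now cover more than one original token")
```

Each bucket in `merged` is a candidate for manual review: some merges are desirable (a query typed without accents), others may conflate unrelated words.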

Related Objects

Status	Assigned	Task
Resolved	• Deskana	T139575 EPIC: Plan to enable BM25 on fulltext search
Resolved	TJones	T141216 ÿ in Spécial:IndexPages search
Resolved	TJones	T142620 Test effect of adding ascii-folding on French Wikipedia

Event Timeline

TJones created this task. · Aug 10 2016, 7:07 PM

Restricted Application added a subscriber: Aklapper. · Aug 10 2016, 7:07 PM

@debt: David and I think this should be high priority because we're going to be re-indexing everything in a few weeks for BM25. We should do any of these re-indexing–type studies before that. Re-indexing at a later date is possible but may be too much disruption for this smaller change.

debt added a subtask: T141216: ÿ in Spécial:IndexPages search. · Aug 11 2016, 4:36 PM

debt removed a subtask: T141216: ÿ in Spécial:IndexPages search.

debt added a parent task: T141216: ÿ in Spécial:IndexPages search.

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search. · Aug 11 2016, 4:38 PM

debt added a parent task: T139575: EPIC: Plan to enable BM25 on fulltext search.

debt triaged this task as High priority. · Aug 11 2016, 5:10 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. · Aug 11 2016, 5:15 PM

TJones mentioned this in T141216: ÿ in Spécial:IndexPages search. · Aug 12 2016, 7:27 PM

EBernhardson subscribed. · Aug 15 2016, 10:12 PM

TJones mentioned this in T41501: Merging Unicode similar-looking characters in internal search (apostrophes, "x" and "×", etc). · Aug 19 2016, 4:32 PM

Highlights:

The default French analysis chain unexpectedly does some ascii-folding already, after stemming.
Unpacking the default French analysis chain per the Elasticsearch docs leads to different results, but most of the changes are desirable, and the effect size is very small.
English and Italian, which have been similarly unpacked to add ascii-folding in the past, include a bit of extra tokenizing help for periods and underscores, which we may want to also do for French—though it does violence to acronyms and may not work with BM25.
Ascii-folding affects significantly more tokens in French than in English (50 times as many, as a percentage, for a 50K-article corpus), which is not entirely a surprise, since many more accented characters are regularly used in French.

Full details.

Recommendations: turn on ascii-folding as a final step for French, add an initial custom filter for Turkish İ, explore disabling word_break_helper for enwiki and itwiki!
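For readers unfamiliar with "unpacking", the following sketch shows roughly what a French analyzer rebuilt as a custom chain, with ascii-folding appended as the final step, might look like as index settings. It follows the general pattern from the Elasticsearch documentation, but the analyzer name `french_folded` is hypothetical, the optional keyword-marker filter is omitted, the Turkish İ filter is not shown, and this is not the configuration that was actually committed.

```python
import json

# Hypothetical index settings: the built-in "french" analyzer rebuilt as a
# custom analyzer, with an asciifolding filter appended as the last step.
FRENCH_UNPACKED = {
    "analysis": {
        "filter": {
            "french_elision": {
                "type": "elision",
                "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c",
                             "jusqu", "quoiqu", "lorsqu", "puisqu"],
            },
            "french_stop": {"type": "stop", "stopwords": "_french_"},
            "french_stemmer": {"type": "stemmer", "language": "light_french"},
            # preserve_original keeps the unfolded token alongside the folded one.
            "asciifolding_preserve": {
                "type": "asciifolding",
                "preserve_original": True,
            },
        },
        "analyzer": {
            "french_folded": {  # analyzer name is an assumption
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "french_elision",
                    "lowercase",
                    "french_stop",
                    "french_stemmer",
                    "asciifolding_preserve",  # folding runs last, after stemming
                ],
            }
        },
    }
}

print(json.dumps(FRENCH_UNPACKED, indent=2, ensure_ascii=False))
```

Placing the folding filter last means the stemmer and stop-word filter still see the accented forms they were designed for.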

@TJones : You moved this to "Needs Review", but who should review it?

I've reviewed the conclusions of this analysis and I agree.
A few comments:

I'm not entirely sure but I think that the problem with the Turkish İ should be resolved as part of T137830. But since it looks like a regression compared to the previous configuration we should add it.
Concerning the word_break_helper: removing it can cause a regression on T42612 and T64733, so I'd suggest working on a proper fix at the analysis level (some thoughts here: T143541).
Concerning "Unexpected Differences in Unpacked Analysis Chain": I'd like to understand why unpacking the analysis chain causes differences; it may be a bug in the Elasticsearch documentation. @TJones are you sure that these "Unexpected Differences in Unpacked Analysis Chain" were seen before adding ascii-folding? Or did I misunderstand something, and should this section be renamed "Unexpected Differences in Unpacked Analysis Chain with an additional ascii folding filter"?

Should we create a followup task to implement the recommendations?

@dcausse and I reviewed the unpacked French analysis chain, and it looks as though I unpacked it correctly. I double checked the default French vs Unpacked vs Unpacked+Ascii-Folding on a couple of the characters that showed the effects: ς & ϐ. They were unchanged by the French analysis, folded to σ and β when French was unpacked, and folded and duplicated when asciifolding_preserve was added, so the analysis is right.

We discovered a semi-bug in the implementation of word_break_helper: it's added to the default custom analysis chain in the code, but that doesn't affect the built-in language analyzers. Thus, it is not currently enabled for French, and so we're going to leave it off for now and deal with that as a separate issue. I'll add comments in the appropriate place in the code to reduce confusion until it is addressed directly.
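To make the acronym concern concrete, here is a small sketch of what a word_break_helper-style character mapping does. It assumes the helper maps periods and underscores to spaces before tokenizing, and it uses whitespace splitting as a stand-in for the real tokenizer, so the details will differ from the actual implementation.

```python
# Assumed mapping: periods and underscores become spaces before tokenizing.
WORD_BREAK_MAP = str.maketrans({"_": " ", ".": " "})


def word_break_helper(text):
    """Apply the assumed character mapping to raw text."""
    return text.translate(WORD_BREAK_MAP)


def tokenize(text):
    """Whitespace tokenizer, a simplification of the standard tokenizer."""
    return text.split()


for raw in ["snake_case.txt", "N.A.S.A."]:
    without_helper = tokenize(raw)
    with_helper = tokenize(word_break_helper(raw))
    print(raw, without_helper, with_helper)
```

The first input shows the benefit (file names and identifiers split into searchable parts); the second shows the cost, since the acronym is broken into single letters.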

Changes will be committed under T144429.

Thanks, @TJones and @dcausse !

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. · Oct 6 2016, 6:04 PM

TJones mentioned this in T104814: Appropriately ignore diacritics for German-language wikis. · Jun 27 2017, 9:38 PM

TJones mentioned this in T75605: No normalization for ancient greek accents in searches. · Jul 6 2017, 8:52 PM
