Maniphest T142037

Test effect of re-ordering kstem and asciifolding on English Wikipedia
Closed, Resolved
Public
Assigned To

Authored By

	TJones
	Aug 3 2016, 8:10 PM

Description

While discussing T141216, David and I came to the conclusion that asciifolding (removing accents and reducing other non-ASCII characters to ASCII) should happen before stemming on English Wikipedia.

The stemmer (kstem) ignores words with diacritics, which means, for example, that a search for cafetières does not find instances of cafetière. Performing asciifolding before stemming would solve this kind of problem.
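The ordering problem can be illustrated with a minimal sketch. Neither function below is the real Elasticsearch filter: `fold` is a rough stand-in for asciifolding that only strips combining marks, and `toy_stem` is a hypothetical stemmer that strips a plural "s" and, like kstem as described above, leaves words with diacritics untouched.

```python
import unicodedata


def fold(token: str) -> str:
    """Rough stand-in for asciifolding: strip combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical stemmer (NOT kstem): strips a plural "s", skips non-ASCII words."""
    if not token.isascii():
        return token  # mimics the described behavior of ignoring words with diacritics
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def stem_then_fold(token: str) -> str:
    """Assumed current order: stemming first, folding second."""
    return fold(toy_stem(token))


def fold_then_stem(token: str) -> str:
    """Proposed order: folding first, stemming second."""
    return toy_stem(fold(token))


for word in ("cafetière", "cafetières"):
    print(word, stem_then_fold(word), fold_then_stem(word))
```

With stemming first, the plural keeps its "s" and ends up as a different term from the singular; with folding first, both forms reduce to the same term.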

Diacritics are generally not distinctive in English, and are never required. Resume and résumé are the only common words I can think of where diacritics distinguish one from the other, though resume is also regularly used for the "CV" meaning. Fiancé(e) and née are possibly more often used with accents, but not always. Similarly façade. Even with foreign terms and place names, English speakers are likely to drop accents and consider the names the same; e.g., Düsseldorf/Dusseldorf, Hồ Chí Minh/Ho Chi Minh, etc.

However, rather than relying solely on our intuition, and to be sure the change won't generate a huge amount of noise, we propose gathering a large chunk of text from English Wikipedia, tokenizing it, and running it through the Elasticsearch analysis chain twice: once with the current configuration and once with the proposed configuration. This will give us a sense of the percentage of terms whose stemmed forms change, and of the number of new potential stemming clashes the change generates.
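A minimal sketch of that comparison, under stated assumptions: the two chains are passed in as plain Python callables (in the real experiment they would be the Elasticsearch analyzers), `fold` and `toy_stem` are hypothetical stand-ins rather than asciifolding and kstem, and a "new clash" is taken to mean a group of terms that share an output under the new chain but did not all share one under the old chain.

```python
import unicodedata
from collections import defaultdict


def fold(token):
    """Rough stand-in for asciifolding: strip combining marks after NFKD."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Hypothetical stemmer (NOT kstem): strips a plural "s", skips non-ASCII words."""
    if token.isascii() and token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def compare_chains(tokens, old_chain, new_chain):
    """Report how many distinct terms change and which groups newly merge."""
    types = set(tokens)
    changed = {t for t in types if old_chain(t) != new_chain(t)}

    # Group distinct terms by their output under the new chain.
    new_groups = defaultdict(set)
    for t in types:
        new_groups[new_chain(t)].add(t)

    # A group is a new clash if its members had more than one old output.
    merged = {
        out: sorted(members)
        for out, members in new_groups.items()
        if len({old_chain(t) for t in members}) > 1
    }
    fraction = len(changed) / len(types) if types else 0.0
    return {"changed_fraction": fraction, "merged_groups": merged}


sample = ["cafetière", "cafetières", "cafetiere", "resume", "résumé", "house", "houses"]
report = compare_chains(
    sample,
    old_chain=lambda t: fold(toy_stem(t)),  # assumed current order
    new_chain=lambda t: toy_stem(fold(t)),  # proposed order
)
print(report)
```

On this toy sample only the cafetière group is reported as newly merged; on real data, each merged group would still need to be reviewed by hand to decide whether it is a desirable conflation or noise.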

If the experiment is a success, then we would roll out the new analysis chain (which requires re-indexing) along with the upgrade to BM25 (which also requires re-indexing, so let's only do it once). See T139575 for more on BM25.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Deskana	T139575 EPIC: Plan to enable BM25 on fulltext search
		Resolved		TJones	T142037 Test effect of re-ordering kstem and asciifolding on English Wikipedia

Event Timeline

TJones created this task. Aug 3 2016, 8:10 PM

Restricted Application added a subscriber: Aklapper. Aug 3 2016, 8:10 PM

I think this sounds like a great idea, @TJones and @dcausse!

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search. Aug 8 2016, 4:42 PM

TJones claimed this task. Aug 8 2016, 7:27 PM

TJones moved this task from "Incoming" to "not in use - please delete" on the Discovery-Search (Current work) board. Aug 9 2016, 1:17 PM

The write-up is available here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia

David suggests implementing this now because the code is only used at re-indexing time. So it won't get used until the BM25 re-indexing, but it'll be out there and ready.

TJones moved this task from "not in use - please delete" to "Needs review" on the Discovery-Search (Current work) board. Aug 10 2016, 6:59 PM

TJones mentioned this in T142620: Test effect of adding ascii-folding on French Wikipedia. Aug 10 2016, 7:07 PM