
Test effect of re-ordering kstem and asciifolding on English Wikipedia
Closed, Resolved (Public)

Description

Discussing T141216, David and I came to the conclusion that asciifolding (removing accents and reducing other non-ascii characters to ascii) should happen before stemming on English Wikipedia.

The stemmer (kstem) ignores words with diacritics, which means, for example, that a search for cafetières does not find instances of cafetière. Performing asciifolding before stemming would solve this kind of problem.
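To illustrate the ordering problem, here is a minimal Python sketch. Both functions are stand-ins and not the real filters: `ascii_fold` only strips combining marks via Unicode decomposition (the real asciifolding filter covers more characters), and `toy_stem` is a hypothetical stemmer that, like the behavior described above, skips any word containing non-ASCII characters and otherwise just strips a plural "s".

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Rough stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for kstem: ignores non-ASCII words, strips a plural 's'."""
    if not token.isascii():
        return token  # mimics the stemmer skipping words with diacritics
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def stem_then_fold(token: str) -> str:
    """Current order: stemming first, folding second."""
    return ascii_fold(toy_stem(token))


def fold_then_stem(token: str) -> str:
    """Proposed order: folding first, stemming second."""
    return toy_stem(ascii_fold(token))


for word in ("cafetière", "cafetières"):
    print(word, stem_then_fold(word), fold_then_stem(word))
```

With stemming first, the plural keeps its "s" because the stemmer never touches it; with folding first, both forms end up as the same term.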

Diacritics are generally not distinctive in English, and are never required. Resume and résumé are the only common words I can think of where diacritics distinguish one from the other, though resume is also regularly used for the "CV" meaning. Fiancé(e) and née are possibly more often used with accents, but not always; the same goes for façade. Even with foreign terms and place names, English speakers are likely to drop accents and consider the names the same; e.g., Düsseldorf/Dusseldorf, Hồ Chí Minh/Ho Chi Minh, etc.

However, rather than relying solely on our intuition, and to be sure the change won't generate a huge amount of noise, we propose gathering a large chunk of text from English Wikipedia, tokenizing it, and running it through the Elasticsearch analysis chain twice: once with the current configuration and once with the proposed new configuration. This will give us a sense of the percentage of terms whose stemmed forms change, and the number of new potential stemming clashes that will be generated.
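The comparison could be sketched roughly as follows in Python. This is an illustration under assumptions, not the tooling actually used: `old_chain` and `new_chain` are hypothetical callables that map a token to its analyzed form (in practice they would wrap calls to the two Elasticsearch configurations), and a "new clash" is defined here as a group of distinct words that share an analyzed form under the new chain but did not all share one under the old chain.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def compare_chains(
    tokens: Iterable[str],
    old_chain: Callable[[str], str],
    new_chain: Callable[[str], str],
) -> Dict[str, object]:
    """Summarize how analyzed forms differ between two analysis chains."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("need at least one token")
    types = set(tokens)  # distinct words in the sample

    # Words whose analyzed form differs between the two chains.
    changed = {t for t in types if old_chain(t) != new_chain(t)}

    # Group distinct words by their analyzed form under the new chain.
    new_groups: Dict[str, set] = defaultdict(set)
    for t in types:
        new_groups[new_chain(t)].add(t)

    # A group is a new clash if its members had more than one old analyzed form.
    new_clashes: Dict[str, List[str]] = {
        stem: sorted(group)
        for stem, group in new_groups.items()
        if len({old_chain(t) for t in group}) > 1
    }

    return {
        "pct_tokens_changed": 100 * sum(t in changed for t in tokens) / len(tokens),
        "pct_types_changed": 100 * len(changed) / len(types),
        "new_clashes": new_clashes,
    }
```

Reporting both token-level and type-level percentages is a design choice: frequent words dominate the token count, while the type count shows how much of the vocabulary is affected.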

If the experiment is a success, then we would roll out the new analysis chain (which requires re-indexing) along with the upgrade to BM25 (which also requires re-indexing, so let's only do it once). See T139575 for more on BM25.
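For orientation, the re-ordering amounts to changing the position of two token filters in the analyzer definition. The Python dict below is only a sketch of Elasticsearch custom-analyzer settings: the analyzer name `text_fold_then_stem` is hypothetical, and the real English Wikipedia analysis chain contains more filters than the three shown here.

```python
# Sketch of index settings with asciifolding placed before kstem.
# "text_fold_then_stem" is a made-up analyzer name for illustration only.
proposed_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "text_fold_then_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Token filters run in list order: fold first, then stem.
                    "filter": ["lowercase", "asciifolding", "kstem"],
                }
            }
        }
    }
}


def filter_order(settings: dict, analyzer: str) -> list:
    """Return the token filter list of the named analyzer."""
    return settings["settings"]["analysis"]["analyzer"][analyzer]["filter"]


filters = filter_order(proposed_settings, "text_fold_then_stem")
print(filters.index("asciifolding") < filters.index("kstem"))
```

Because analysis settings like these are applied when documents are indexed, a change in filter order only takes effect for documents indexed after the change, which is why re-indexing is needed.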

Event Timeline

debt triaged this task as Medium priority. (Aug 3 2016, 10:58 PM)
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added subscribers: dcausse, debt.

I think this sounds like a great idea, @TJones and @dcausse!

The write-up is available here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia

David suggests implementing this now because the code is only used at re-indexing time. So it won't get used until the BM25 re-indexing, but it'll be out there and ready.

@dcausse seems happy with the results, and Kevin looked them over, too, so I think we're done.