Don't index empty strings caused by ICU Folding in Elasticsearch
Closed, Resolved · Public


Under certain circumstances, ICU Folding reduces certain characters to nothing, which generates empty tokens. These empty strings get indexed, so all of the original characters that map to the empty string are conflated at search time.

Part of the problem comes from the tokenizer splitting on some of these characters. Other times, the characters are used on their own, e.g., in lists in a Wikipedia article on a character set. Articles with lists of these characters also score well, since they contain lots of hits!
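The conflation can be illustrated with a toy model. The sketch below is not ICU Folding: `toy_fold` is a hypothetical stand-in that simply drops modifier letters and combining marks (Unicode categories Lm and Mn), and the inverted index is a plain dict. That is enough to show how distinct characters end up under the same empty index key.

```python
import unicodedata
from collections import defaultdict


def toy_fold(token: str) -> str:
    """Hypothetical stand-in for ICU Folding: drop Lm and Mn characters."""
    return "".join(
        ch for ch in token if unicodedata.category(ch) not in ("Lm", "Mn")
    )


# Toy documents, already tokenized. U+02C8 and U+02D0 are modifier letters
# from the list in this task; "cafe\u0301" is an ordinary word with an accent.
docs = {1: ["\u02c8"], 2: ["\u02d0"], 3: ["cafe\u0301"]}

# Toy inverted index: folded token -> set of document ids.
index = defaultdict(set)
for doc_id, tokens in docs.items():
    for tok in tokens:
        index[toy_fold(tok)].add(doc_id)

# Both modifier letters fold to "", so searching for either one
# matches both documents.
print(sorted(index[toy_fold("\u02c8")]))  # [1, 2]
```

In this toy model the ordinary word still folds to a useful key (`cafe`), while the two unrelated modifier letters share the empty key.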

Adding a length token filter to the analysis chain (probably automatically, whenever ICU Folding is enabled) will keep the empty strings from being indexed. The characters should still be matchable via the plain field, though we should verify that preserve_original works as expected when the folded token is eliminated.
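A minimal sketch of what such a chain might look like as Elasticsearch index settings, written here as a Python dict. The `length` token filter (with its `min` parameter) and `icu_folding` (from the analysis-icu plugin) are existing Elasticsearch filters; the names `remove_empty` and `text_folded`, the tokenizer, and the rest of the filter list are assumptions for illustration, not the configuration CirrusSearch actually generates.

```python
import json

settings = {
    "analysis": {
        "filter": {
            # Hypothetical filter name; drops any token shorter than
            # one character, i.e. the empty string.
            "remove_empty": {"type": "length", "min": 1},
        },
        "analyzer": {
            # Hypothetical analyzer name and chain.
            "text_folded": {
                "type": "custom",
                "tokenizer": "standard",
                # The length filter has to run after icu_folding,
                # since folding is what produces the empty tokens.
                "filter": ["lowercase", "icu_folding", "remove_empty"],
            },
        },
    }
}

print(json.dumps(settings, indent=2))
```

The only ordering constraint this sketch relies on is that the length filter comes after the folding step.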

Below is a list of characters that were highlighted in a search for just one of them on English Wikipedia.

  • ˍ ˎ ˏ ˬ ̂ ̟ ̣ ̤ ֪ ۥ ့ ် ႋ ႌ ႍ ႏ ႚ ႛ ᩶ ᩷ ᩸ ᩹ ᩺ ᩻ ᩼ ᱹ ⸯ ㅤ ꜜ ꜞ ꜟ ꞈ ꪿ ꫀ ꫁ ꫂ ﳲ ﳳ ﳴ ︀ ︎ ˈˈ ˌˌ ːː ː̀ ː̃ ـִ ـْ ـٓ ــٰ ـً‎ ــِ ʽ ̃ ̇ ʹ ּ ߴ ็ ้ ๊ ์ ๎ ່ ້ ໊ ໋ ႈ ႉ ႊ ៊ ់ ៌ ៍ ៎ ៏ ័ ៑ ្ ៓ ៝ ᩵ ᵎ ꙿ ꜝ ️ ﹹ ﹻ ﹿ ـً ـٌ ـٍ ـ ߴ‎ ߵ‎ ߺ‎ ーー ̶ ๋ ႇ ៉ ᴻ ﹱ ﹷ ﹽ ـَ ـُ ـِ ˁ ˑ ́ ̰ ՙ ่ ໌ ᴯ ー ˊ ˮ ̅ ــ ̸ ˌ ॱ ʹ ʺ ˋ ️⃣ ⃣ ʻ ـ ˉ ˈ ˆ ʾ ˇ ʼ ʿ ー ː ˀ

Event Timeline

debt triaged this task as Medium priority. Apr 26 2018, 5:14 PM
TJones renamed this task from Don't index empty strings in Elasticsearch to Don't index empty strings caused by ICU Folding in Elasticsearch. Aug 30 2018, 3:59 PM

Patch incoming in a moment. Analysis across smallish samples from ~10 relevant Wikipedias shows little impact on most text, and no unintended consequences. The full write-up is on MediaWiki.

Change 456428 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Strip Empty Tokens After ICU Folding

Change 456428 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Strip Empty Tokens After ICU Folding