Don't index empty strings caused by ICU Folding in Elasticsearch
Under certain circumstances, ICU Folding can reduce certain characters to empty tokens. These empty strings get indexed, so all of the original characters that map to the empty string are conflated at search time.

Part of the problem comes from the tokenizer splitting on some of these characters. In other cases, the characters appear on their own, e.g., in lists in a Wikipedia article on a character set. Articles with lists of these characters also score well, since they contain lots of hits!

Adding a length token filter to the analysis chain (probably automatically whenever ICU Folding is enabled) will keep the empty strings from being indexed. The characters should still be matchable via the plain field, though we should verify that preserve_original works as expected when the folded token is eliminated.
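As a rough illustration of the idea, here is a minimal sketch of index analysis settings that place a length filter with a minimum of 1 directly after icu_folding, so zero-length tokens are dropped before indexing. The names remove_empty, text_folded, and build_analysis_settings are hypothetical, and the standard tokenizer is an assumption. The real analysis chain has more steps, and this sketch does not cover the preserve_original interaction mentioned above.

```python
import json


def build_analysis_settings(min_length=1):
    """Build a sketch of Elasticsearch analysis settings (hypothetical names).

    The "length" token filter removes tokens shorter than `min`, so with
    min=1 any token that icu_folding reduced to "" is dropped.
    """
    return {
        "analysis": {
            "filter": {
                # Hypothetical filter name; "length" is the built-in filter type.
                "remove_empty": {"type": "length", "min": min_length},
            },
            "analyzer": {
                # Hypothetical analyzer name; the tokenizer choice is an assumption.
                "text_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Order matters: fold first, then drop the empty results.
                    "filter": ["icu_folding", "remove_empty"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_analysis_settings(), indent=2))
```

The length filter has to come after icu_folding in the filter list. Placed before it, the filter would see the unfolded, non-empty tokens and remove nothing.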

Below is a list of characters that were highlighted in a search for just one of them on English Wikipedia.

  • ˍ ˎ ˏ ˬ ̂ ̟ ̣ ̤ ֪ ۥ ့ ် ႋ ႌ ႍ ႏ ႚ ႛ ᩶ ᩷ ᩸ ᩹ ᩺ ᩻ ᩼ ᱹ ⸯ ㅤ ꜜ ꜞ ꜟ ꞈ ꪿ ꫀ ꫁ ꫂ ﳲ ﳳ ﳴ ︀ ︎ ˈˈ ˌˌ ːː ː̀ ː̃ ـִ ـْ ـٓ ــٰ ـً‎ ــِ ʽ ̃ ̇ ʹ ּ ߴ ็ ้ ๊ ์ ๎ ່ ້ ໊ ໋ ႈ ႉ ႊ ៊ ់ ៌ ៍ ៎ ៏ ័ ៑ ្ ៓ ៝ ᩵ ᵎ ꙿ ꜝ ️ ﹹ ﹻ ﹿ ـً ـٌ ـٍ ـ ߴ‎ ߵ‎ ߺ‎ ーー ̶ ๋ ႇ ៉ ᴻ ﹱ ﹷ ﹽ ـَ ـُ ـِ ˁ ˑ ́ ̰ ՙ ่ ໌ ᴯ ー ˊ ˮ ̅ ــ ̸ ˌ ॱ ʹ ʺ ˋ ️⃣ ⃣ ʻ ـ ˉ ˈ ˆ ʾ ˇ ʼ ʿ ー ː ˀ

Patch incoming in a moment. Analysis across smallish samples from ~10 relevant Wikipedias shows little impact on most text, and no unintended consequences. The full write-up is on MediaWiki.

