Don't index empty strings caused by ICU Folding in Elasticsearch
Closed, Resolved · Public

Description

When ICU Folding is enabled, a token made up entirely of certain characters can be folded to an empty token. The empty strings get indexed, and the original characters that map to empty strings are all conflated at search time.

Part of the problem comes from the tokenizer splitting on some of these characters. Other times, the characters are used independently, e.g., in lists in a Wikipedia article on a character set. The articles with lists of these characters score well, too, since they have lots of hits!
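The conflation can be illustrated with a toy sketch. Note the assumptions: `toy_fold` is a hypothetical stand-in that only strips combining marks and modifier letters, not the real ICU Folding filter, and `build_index` is a hypothetical in-memory inverted index, not anything in Elasticsearch or CirrusSearch.

```python
import unicodedata
from collections import defaultdict


def toy_fold(token: str) -> str:
    """Hypothetical stand-in for ICU Folding (NOT the real filter).

    Decomposes the token, drops combining marks (Mn) and modifier
    letters (Lm), and lowercases whatever is left.
    """
    decomposed = unicodedata.normalize("NFKD", token)
    kept = [ch for ch in decomposed
            if unicodedata.category(ch) not in {"Mn", "Lm"}]
    return "".join(kept).lower()


def build_index(docs: dict, drop_empty: bool = False) -> dict:
    """Map each folded token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            folded = toy_fold(token)
            if drop_empty and len(folded) < 1:
                continue  # what a length filter with min=1 would do
            index[folded].add(doc_id)
    return dict(index)


# Illustrative documents: A and B each contain one "empty-folding" character.
docs = {
    "A": ["\u02c8"],      # MODIFIER LETTER VERTICAL LINE
    "B": ["\u0302"],      # COMBINING CIRCUMFLEX ACCENT
    "C": ["caf\u00e9"],   # folds to "cafe"
}

conflated = build_index(docs)                 # "" maps to both A and B
cleaned = build_index(docs, drop_empty=True)  # no "" key at all
```

In this sketch, a search for either character folds to the empty string and matches both A and B, which is the conflation described above. Dropping zero-length tokens removes that shared key.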

Adding a length token filter to the analysis chain (probably automatically whenever ICU Folding is enabled) will keep the empty strings from being indexed. The characters should still be matchable via the plain field, though we should verify that preserve_original works as expected when the folded token is eliminated.
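As a rough sketch of the idea, the helper below inserts a `length` filter with `min` set to 1 directly after `icu_folding` in every analyzer that uses it. The helper `add_empty_token_filter`, the filter name `remove_empty`, and the sample analyzers are assumptions made for illustration. They are not taken from the CirrusSearch patch, and the sketch does not model how preserve_original wraps folding.

```python
import copy


def add_empty_token_filter(analysis: dict,
                           filter_name: str = "remove_empty") -> dict:
    """Return a copy of `analysis` with a length filter after icu_folding.

    Hypothetical helper: `analysis` is the "analysis" section of
    Elasticsearch index settings, expressed as a plain dict.
    """
    result = copy.deepcopy(analysis)
    changed = False
    for analyzer in result.get("analyzer", {}).values():
        filters = analyzer.get("filter", [])
        if "icu_folding" in filters and filter_name not in filters:
            # Place the length filter right after folding so that
            # zero-length tokens are dropped before indexing.
            filters.insert(filters.index("icu_folding") + 1, filter_name)
            changed = True
    if changed:
        # Elasticsearch "length" token filter: keep tokens with >= 1 character.
        result.setdefault("filter", {})[filter_name] = {
            "type": "length",
            "min": 1,
        }
    return result


# Illustrative settings: one folded analyzer and one plain analyzer.
analysis = {
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "icu_folding"],
        },
        "plain": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase"],
        },
    }
}

patched = add_empty_token_filter(analysis)
```

Only analyzers that already include `icu_folding` are touched, so the plain analyzer keeps its original chain and can still match the raw characters.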

Below is a list of characters that were highlighted in a search for just one of them on English Wikipedia.

  • ˍ ˎ ˏ ˬ ̂ ̟ ̣ ̤ ֪ ۥ ့ ် ႋ ႌ ႍ ႏ ႚ ႛ ᩶ ᩷ ᩸ ᩹ ᩺ ᩻ ᩼ ᱹ ⸯ ㅤ ꜜ ꜞ ꜟ ꞈ ꪿ ꫀ ꫁ ꫂ ﳲ ﳳ ﳴ ︀ ︎ ˈˈ ˌˌ ːː ː̀ ː̃ ـִ ـْ ـٓ ــٰ ـً‎ ــِ ʽ ̃ ̇ ʹ ּ ߴ ็ ้ ๊ ์ ๎ ່ ້ ໊ ໋ ႈ ႉ ႊ ៊ ់ ៌ ៍ ៎ ៏ ័ ៑ ្ ៓ ៝ ᩵ ᵎ ꙿ ꜝ ️ ﹹ ﹻ ﹿ ـً ـٌ ـٍ ـ ߴ‎ ߵ‎ ߺ‎ ーー ̶ ๋ ႇ ៉ ᴻ ﹱ ﹷ ﹽ ـَ ـُ ـِ ˁ ˑ ́ ̰ ՙ ่ ໌ ᴯ ー ˊ ˮ ̅ ــ ̸ ˌ ॱ ʹ ʺ ˋ ️⃣ ⃣ ʻ ـ ˉ ˈ ˆ ʾ ˇ ʼ ʿ ー ː ˀ

Event Timeline

TJones created this task. Apr 18 2018, 9:23 PM
Restricted Application added a subscriber: Aklapper. Apr 18 2018, 9:23 PM
EBjune moved this task from needs triage to Up Next on the Discovery-Search board. Apr 26 2018, 5:14 PM
debt triaged this task as Normal priority. Apr 26 2018, 5:14 PM
TJones claimed this task. Aug 28 2018, 8:32 PM
TJones moved this task from Up Next to Current work on the Discovery-Search board.
TJones renamed this task from Don't index empty strings in Elasticsearch to Don't index empty strings caused by ICU Folding in Elasticsearch. Aug 30 2018, 3:59 PM

Patch incoming in a moment. Analysis across smallish samples from ~10 relevant Wikipedias shows little impact on most text, and no unintended consequences. The full write-up is on MediaWiki.

Change 456428 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Strip Empty Tokens After ICU Folding

https://gerrit.wikimedia.org/r/456428

Change 456428 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Strip Empty Tokens After ICU Folding

https://gerrit.wikimedia.org/r/456428

debt closed this task as Resolved. Sep 13 2018, 9:21 PM