Maniphest T192502

Don't index empty strings caused by ICU Folding in Elasticsearch
Closed, ResolvedPublic
Assigned To

Authored By

	TJones
	Apr 18 2018, 9:23 PM

Description

Under certain circumstances, ICU Folding maps certain characters to the empty string, generating empty tokens. These empty strings get indexed, so all of the original characters that map to an empty string are conflated at search time: a search for any one of them matches all of them.

Part of the problem comes from the tokenizer splitting on some of these characters; at other times, the characters appear on their own, e.g., in lists in a Wikipedia article on a character set. Articles with lists of these characters score well, too, since they contain lots of hits!
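To make the conflation concrete, here is a toy sketch in Python. It is only an illustration: `toy_fold` is a hypothetical stand-in that drops combining marks and modifier characters by Unicode category, which is not what ICU Folding actually does, and the tiny inverted index is not how Elasticsearch stores tokens. The point is just that once several distinct characters fold to the empty string, they all share one index key.

```python
import unicodedata
from collections import defaultdict

# Hypothetical stand-in for a folding filter (NOT real ICU Folding):
# decompose, then drop combining marks (Mn) and modifier characters (Lm, Sk).
def toy_fold(token):
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) not in {"Mn", "Lm", "Sk"}
    )

# Toy inverted index: folded token -> set of document ids.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[toy_fold(token)].add(doc_id)  # empty strings are indexed too
    return index

def search(index, query):
    return index.get(toy_fold(query), set())

docs = {
    1: ["\u02c8"],     # MODIFIER LETTER VERTICAL LINE
    2: ["\u02cc"],     # MODIFIER LETTER LOW VERTICAL LINE
    3: ["caf\u00e9"],  # an ordinary word, folded to "cafe"
}
index = build_index(docs)

# A query for a third character (COMBINING CIRCUMFLEX ACCENT) also folds
# to "", so it matches documents 1 and 2, which never contained it.
hits = search(index, "\u0302")
```

In this sketch, documents 1 and 2 end up under the same empty-string key, which mirrors the behavior described above.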

Adding a length token filter to the analysis chain (probably automatically whenever ICU Folding is enabled) will keep the empty strings from being indexed. The characters should still be matchable via the plain field, though we should verify that preserve_original works as expected when the folded token is eliminated.
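As a rough sketch of what "automatically adding a length filter" could look like, the Python helper below rewrites an Elasticsearch-style analysis settings dict so that a `length` token filter with `min: 1` follows `icu_folding` in every analyzer that uses it. The helper name `add_empty_token_filter`, the filter name `remove_empty`, and the example analyzer names are assumptions made for illustration; this is not the CirrusSearch patch itself, and it only handles analyzers that reference the folding filter by the literal name passed in.

```python
import copy

def add_empty_token_filter(analysis,
                           folding_filter="icu_folding",
                           length_filter_name="remove_empty"):
    """Return a copy of `analysis` with a min-length-1 filter after folding.

    `remove_empty` is a hypothetical filter name chosen for this sketch.
    """
    result = copy.deepcopy(analysis)
    touched = False
    for analyzer in result.get("analyzer", {}).values():
        filters = analyzer.get("filter", [])
        if folding_filter in filters and length_filter_name not in filters:
            # Place the length filter directly after the folding filter so
            # tokens that folded to "" are dropped before indexing.
            filters.insert(filters.index(folding_filter) + 1, length_filter_name)
            touched = True
    if touched:
        # Elasticsearch's built-in `length` token filter drops tokens
        # shorter than `min` characters.
        result.setdefault("filter", {})[length_filter_name] = {
            "type": "length",
            "min": 1,
        }
    return result

# Example settings (analyzer names are illustrative only).
settings = {
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "icu_folding"],
        },
        "plain": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase"],
        },
    }
}
updated = add_empty_token_filter(settings)
```

The plain analyzer is left untouched here, which matches the expectation that the original characters remain matchable via the plain field.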

Below is a list of characters that were highlighted in a search for just one of them on English Wikipedia.

ˍ ˎ ˏ ˬ ̂ ̟ ̣ ̤ ֪ ۥ ့ ် ႋ ႌ ႍ ႏ ႚ ႛ ᩶ ᩷ ᩸ ᩹ ᩺ ᩻ ᩼ ᱹ ⸯ ㅤ ꜜ ꜞ ꜟ ꞈ ꪿ ꫀ ꫁ ꫂ ﳲ ﳳ ﳴ ︀ ︎ ˈˈ ˌˌ ːː ː̀ ː̃ ـִ ـْ ـٓ ــٰ ـً‎ ــِ ʽ ̃ ̇ ʹ ּ ߴ ็ ้ ๊ ์ ๎ ່ ້ ໊ ໋ ႈ ႉ ႊ ៊ ់ ៌ ៍ ៎ ៏ ័ ៑ ្ ៓ ៝ ᩵ ᵎ ꙿ ꜝ ️ ﹹ ﹻ ﹿ ـً ـٌ ـٍ ـ ߴ‎ ߵ‎ ߺ‎ ーー ̶ ๋ ႇ ៉ ᴻ ﹱ ﹷ ﹽ ـَ ـُ ـِ ˁ ˑ ́ ̰ ՙ ่ ໌ ᴯ ｰ ˊ ˮ ̅ ــ ̸ ˌ ॱ ʹ ʺ ˋ ️⃣ ⃣ ʻ ـ ˉ ˈ ˆ ʾ ˇ ʼ ʿ ー ː ˀ

Details

	Subject	Repo	Branch	Lines +/-
	Strip Empty Tokens After ICU Folding	mediawiki/extensions/CirrusSearch	master	+1 K -18

Related Objects

Mentioned In: T203117: Greek language analysis generates unexpected empty tokens

Event Timeline

TJones created this task. Apr 18 2018, 9:23 PM

Restricted Application added a subscriber: Aklapper. Apr 18 2018, 9:23 PM

EBjune moved this task from needs triage to Up Next on the Discovery-Search board. Apr 26 2018, 5:14 PM

debt triaged this task as Medium priority. Apr 26 2018, 5:14 PM

TJones claimed this task. Aug 28 2018, 8:32 PM

TJones moved this task from Up Next to Current work on the Discovery-Search board.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones mentioned this in T203117: Greek language analysis generates unexpected empty tokens. Aug 29 2018, 9:03 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Aug 29 2018, 9:21 PM

TJones renamed this task from Don't index empty strings in Elasticsearch to Don't index empty strings caused by ICU Folding in Elasticsearch. Aug 30 2018, 3:59 PM

Patch incoming in a moment. Analysis of smallish samples from ~10 relevant Wikipedias shows little impact on most text and no unintended consequences. The full write-up is on MediaWiki.

Change 456428 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/extensions/CirrusSearch@master] Strip Empty Tokens After ICU Folding

https://gerrit.wikimedia.org/r/456428

gerritbot added a project: Patch-For-Review. Aug 30 2018, 6:53 PM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Aug 30 2018, 6:57 PM

Change 456428 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Strip Empty Tokens After ICU Folding

https://gerrit.wikimedia.org/r/456428

TJones moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Sep 4 2018, 7:38 PM

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)). Sep 4 2018, 8:00 PM

debt closed this task as Resolved. Sep 13 2018, 9:21 PM