
Smarter handling of acronyms for word_break_helper in language analyzers
Open, High, Public

Description

word_break_helper is a character filter used in Elasticsearch. It defines additional word breaks, including period (.), underscore (_), and parentheses. These make sense in many cases: word_break_helper is split into three words, wikipedia.org is split in two, and (unintentionally)poorly(parenthesized) words are split up.
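To make the behavior concrete, here is a minimal Python sketch of what a mapping-style character filter like this does before tokenization. It is only an illustration, not the production configuration: the set of break characters is limited to the ones named above, and the names `WORD_BREAK_CHARS`, `word_break_helper`, and `tokenize` are hypothetical. Whitespace splitting stands in for the real tokenizer.

```python
# Hypothetical stand-in for the word_break_helper mapping. Only the characters
# named in the description are included; the real filter may map more.
WORD_BREAK_CHARS = "._()"
_BREAK_TABLE = str.maketrans({ch: " " for ch in WORD_BREAK_CHARS})


def word_break_helper(text):
    """Replace each word-break character with a space."""
    return text.translate(_BREAK_TABLE)


def tokenize(text):
    """Apply the character filter, then split on whitespace.

    Whitespace splitting is an assumption standing in for the real tokenizer.
    """
    return word_break_helper(text).split()


print(tokenize("word_break_helper"))  # ['word', 'break', 'helper']
print(tokenize("wikipedia.org"))      # ['wikipedia', 'org']
print(tokenize("(unintentionally)poorly(parenthesized)"))
# ['unintentionally', 'poorly', 'parenthesized']
print(tokenize("U.S."))               # ['U', 'S']
```

The last call shows the downside: an initialism is broken into single-letter tokens.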

However, word_break_helper also splits up acronyms and initialisms (U.S., N.A.S.A., etc.), which isn't really helpful. There is also a quirk of Elasticsearch: you can configure a character filter with a language analyzer, but it doesn't do anything. So we have "disabled" word_break_helper in some cases when we've unpacked analyzers, including French and Swedish, because it wasn't actually doing anything before unpacking.

In general, word_break_helper seems to be a net positive, and should be enabled everywhere, but we need to figure out a way to do smarter handling of acronyms (which we probably don't want to split up) and domain names (which we probably do want to split up).
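One possible direction, sketched in Python purely for discussion, is to collapse acronyms before applying the usual word breaks. This is not a decided design. The acronym pattern (two or more single letters, each followed by a period) and the names `ACRONYM_RE`, `smart_word_break`, and `tokenize` are assumptions made for this example, and whitespace splitting again stands in for the real tokenizer.

```python
import re

# Hypothetical break characters, limited to those named in the description.
_BREAK_TABLE = str.maketrans({ch: " " for ch in "._()"})

# Assumption: an acronym is two or more single letters, each followed by a
# period, starting at a word boundary (e.g. "U.S." or "N.A.S.A.").
ACRONYM_RE = re.compile(r"\b(?:[^\W\d_]\.){2,}")


def smart_word_break(text):
    """Drop the periods inside acronyms, then apply ordinary word breaks."""
    collapsed = ACRONYM_RE.sub(lambda m: m.group(0).replace(".", ""), text)
    return collapsed.translate(_BREAK_TABLE)


def tokenize(text):
    """Filter the text, then split on whitespace (stand-in tokenizer)."""
    return smart_word_break(text).split()


print(tokenize("List of U.S. Highways"))  # ['List', 'of', 'US', 'Highways']
print(tokenize("N.A.S.A."))               # ['NASA']
print(tokenize("en.wikipedia.org"))       # ['en', 'wikipedia', 'org']
```

With this pattern, "U.S." and "US" produce the same token, while domain names are still split. It is deliberately naive: a host name made of single-letter labels, such as a.b.com, would be treated as an acronym, so a real implementation would need more careful rules.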

There is also word_break_helper_source_text, which is used elsewhere and includes colons (:). I haven't looked into how it's used or whether it's helpful as currently configured.

Event Timeline

debt triaged this task as Medium priority. Jul 20 2017, 5:16 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Hey all,

Just wanted to note that I encountered this issue today when searching for "List of US Highways"—"US" wasn't recognized as "U.S." for the purposes of the article title.

TJones raised the priority of this task from Medium to High. Aug 27 2020, 9:46 PM
TJones renamed this task from "Investigate disabling or modifying word_break_helper in language analyzers." to "Smarter handling of acronyms for word_break_helper in language analyzers". Mon, Aug 8, 8:32 PM
TJones updated the task description.