
Investigate disabling or modifying word_break_helper in language analyzers.
Open, Normal, Public

Description

word_break_helper is a character filter used in Elasticsearch. It defines additional word breaks, including period (.), underscore (_), and parentheses. These make sense in many cases: word_break_helper is split into three words, wikipedia.org is split in two, and (unintentionally)poorly(parenthesized) words are split up.

However, word_break_helper also splits up acronyms and initialisms (U.S., N.A.S.A., etc.), which isn't really helpful. There is also a quirk of Elasticsearch: you can configure a character filter with a language analyzer, but the filter has no effect. So we have effectively "disabled" word_break_helper in some cases where we've unpacked analyzers, including French and Swedish, because it wasn't actually doing anything before unpacking.
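To make the behavior described above concrete, here is a minimal Python sketch that imitates the effect of word_break_helper. It is not the actual Elasticsearch configuration: it assumes the filter simply maps each break character to a space and that a whitespace-style tokenizer runs afterward. The names WORD_BREAK_MAP and simulate_word_break_helper are made up for illustration.

```python
# Hypothetical stand-in for word_break_helper: every break character
# (period, underscore, parentheses) is mapped to a space.
WORD_BREAK_MAP = str.maketrans({".": " ", "_": " ", "(": " ", ")": " "})


def simulate_word_break_helper(text: str) -> list[str]:
    """Apply the character mapping, then split on whitespace."""
    return text.translate(WORD_BREAK_MAP).split()


if __name__ == "__main__":
    samples = [
        "word_break_helper",
        "wikipedia.org",
        "(unintentionally)poorly(parenthesized)",
        "U.S.",
        "N.A.S.A.",
    ]
    for sample in samples:
        print(sample, "->", simulate_word_break_helper(sample))
```

Under these assumptions the first three samples split usefully, while U.S. and N.A.S.A. fall apart into single letters, which is the unhelpful case.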

Options include some combination of:

  • removing the period from word_break_helper
  • no longer making word_break_helper a default (which often doesn't do anything)
  • removing it from the English config (and assessing its utility in other places where it is enabled)
  • figuring out a way to do smarter handling of acronyms (which we probably don't want to split up) and domain names (which we probably do want to split up)
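
As a rough illustration of the last option, the Python sketch below collapses acronym-like sequences (single letters separated by periods) before applying the word breaks, so that domain names are still split. This is only one possible heuristic, not a proposed implementation: the regular expression, the function name smart_tokens, and the whitespace tokenization are all assumptions, and in Elasticsearch the equivalent step would presumably have to be a pattern-based character filter that runs before word_break_helper.

```python
import re

# Hypothetical heuristic: a single letter followed by one or more
# ".letter" groups and an optional trailing period, not adjacent to other
# word characters or periods. Matches "U.S.", "N.A.S.A.", and "U.S.A",
# but not "wikipedia.org" or "a.b.com".
ACRONYM = re.compile(r"(?<![\w.])[^\W\d_](?:\.[^\W\d_])+\.?(?![\w.])")

# Same assumed break characters as word_break_helper.
WORD_BREAK_MAP = str.maketrans({".": " ", "_": " ", "(": " ", ")": " "})


def smart_tokens(text: str) -> list[str]:
    """Collapse acronyms first, then apply the usual word breaks."""
    collapsed = ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)
    return collapsed.translate(WORD_BREAK_MAP).split()


if __name__ == "__main__":
    for sample in ["U.S. highways", "N.A.S.A.", "wikipedia.org", "a.b.com"]:
        print(sample, "->", smart_tokens(sample))
```

The heuristic has obvious costs: abbreviations such as "e.g." are collapsed as well, and it would need checking against non-English text before being used anywhere.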

There is also word_break_helper_source_text which is used elsewhere, and includes colons (:). I haven't looked into how it's used and whether it's helpful as currently configured.

Event Timeline

TJones created this task. Jul 13 2017, 7:30 PM
Restricted Application added a subscriber: Aklapper. Jul 13 2017, 7:30 PM
debt triaged this task as Normal priority. Jul 20 2017, 5:16 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Hey all,

Just wanted to note that I encountered this issue today when searching for "List of US Highways"—"US" wasn't recognized as "U.S." for the purposes of the article title.

debt moved this task from Up Next to Language Stuff on the Discovery-Search board. Jan 29 2019, 6:44 PM