word_break_helper is a character filter used in Elasticsearch. It defines additional word breaks, including period (.), underscore (_), and parentheses. These make sense in many cases: word_break_helper is split into three words, wikipedia.org is split in two, and (unintentionally)poorly(parenthesized) words are split up.
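To make the effect concrete, here is a rough Python sketch that imitates the filter. It is not the actual Elasticsearch configuration: it assumes the filter simply maps each word-break character to a space, covers only the characters named above, and uses whitespace splitting as a stand-in for the real tokenizer. The names WORD_BREAK_CHARS, apply_word_break_helper, and tokenize are made up for illustration.

```python
# Hypothetical stand-in for word_break_helper: map each word-break
# character to a space. Only the characters named in the text are
# included; the real filter may map more.
WORD_BREAK_CHARS = "._()"
_TABLE = str.maketrans({ch: " " for ch in WORD_BREAK_CHARS})


def apply_word_break_helper(text):
    """Replace every word-break character with a space."""
    return text.translate(_TABLE)


def tokenize(text):
    """Whitespace tokenization as a simple stand-in for the real tokenizer."""
    return apply_word_break_helper(text).split()


if __name__ == "__main__":
    samples = [
        "word_break_helper",
        "wikipedia.org",
        "(unintentionally)poorly(parenthesized)",
    ]
    for sample in samples:
        print(sample, "->", tokenize(sample))
```

Under those assumptions, the three examples above come out as three, two, and three tokens respectively.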
However, word_break_helper also splits up acronyms and initialisms (U.S., N.A.S.A., etc.), which isn't really helpful. There is also a quirk of Elasticsearch: you can configure a character filter with a language analyzer, but the filter has no effect. So, we have "disabled" word_break_helper in some cases when we've unpacked analyzers, including French and Swedish, because it wasn't actually doing anything before unpacking.
Options include some combination of:
- removing the period from word_break_helper
- no longer making word_break_helper a default (as a default, it often doesn't do anything)
- removing it from the English config (and assessing its utility in other places where it is enabled)
- figuring out a way to do smarter handling of acronyms (which we probably don't want to split up) and domain names (which we probably do want to split up)
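As one possible shape for the last option, here is a minimal Python sketch of a heuristic that collapses acronyms before the word-break mapping runs. It is an assumption for illustration, not an existing configuration: it treats a run of two or more single letters, each followed by a period, as an acronym (U.S., N.A.S.A.) and removes its periods, and it leaves everything else to the normal mapping. The names ACRONYM and smarter_word_break are hypothetical, and whitespace splitting again stands in for the real tokenizer.

```python
import re

# Heuristic (an assumption, not the production rule): two or more single
# letters, each followed by a period, not adjacent to another letter on
# either side. Matches "U.S." and "N.A.S.A." but not "wikipedia.org".
ACRONYM = re.compile(r"(?<![A-Za-z])(?:[A-Za-z]\.){2,}(?![A-Za-z])")

# Same simplified mapping as word_break_helper: these characters become spaces.
WORD_BREAKS = str.maketrans({ch: " " for ch in "._()"})


def smarter_word_break(text):
    """Collapse acronyms first, then apply the usual word breaks."""
    # Strip the periods inside each acronym so the mapping cannot split it.
    collapsed = ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)
    return collapsed.translate(WORD_BREAKS).split()


if __name__ == "__main__":
    for sample in ["N.A.S.A. and wikipedia.org", "U.S. policy", "a.b.com"]:
        print(sample, "->", smarter_word_break(sample))
```

The heuristic has known gaps: it misses acronyms written without the final period (U.S.A), and it would also collapse abbreviations such as e.g. and i.e., so it would need to be checked against real queries before being adopted.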
There is also word_break_helper_source_text, which is used elsewhere and includes colons (:). I haven't looked into how it's used or whether it's helpful as currently configured.