Smarter handling of acronyms for word_break_helper in language analyzers
Closed, Resolved; Public; 5 Estimated Story Points

Description

word_break_helper is a character filter used in Elasticsearch. It defines additional word breaks, including period (.), underscore (_), and parens. These make sense in many cases—word_break_helper being split into three words, wikipedia.org being split in two, (unintentionally)poorly(parenthesized) words being split up.

However, word_break_helper also splits up acronyms and initialisms (U.S., N.A.S.A., etc.) which isn't really helpful. There is also a quirk of Elasticsearch that you can configure a character filter with a language analyzer, but it doesn't do anything. So, we have "disabled" word_break_helper in some cases when we've unpacked analyzers, including French and Swedish, because it wasn't actually doing anything before unpacking.
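To make the trade-off concrete, here is a minimal Python sketch of the behavior described above. It assumes word_break_helper acts as a plain character-to-space mapping over the four characters named in this description, and it stands in for the real tokenizer with a whitespace split. The names `WORD_BREAKS` and `tokens_after_word_break` are made up for illustration and are not part of CirrusSearch.

```python
# Hypothetical stand-in for word_break_helper: map each extra
# word-break character named in the description to a space.
WORD_BREAKS = str.maketrans({c: " " for c in "._()"})


def tokens_after_word_break(text: str) -> list[str]:
    """Apply the character mapping, then split on whitespace.

    The whitespace split is a simplification of the real tokenizer.
    """
    return text.translate(WORD_BREAKS).split()


if __name__ == "__main__":
    for sample in ("word_break_helper", "wikipedia.org", "U.S.", "N.A.S.A."):
        print(sample, "->", tokens_after_word_break(sample))
```

The first two samples split the way we want; the last two come out as single letters, which is the acronym problem this task is about.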

In general, word_break_helper seems to be a net positive, and should be enabled everywhere, but we need to figure out a way to do smarter handling of acronyms (which we probably don't want to split up) and domain names (which we probably do want to split up).

There is also word_break_helper_source_text which is used elsewhere, and includes colons (:). I haven't looked into how it's used and whether it's helpful as currently configured.

Event Timeline

debt triaged this task as Medium priority. Jul 20 2017, 5:16 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Hey all,

Just wanted to note that I encountered this issue today when searching for "List of US Highways"—"US" wasn't recognized as "U.S." for the purposes of the article title.

TJones raised the priority of this task from Medium to High. Aug 27 2020, 9:46 PM
TJones renamed this task from "Investigate disabling or modifying word_break_helper in language analyzers." to "Smarter handling of acronyms for word_break_helper in language analyzers". Aug 8 2022, 8:32 PM
TJones updated the task description.

Sorry for the hokey pokey (you put the ticket in, you take the ticket out... you put the ticket in, and you shake it all about), but the aggressive_splitting ticket (T219108) overlaps with this one too much. And! I discovered I can do what I want for acronym collapsing with a regex (probably... still checking on details) rather than a custom filter, which makes this easier, and I'd feel better about deploying word_break_helper everywhere with that fix in place.
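For illustration only, a rough Python sketch of the regex idea: collapse runs of single letters separated by periods before the word-break mapping runs, so acronyms survive as one token while domain names still get split. This is not the regex that was deployed. The pattern, the names `ACRONYM`, `collapse_acronyms` and `analyze`, and the whitespace split are all assumptions, and the sketch ignores combining marks, so it would not handle Brahmic scripts.

```python
import re

# Assumed pattern: a single letter, then one or more ".letter" pairs,
# then an optional trailing period. The lookarounds keep it from firing
# inside longer words or dotted names such as "wikipedia.org".
ACRONYM = re.compile(r"(?<![\w.])[^\W\d_](?:\.[^\W\d_])+\.?(?!\.?\w)")

# Same hypothetical word-break mapping as in the description:
# period, underscore, and parens become spaces.
WORD_BREAKS = str.maketrans({c: " " for c in "._()"})


def collapse_acronyms(text: str) -> str:
    """Drop the periods inside anything that looks like an acronym."""
    return ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)


def analyze(text: str) -> list[str]:
    """Collapse acronyms first, then apply word breaks and split."""
    return collapse_acronyms(text).translate(WORD_BREAKS).split()


if __name__ == "__main__":
    for sample in ("List of U.S. Highways", "N.A.S.A.", "wikipedia.org"):
        print(sample, "->", analyze(sample))
```

Running the acronym step first is the point of the ordering: once the periods are gone, the word-break mapping has nothing left to split inside the acronym.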

Change 938329 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enable updated word_break_helper everywhere-ish

https://gerrit.wikimedia.org/r/938329

Change 938329 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enable updated word_break_helper everywhere-ish

https://gerrit.wikimedia.org/r/938329

Change 941913 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Update acronym_fixer regex for Brahmic scripts

https://gerrit.wikimedia.org/r/941913

acronym_fixer is rather complicated, as expected. word_break_helper is a little complicated, unexpectedly! More on MediaWiki.

Change 941913 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Update acronym_fixer regex for Brahmic scripts

https://gerrit.wikimedia.org/r/941913

This has been deployed, but the reindexing was stopped for being too slow. I'll move this ticket into needs reporting and open a new one for the new efficiency refactor.