Smarter handling of acronyms for word_break_helper in language analyzers
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	TJones
	Jul 13 2017, 7:30 PM

Description

word_break_helper is a character filter used in Elasticsearch. It defines additional word breaks, including period (.), underscore (_), and parens. These make sense in many cases—word_break_helper being split into three words, wikipedia.org being split in two, (unintentionally)poorly(parenthesized) words being split up.
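As a rough illustration of the behavior described above, here is a minimal Python sketch that emulates a mapping-style character filter. It is not the CirrusSearch configuration: the character set is taken from the description (period, underscore, parentheses), and the Python function names `word_break_helper` and `tokenize`, along with the plain whitespace tokenizer, are assumptions made for the example.

```python
# Emulation only: the real filter is an Elasticsearch character filter;
# this sketch just mirrors the mapping described in the task.
WORD_BREAK_CHARS = "._()"  # period, underscore, parentheses (from the description)

# str.maketrans builds a table mapping each listed character to a space.
_BREAK_TABLE = str.maketrans({ch: " " for ch in WORD_BREAK_CHARS})


def word_break_helper(text: str) -> str:
    """Replace each word-break character with a space (hypothetical helper)."""
    return text.translate(_BREAK_TABLE)


def tokenize(text: str) -> list[str]:
    """Stand-in for a tokenizer: apply the filter, then split on whitespace."""
    return word_break_helper(text).split()


if __name__ == "__main__":
    for sample in (
        "word_break_helper",
        "wikipedia.org",
        "(unintentionally)poorly(parenthesized)",
    ):
        print(sample, "->", tokenize(sample))
```

Under these assumptions, the three examples from the description split into three, two, and three tokens respectively.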

However, word_break_helper also splits up acronyms and initialisms (U.S., N.A.S.A., etc.) which isn't really helpful. There is also a quirk of Elasticsearch that you can configure a character filter with a language analyzer, but it doesn't do anything. So, we have "disabled" word_break_helper in some cases when we've unpacked analyzers, including French and Swedish, because it wasn't actually doing anything before unpacking.
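To make the acronym problem concrete, the sketch below applies the same kind of period-to-space mapping to a couple of acronyms. It assumes a plain whitespace tokenizer plus lowercasing, which is far simpler than a real language analyzer, and `split_on_breaks` is a hypothetical name.

```python
import re

# Assumption: treat period, underscore, and parentheses as word breaks,
# then lowercase and split on whitespace. Real analyzers do far more.
_BREAKS = re.compile(r"[._()]")


def split_on_breaks(text: str) -> list[str]:
    """Hypothetical stand-in for word_break_helper followed by tokenizing."""
    return _BREAKS.sub(" ", text).lower().split()


if __name__ == "__main__":
    print(split_on_breaks("U.S."))      # single-letter tokens
    print(split_on_breaks("N.A.S.A."))  # single-letter tokens
    # The dotted and undotted forms share no tokens in this sketch.
    print(set(split_on_breaks("U.S.")) & set(split_on_breaks("US")))
```

In this simplified model the acronym dissolves into single-letter tokens, so nothing links the dotted form to the undotted one.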

In general, word_break_helper seems to be a net positive, and should be enabled everywhere, but we need to figure out a way to do smarter handling of acronyms (which we probably don't want to split up) and domain names (which we probably do want to split up).

There is also word_break_helper_source_text which is used elsewhere, and includes colons (:). I haven't looked into how it's used and whether it's helpful as currently configured.

Details

	Subject	Repo	Branch	Lines +/-
	Update acronym_fixer regex for Brahmic scripts	mediawiki/extensions/CirrusSearch	master	+174 -171
	Enable updated word_break_helper everywhere-ish	mediawiki/extensions/CirrusSearch	master	+2 K -385

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T219550 [EPIC] Harmonize language analysis across languages
		Resolved		TJones	T170625 Smarter handling of acronyms for word_break_helper in language analyzers

Event Timeline

TJones created this task. Jul 13 2017, 7:30 PM

Restricted Application added a subscriber: Aklapper. Jul 13 2017, 7:30 PM

TJones added a project: Discovery-Search. Jul 13 2017, 7:30 PM

TJones mentioned this in T85770: Build and enable thesaurus / synonym list for search. Jul 13 2017, 7:41 PM

debt triaged this task as Medium priority. Jul 20 2017, 5:16 PM

debt moved this task from needs triage to Up Next on the Discovery-Search board.

TJones mentioned this in T147959: Generic language fallbacks in Mediawiki should not be used for Elasticsearch language analyzers. Oct 10 2017, 8:02 PM

Hey all,

Just wanted to note that I encountered this issue today when searching for "List of US Highways"—"US" wasn't recognized as "U.S." for the purposes of the article title.

debt moved this task from Up Next to Language Stuff on the Discovery-Search board. Jan 29 2019, 6:44 PM

TJones mentioned this in T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis. Mar 28 2019, 6:59 PM

TJones mentioned this in T219550: [EPIC] Harmonize language analysis across languages. Mar 28 2019, 7:46 PM

TJones added a parent task: T219550: [EPIC] Harmonize language analysis across languages.

TJones raised the priority of this task from Medium to High. Aug 27 2020, 9:46 PM

TJones renamed this task from Investigate disabling or modifying word_break_helper in language analyzers. to Smarter handling of acronyms for word_break_helper in language analyzers. Aug 8 2022, 8:32 PM

TJones updated the task description. (Show Details)

TJones mentioned this in T331208: Cannot find term that is mixed with other characters. Mar 6 2023, 5:27 PM

TJones merged a task: T331208: Cannot find term that is mixed with other characters. Mar 13 2023, 4:29 PM

TJones added a subscriber: David_Hedlund.

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Mar 16 2023, 6:17 PM

• MPhamWMF set the point value for this task to 5. Apr 10 2023, 3:53 PM

• MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones claimed this task. May 16 2023, 8:28 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

TJones moved this task from In Progress to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Jun 8 2023, 10:35 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Jun 15 2023, 1:40 PM

Sorry for the hokey pokey (you put the ticket in, you take the ticket out... you put the ticket in, and you shake it all about), but the aggressive_splitting ticket (T219108) overlaps with this one too much. And! I discovered I can do what I want for acronym collapsing with a regex (probably... still checking on details) rather than a custom filter, which makes this easier, and I'd feel better about deploying word_break_helper everywhere with that fix in place.
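A regex-based approach along these lines could look like the following Python sketch. This is not the acronym_fixer pattern from CirrusSearch, which is more involved; the pattern, the names `ACRONYM_PERIOD`, `collapse_acronyms`, and `analyze`, and the whitespace tokenizer are all assumptions for illustration. The idea is to delete a period only when it sits between two single letters, and to run that step before the word-break mapping.

```python
import re

# Assumption: "letter" means any Unicode letter as seen by Python's re module.
LETTER = r"[^\W\d_]"

# Hypothetical, simplified pattern: a period between two single letters,
# i.e. the letter before it does not follow another letter, and the letter
# after it is not followed by another letter.
ACRONYM_PERIOD = re.compile(
    rf"(?<!{LETTER})({LETTER})\.(?={LETTER}(?!{LETTER}))"
)

# Same word-break characters as in the task description.
WORD_BREAKS = re.compile(r"[._()]")


def collapse_acronyms(text: str) -> str:
    """Drop periods inside acronyms; a trailing period is left in place."""
    return ACRONYM_PERIOD.sub(r"\1", text)


def analyze(text: str) -> list[str]:
    """Collapse acronyms first, then apply word breaks and split."""
    return WORD_BREAKS.sub(" ", collapse_acronyms(text)).split()


if __name__ == "__main__":
    for sample in ("List of U.S. Highways", "N.A.S.A.", "wikipedia.org"):
        print(sample, "->", analyze(sample))
```

Because the pattern requires single letters on both sides of the period, domain names such as wikipedia.org are left for the word-break step to split, while dotted acronyms are joined into one token. Edge cases (for example, single-letter subdomains) are not handled in this sketch.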

Change 938329 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enable updated word_break_helper everywhere-ish

https://gerrit.wikimedia.org/r/938329

gerritbot added a project: Patch-For-Review. Jul 15 2023, 12:39 AM

Change 938329 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enable updated word_break_helper everywhere-ish

https://gerrit.wikimedia.org/r/938329

Maintenance_bot removed a project: Patch-For-Review. Jul 20 2023, 9:30 PM

TJones mentioned this in T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, word_break_helper, and icu_tokenizer/_repair. Jul 21 2023, 3:15 PM

Change 941913 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Update acronym_fixer regex for Brahmic scripts

https://gerrit.wikimedia.org/r/941913

gerritbot added a project: Patch-For-Review. Jul 27 2023, 10:33 PM

acronym_fixer is rather complicated, as expected. word_break_helper is a little complicated, unexpectedly! More on MediaWiki.

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Jul 31 2023, 6:17 PM

Change 941913 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Update acronym_fixer regex for Brahmic scripts

https://gerrit.wikimedia.org/r/941913

Maintenance_bot removed a project: Patch-For-Review. Aug 1 2023, 9:11 AM

TJones moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. Aug 1 2023, 2:10 PM

This has been deployed, but the reindexing was stopped for being too slow. I'll move this ticket into Needs Reporting and open a new one for the new efficiency refactor.

TJones moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Sep 11 2023, 3:13 PM

Gehel closed this task as Resolved. Sep 15 2023, 9:29 AM
