User Story: As a searcher, I'd prefer better tokenizing of foreign scripts if it is available. However, on wikis where homoglyphs or certain other mixed-script words are common, I'd prefer not to have those tokens broken up unnecessarily.
The icu_tokenizer handles many Asian scripts much better than the standard tokenizer, which matters on wikis where those are foreign languages/scripts. The icu_token_repair filter solves most of the problems the icu_tokenizer creates with mixed-script tokens, such as Latin/Cyrillic homoglyphs (which can often be fixed by homoglyph_norm) and other intentionally mixed-script words.
If icu_token_repair is available, enable the icu_tokenizer everywhere the standard tokenizer is currently used, unless there is a language-specific problem with doing so.
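For illustration, the relevant part of the generated analysis settings might end up looking roughly like the sketch below (shown as a Python dict; the analyzer name, filter list, and filter order are assumptions, not actual AnalysisConfigBuilder output):

```
analysis_settings = {
    "analysis": {
        "analyzer": {
            "text": {
                "type": "custom",
                # icu_tokenizer replaces the standard tokenizer
                "tokenizer": "icu_tokenizer",
                # icu_token_repair (from the textify plugin) runs first, so
                # mixed-script tokens are rejoined before downstream filters
                # like homoglyph_norm see them
                "filter": ["icu_token_repair", "homoglyph_norm", "lowercase"],
            }
        }
    }
}
```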
The new config should not be committed until the plugin has been deployed. (T356651)
Acceptance Criteria:
Update AnalysisConfigBuilder to...
- ...enable the icu_tokenizer everywhere the standard tokenizer is currently used, if the textify plugin is available.
- ...automatically include icu_token_repair when the icu_tokenizer is used and the textify plugin is available.
- ...allow for specifying languages where the icu_tokenizer should always be used (allow), should never be used (deny), or should only be used if textify/icu_token_repair is available (default).
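The allow/deny/default behavior might work roughly like this sketch (the names, list contents, and the use of Python rather than AnalysisConfigBuilder's PHP are all illustrative assumptions):

```
# Hypothetical sketch of the allow/deny/default logic in the criteria above.
ICU_TOKENIZER_ALLOW = set()  # languages that should always get the icu_tokenizer
ICU_TOKENIZER_DENY = set()   # languages with known language-specific problems

def pick_tokenizer(lang: str, textify_available: bool) -> str:
    """Choose the tokenizer wherever the standard tokenizer is used today."""
    if lang in ICU_TOKENIZER_DENY:
        return "standard"
    if lang in ICU_TOKENIZER_ALLOW:
        return "icu_tokenizer"
    # Default: only switch if icu_token_repair (from textify) can be chained
    # in to rejoin mixed-script tokens the icu_tokenizer would split.
    return "icu_tokenizer" if textify_available else "standard"
```

Per the second criterion, whenever the chosen tokenizer is the icu_tokenizer and textify is available, icu_token_repair would then be added to the filter chain automatically.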
[This has been spun off from the acceptance criteria of the ancestor task (T332337) so it can be tracked separately (and that task can be closed independently), and to extend the scope a bit: enabling the icu_tokenizer everywhere it can be enabled.]