
Enable icu_tokenizer (almost) everywhere and update AnalysisConfigBuilder to use icu_token_repair
Closed, ResolvedPublic8 Estimated Story Points

Description

User Story: As a searcher, I'd prefer better tokenizing of foreign scripts if it is available. However, on wikis where homoglyphs or certain other mixed-script words are common, I'd prefer not to have those tokens broken up unnecessarily.

The icu_tokenizer handles many Asian scripts much better than the standard tokenizer, especially on wikis where those are foreign languages/scripts. The icu_token_repair filter solves most of the problems the icu_tokenizer has with mixed-script tokens, such as Latin/Cyrillic homoglyphs (which homoglyph_norm can often then fix) and other intentionally mixed-script words.

Enable icu_tokenizer everywhere the standard tokenizer is currently used if icu_token_repair is available, unless there is a language-specific problem with doing so.
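To make the change concrete, here is a sketch (as a Python dict, not the actual CirrusSearch config) of what the resulting Elasticsearch analyzer settings might look like: the icu_tokenizer replaces the standard tokenizer, and icu_token_repair runs early in the filter chain to rejoin tokens split at script boundaries. The filter names mirror those mentioned above, but the exact chain shown is illustrative, not the deployed configuration.

```python
# Hypothetical analyzer settings after the switch; the filter ordering
# and the presence of homoglyph_norm/lowercase are assumptions for
# illustration, not the real CirrusSearch output.
analyzer_settings = {
    "analyzer": {
        "text": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",   # was "standard"
            "filter": [
                "icu_token_repair",         # from the textify plugin
                "homoglyph_norm",
                "lowercase",
            ],
        }
    }
}
```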

The new config should not be committed until the plugin has been deployed. (T356651)

Acceptance Criteria:
Update AnalysisConfigBuilder to...

  • ...enable the icu_tokenizer everywhere the standard tokenizer is currently used, if the textify plugin is available
  • ...automatically include icu_token_repair when the icu_tokenizer is used and the textify plugin is available.
  • ...allow for specifying languages where the icu_tokenizer should always be used (allow), should never be used (deny), or should only be used if textify/icu_token_repair is available (default).
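The allow/deny/default rule in the last bullet can be sketched roughly as follows (a minimal Python illustration, not the actual AnalysisConfigBuilder code; the language codes in the lists are hypothetical placeholders):

```python
# Hypothetical per-language lists; real entries would come from
# language-specific testing.
ICU_TOKENIZER_ALLOW = {"aa"}  # always use icu_tokenizer
ICU_TOKENIZER_DENY = {"zz"}   # never use it (known language-specific problem)

def pick_tokenizer(lang: str, textify_available: bool) -> str:
    """Choose a tokenizer per the allow/deny/default rules above."""
    if lang in ICU_TOKENIZER_DENY:
        return "standard"
    if lang in ICU_TOKENIZER_ALLOW:
        return "icu_tokenizer"
    # default: only switch if icu_token_repair (from the textify plugin)
    # is available to clean up mixed-script tokens
    return "icu_tokenizer" if textify_available else "standard"
```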

[This has been spun off from the acceptance criteria from the ancestor task (T332337) so it can be tracked separately (and the other can be closed independently), and to extend the task a bit to enable the icu_tokenizer everywhere it can be enabled.]

Event Timeline

TJones changed the task status from Open to In Progress.Feb 5 2024, 2:31 PM
TJones created this task.
TJones set the point value for this task to 3.
TJones moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.
TJones renamed this task from Update AnalysisConfigBuilder to use icu_token_repair to Enable icu_tokenizer (almost) everywhere and update AnalysisConfigBuilder to use icu_token_repair.Feb 5 2024, 3:45 PM
TJones updated the task description.
TJones changed the point value for this task from 3 to 5.

Change 1004702 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Enable icu_tokenizer + icu_token_repair (almost) everywhere

https://gerrit.wikimedia.org/r/1004702

TJones changed the point value for this task from 5 to 8.Feb 20 2024, 10:21 PM

Full write-up on MediaWiki.

TL;DR:

  • Overall, the icu_tokenizer does what we want, especially with Asian scripts, and icu_token_repair keeps it from doing most of the things we don't want it to do.
  • We found a few cases where things don't work quite as we'd like, and added patches to the analysis chain to address many of them.
  • We also found a handful of anomalous characters that should be mapped or deleted in general. Some interact with the icu_tokenizer, but all but one warranted a general fix, so they have been fixed!
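As a quick illustration of the kind of mixed-script token icu_token_repair protects: in the example below, "сhocolate" starts with a Cyrillic с (a Latin/Cyrillic homoglyph), so a tokenizer that splits at script boundaries would break it into "с" + "hocolate" unless the pieces are repaired. A small script check (using Unicode character names as a stand-in for script detection; not how the plugin itself works) shows the mix:

```python
import unicodedata

def scripts(token: str) -> set[str]:
    """Approximate the scripts in a token via Unicode character names."""
    return {unicodedata.name(ch).split()[0] for ch in token if ch.isalpha()}

mixed = "сhocolate"    # first letter is U+0441 CYRILLIC SMALL LETTER ES
print(scripts(mixed))  # both CYRILLIC and LATIN appear in one token
```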

Change 1004702 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Enable icu_tokenizer + icu_token_repair (almost) everywhere

https://gerrit.wikimedia.org/r/1004702