User Story: As a searcher, I'd prefer better tokenizing of foreign scripts if it is available. However, on wikis where homoglyphs or certain other mixed-script words are common, I'd prefer not to have those tokens broken up unnecessarily.
The icu_tokenizer handles many Asian scripts much better than the standard tokenizer, which matters on wikis where those are foreign languages/scripts. The icu_token_repair filter solves most of the problems the icu_tokenizer creates with mixed-script tokens, such as Latin/Cyrillic homoglyphs (which can often be fixed by homoglyph_norm) and other intentionally mixed-script words.
If icu_token_repair is available, enable the icu_tokenizer everywhere the standard tokenizer is currently used, unless there is a language-specific problem with doing so.
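For illustration, the relevant part of the generated analysis settings might end up looking roughly like the sketch below (shown as a Python dict; the analyzer name, filter list, and filter order are assumptions, not actual AnalysisConfigBuilder output):

```
analysis_settings = {
    "analysis": {
        "analyzer": {
            "text": {
                "type": "custom",
                # icu_tokenizer replaces the standard tokenizer
                "tokenizer": "icu_tokenizer",
                # icu_token_repair (from the textify plugin) runs first, so
                # mixed-script tokens are rejoined before downstream filters
                # like homoglyph_norm see them
                "filter": ["icu_token_repair", "homoglyph_norm", "lowercase"],
            }
        }
    }
}
```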
The new config should not be committed until the plugin has been deployed. (T356651)
Acceptance Criteria:
Update AnalysisConfigBuilder to...
- ...enable the icu_tokenizer everywhere the standard tokenizer is currently used, if the textify plugin is available.
- ...automatically include icu_token_repair when the icu_tokenizer is used and the textify plugin is available.
- ...allow for specifying languages where the icu_tokenizer should always be used (allow), should never be used (deny), or should only be used if textify/icu_token_repair is available (default).
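The allow/deny/default behavior might work roughly like this sketch (the names, list contents, and the use of Python rather than AnalysisConfigBuilder's PHP are all illustrative assumptions):

```
# Hypothetical sketch of the allow/deny/default logic in the criteria above.
ICU_TOKENIZER_ALLOW = set()  # languages that should always get the icu_tokenizer
ICU_TOKENIZER_DENY = set()   # languages with known language-specific problems

def pick_tokenizer(lang: str, textify_available: bool) -> str:
    """Choose the tokenizer wherever the standard tokenizer is used today."""
    if lang in ICU_TOKENIZER_DENY:
        return "standard"
    if lang in ICU_TOKENIZER_ALLOW:
        return "icu_tokenizer"
    # Default: only switch if icu_token_repair (from textify) can be chained
    # in to rejoin mixed-script tokens the icu_tokenizer would split.
    return "icu_tokenizer" if textify_available else "standard"
```

Per the second criterion, whenever the chosen tokenizer is the icu_tokenizer and textify is available, icu_token_repair would then be added to the filter chain automatically.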
[This has been spun off from the acceptance criteria of the ancestor task (T332337) so it can be tracked separately (and that task can be closed independently), and to extend the scope a bit: enabling the icu_tokenizer everywhere it can be enabled.]