
Repair multi-script tokens split by the ICU tokenizer
Closed, Resolved · Public · 13 Estimated Story Points

Description

User Story: As a searcher, I expect tokens in alphabetic scripts to be split on whitespace or punctuation, not internally, and especially not in ways that depend on the context of other, possibly distant or unrelated, tokens.

The icu_tokenizer splits tokens on "script changes", but oddly this can span many tokens. For example, Д12345 67890 98765 43210 3x will split between the 3 and the x because of the Д (numbers belong to no particular script, so they inherit the script of the text that precedes them, even across spaces, which really looks like an error). This effect can extend across long spans of text and across sentence boundaries (e.g., "... for 'peace' is 'мир'. 3a illustrates ..." will split 3a into 3 + a because of мир).
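A quick way to reproduce this, assuming an Elasticsearch instance with the analysis-icu plugin installed (console syntax; exact output may vary by version):

```
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "Д12345 67890 98765 43210 3x"
}
```

The response ends with separate 3 and x tokens instead of a single 3x: the numbers inherit Cyrillic script from Д, so the tokenizer sees a script change between 3 and x.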

Genuine multi-script tokens like NGiИX are also split (in this case, into NGi + И + X). Homoglyph tokens (e.g., chocоlate, where the second o is Cyrillic) are also split up.
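The same kind of call (same assumptions as above) shows the multi-script split directly:

```
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "NGiИX"
}
```

The response holds three tokens, NGi, И, and X, one per script run.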

In the case of genuine multi-script tokens, this impairs precision, because the pieces can match when they are not part of the whole. (False positives for NGi + И + X are unlikely, but other cases are more prone to them.) Homoglyph tokens cannot be repaired (choc, о, and late are all separate tokens, not detectable by the homoglyph plugin), which impairs recall. For number+letter tokens (like 3x above), recall is also impaired because the parts do not match the whole (3x in a query does not match 3 + x in an article).

The standard tokenizer doesn't do this, but it also doesn't handle many Asian scripts well at all, which is why we have put up with the icu_tokenizer's idiosyncrasies.
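For comparison, the built-in standard tokenizer (same console-syntax sketch) keeps both examples whole:

```
GET /_analyze
{
  "tokenizer": "standard",
  "text": "NGiИX 3x"
}
```

Both NGiИX and 3x come back as single tokens, since the standard tokenizer breaks only on whitespace, punctuation, and similar word boundaries, never on script changes.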

This is a known problem for mixes of Latin, Greek, Cyrillic, and numbers, though other script combinations should be investigated, too.

Fixing these splits also requires updating all the relevant bookkeeping information (token offsets and positions), so that proximity searches and ranking are not adversely affected. The cjk_bigram filter can serve as a model: it combines and splits tokens based on script and offsets.
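Concretely, repair is more than concatenating the token text: the merged token's offsets and position must be recomputed so that phrase and proximity queries still line up. For NGiИX, the before-and-after token streams should look roughly like this (an abridged sketch of _analyze output; fields such as type are omitted):

```
Broken (icu_tokenizer alone):
  { "token": "NGi",   "start_offset": 0, "end_offset": 3, "position": 0 }
  { "token": "И",     "start_offset": 3, "end_offset": 4, "position": 1 }
  { "token": "X",     "start_offset": 4, "end_offset": 5, "position": 2 }

Repaired (icu_tokenizer + icu_token_repair):
  { "token": "NGiИX", "start_offset": 0, "end_offset": 5, "position": 0 }
```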

Acceptance criteria:

  • Create a plugin with an icu_token_repair filter that detects inappropriately split tokens and correctly repairs them (a possible analyzer wiring is sketched after this list).
  • Update AnalysisConfigBuilder to automatically include icu_token_repair when the icu_tokenizer is used and the plugin is available. (Moved to T356643)
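For reference, a minimal analyzer wiring might look like the sketch below. It assumes both the analysis-icu plugin and the new plugin providing icu_token_repair are installed; the index and analyzer names are made up, and the filter presumably needs to run first in the chain, before other filters rewrite the tokens it relies on:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_repaired_text": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_token_repair", "lowercase" ]
        }
      }
    }
  }
}
```

AnalysisConfigBuilder would generate the equivalent of this automatically whenever the icu_tokenizer is in use and the plugin is available (now tracked in T356643).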

Event Timeline

TJones changed the point value for this task from 5 to 8. (Jul 17 2023, 3:50 PM)
TJones changed the point value for this task from 8 to 13. (Dec 4 2023, 4:17 PM)

Gerrit patch for the plugin (which wasn't added here automatically): https://gerrit.wikimedia.org/r/c/search/extra/+/972478

A more detailed writeup (which partially overlaps the plugin docs) is available on MediaWiki.