
Repair multi-script tokens split by the ICU tokenizer
Closed, Resolved · Public · 13 Estimated Story Points

Description

User Story: As a searcher, I expect tokens in alphabetic scripts to be split on whitespace or punctuation, not internally, and especially not in ways that depend on the context of other, possibly distant or unrelated, tokens.

The icu_tokenizer splits tokens on "script changes", but oddly this can span many tokens. For example, Д12345 67890 98765 43210 3x will split between the 3 and the x because of the Д (numbers belong to no particular script, so they inherit the script of the text that precedes them, even across spaces, which really looks like an error). This effect can extend across long spans of text and across sentence boundaries (e.g., "... for 'peace' is 'мир'. 3a illustrates ..." will split 3a into 3 + a because of мир).
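A quick way to reproduce this, assuming an Elasticsearch instance with the analysis-icu plugin installed (console syntax; exact output may vary by version):

```
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "Д12345 67890 98765 43210 3x"
}
```

The response ends with separate 3 and x tokens instead of a single 3x: the numbers inherit Cyrillic script from Д, so the tokenizer sees a script change between 3 and x.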

Genuine multi-script tokens like NGiИX are also split (in this case, into NGi + И + X). Homoglyph tokens (e.g., chocоlate, where the second o is Cyrillic) are also split up.
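The same kind of call (same assumptions as above) shows the multi-script split directly:

```
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "NGiИX"
}
```

The response holds three tokens, NGi, И, and X, one per script run.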

In the case of genuine multi-script tokens, this impairs precision, because the pieces can match when they are not part of the whole. (False positives for NGi + И + X are unlikely, but other cases are more prone to them.) Homoglyph tokens cannot be repaired (choc, о, and late are all separate tokens, not detectable by the homoglyph plugin), which impairs recall. For number+letter tokens (like 3x above), recall is also impaired because the parts do not match the whole (3x in a query does not match 3 + x in an article).

The standard tokenizer doesn't do this, but it also doesn't handle many Asian scripts well at all, which is why we have put up with the icu_tokenizer's idiosyncrasies.
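For comparison, the built-in standard tokenizer (same console-syntax sketch) keeps both examples whole:

```
GET /_analyze
{
  "tokenizer": "standard",
  "text": "NGiИX 3x"
}
```

Both NGiИX and 3x come back as single tokens, since the standard tokenizer breaks only on whitespace, punctuation, and similar word boundaries, never on script changes.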

This is a known problem for mixes of Latin, Greek, Cyrillic, and numbers, though other script combinations should be investigated, too.

Fixing these splits also requires updating all the relevant bookkeeping information (token offsets and positions), so that proximity searches and ranking are not adversely affected. The cjk_bigram filter can serve as a model: it combines and splits tokens based on script and offsets.
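Concretely, repair is more than concatenating the token text: the merged token's offsets and position must be recomputed so that phrase and proximity queries still line up. For NGiИX, the before-and-after token streams should look roughly like this (an abridged sketch of _analyze output; fields such as type are omitted):

```
Broken (icu_tokenizer alone):
  { "token": "NGi",   "start_offset": 0, "end_offset": 3, "position": 0 }
  { "token": "И",     "start_offset": 3, "end_offset": 4, "position": 1 }
  { "token": "X",     "start_offset": 4, "end_offset": 5, "position": 2 }

Repaired (icu_tokenizer + icu_token_repair):
  { "token": "NGiИX", "start_offset": 0, "end_offset": 5, "position": 0 }
```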

Acceptance criteria:

  • Create a plugin with an icu_token_repair filter that detects inappropriately split tokens and correctly repairs them (a possible analyzer wiring is sketched after this list).
  • Update AnalysisConfigBuilder to automatically include icu_token_repair when the icu_tokenizer is used and the plugin is available. (Moved to T356643)
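For reference, a minimal analyzer wiring might look like the sketch below. It assumes both the analysis-icu plugin and the new plugin providing icu_token_repair are installed; the index and analyzer names are made up, and the filter presumably needs to run first in the chain, before other filters rewrite the tokens it relies on:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_repaired_text": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_token_repair", "lowercase" ]
        }
      }
    }
  }
}
```

AnalysisConfigBuilder would generate the equivalent of this automatically whenever the icu_tokenizer is in use and the plugin is available (now tracked in T356643).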

Event Timeline

TJones changed the point value for this task from 5 to 8. (Jul 17 2023, 3:50 PM)
TJones changed the point value for this task from 8 to 13. (Dec 4 2023, 4:17 PM)

Gerrit patch for the plugin (which wasn't added here automatically): https://gerrit.wikimedia.org/r/c/search/extra/+/972478

A more detailed writeup (which partially overlaps the plugin docs) is available on MediaWiki.