Harmonize Invisibles in Cirrus Language Analysis
Open, Needs Triage | Public | 3 Estimated Story Points

Description

While looking into T87548, I noticed that some of the tokenizers split on soft hyphens, which is an error. I then checked other common invisible characters and found the following minor issues across the various tokenizers.

  • SmartCN tokenizer
    • splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
    • creates empty tokens for ZWNJ, ZWJ, ZWSP, LTR mark, RTL mark
  • Sudachi
    • splits on soft hyphen, ZWNJ, ZWJ, LTR mark
  • Kuromoji
    • does not split on ZWSP
  • Hebrew
    • splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
  • Nori
    • splits on ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
  • Thai
    • did not split on ZWSPs, as designed during unpacking; harmonized with other analyzers as part of T87548
    • the full list of invisibles below should be checked

The full list of invisibles to check for these tokenizers is:

  • soft hyphen (00AD)
  • RTL bidi (200F, 202B, 202E, 2067, 061C)
  • non-joiner (200C, 2063)
  • first strong isolate bidi (2068)
  • joiner (200D, 2060)
  • pop bidi (2069, 202C)
  • variation selector (FE00-FE0F, E0100-E01EF)
  • whitespace (200B, 202F, 3000, FEFF, 00A0)
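
For anyone checking these by hand, the list above can be expanded into individual code points with their Unicode names. This is only a convenience sketch; the names `INVISIBLES`, `expand`, and `describe` are made up for illustration and are not part of CirrusSearch.

```python
import unicodedata

# Code points copied from the list above; each (lo, hi) range is inclusive.
INVISIBLES = {
    "soft hyphen": [(0x00AD, 0x00AD)],
    "RTL bidi": [(0x200F, 0x200F), (0x202B, 0x202B), (0x202E, 0x202E),
                 (0x2067, 0x2067), (0x061C, 0x061C)],
    "non-joiner": [(0x200C, 0x200C), (0x2063, 0x2063)],
    "first strong isolate bidi": [(0x2068, 0x2068)],
    "joiner": [(0x200D, 0x200D), (0x2060, 0x2060)],
    "pop bidi": [(0x2069, 0x2069), (0x202C, 0x202C)],
    "variation selector": [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)],
    "whitespace": [(0x200B, 0x200B), (0x202F, 0x202F), (0x3000, 0x3000),
                   (0xFEFF, 0xFEFF), (0x00A0, 0x00A0)],
}


def expand(ranges):
    """Flatten a list of inclusive (lo, hi) ranges into code points."""
    return [cp for lo, hi in ranges for cp in range(lo, hi + 1)]


def describe(cp):
    """Format one code point as 'U+XXXX NAME'."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}"


if __name__ == "__main__":
    for group, ranges in INVISIBLES.items():
        cps = expand(ranges)
        print(f"{group}: {len(cps)} code point(s)")
        # Only show the first few entries of the long variation selector ranges.
        for cp in cps[:5]:
            print("   ", describe(cp))
```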

These can largely be fixed with character filters that either delete characters that should not be split on or convert characters that should split tokens into spaces. The SmartCN tokenizer's empty-token problem should go away once these characters are deleted or converted before tokenization, but we should still check for empty tokens afterwards.
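
A minimal Python sketch of the delete-or-convert idea, outside of Elasticsearch and not the actual CirrusSearch patch. It assumes, purely for illustration, that the "whitespace" group is converted to spaces and every other group in the list is deleted; the real per-character and per-tokenizer choices may differ. The whitespace `tokenize` function is a hypothetical stand-in for a real tokenizer.

```python
# Assumption: these characters should NOT split tokens, so they are deleted.
DELETE = (
    "\u00AD"                          # soft hyphen
    "\u200F\u202B\u202E\u2067\u061C"  # RTL bidi
    "\u200C\u2063"                    # non-joiner
    "\u2068"                          # first strong isolate bidi
    "\u200D\u2060"                    # joiner
    "\u2069\u202C"                    # pop bidi
)

# Assumption: these characters SHOULD split tokens, so they become spaces.
TO_SPACE = "\u200B\u202F\u3000\uFEFF\u00A0"

_TABLE = {ord(c): None for c in DELETE}
_TABLE.update({ord(c): " " for c in TO_SPACE})
# Variation selectors (FE00-FE0F, E0100-E01EF) are deleted as well.
_TABLE.update({cp: None for cp in range(0xFE00, 0xFE10)})
_TABLE.update({cp: None for cp in range(0xE0100, 0xE01F0)})


def filter_invisibles(text: str) -> str:
    """Delete non-splitting invisibles and turn splitting ones into spaces."""
    return text.translate(_TABLE)


def tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: filter first, then split on whitespace."""
    return filter_invisibles(text).split()


def empty_tokens(tokens: list[str]) -> list[int]:
    """Return the positions of any empty tokens, for a post-fix sanity check."""
    return [i for i, tok in enumerate(tokens) if tok == ""]
```

Running the filter before the tokenizer means the tokenizer never sees the invisible characters at all, which is why the empty tokens would be expected to disappear; `empty_tokens` is the kind of check that could confirm that on real tokenizer output.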

Event Timeline

TJones renamed this task from Harmonize Invisibles to Harmonize Invisibles in Cirrus Language Analysis. (Sep 22 2025, 3:19 PM)
pfischer set the point value for this task to 3. (Sep 22 2025, 3:39 PM)

Change #1200117 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Harmonize Common Invisibles

https://gerrit.wikimedia.org/r/1200117

Change #1200117 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Harmonize Common Invisibles

https://gerrit.wikimedia.org/r/1200117