While looking into T87548, I noticed that some of the tokenizers split on soft hyphens, which is an error. I looked into other common invisible characters and found the following minor issues with how the various tokenizers handle them.
- SmartCN tokenizer
  - splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
  - creates empty tokens for ZWNJ, ZWJ, ZWSP, LTR mark, RTL mark
- Sudachi
  - splits on soft hyphen, ZWNJ, ZWJ, LTR mark
- Kuromoji
  - does not split on ZWSP
- Hebrew
  - splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
- Nori
  - splits on ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
- Thai
  - did not split on ZWSP, as designed during unpacking; harmonized with other analyzers as part of T87548
  - the full list of invisibles below should be checked
The full list of invisibles to check for these tokenizers is:
- soft hyphen (00AD)
- RTL bidi (200F, 202B, 202E, 2067, 061C)
- non-joiner (200C, 2063)
- first strong isolate bidi (2068)
- joiner (200D, 2060)
- pop bidi (2069, 202C)
- variation selector (FE00-FE0F, E0100-E01EF)
- whitespace (200B, 202F, 3000, FEFF, 00A0)
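For checking test strings against this list, a small lookup table may help. The following is only a sketch: the `INVISIBLES` table and the `find_invisibles` helper are hypothetical names made up for illustration, and the code points are copied from the list above without adding any others.

```python
# Hypothetical helper: code points copied from the list above, grouped by label.
INVISIBLES = {
    "soft hyphen": [0x00AD],
    "RTL bidi": [0x200F, 0x202B, 0x202E, 0x2067, 0x061C],
    "non-joiner": [0x200C, 0x2063],
    "first strong isolate bidi": [0x2068],
    "joiner": [0x200D, 0x2060],
    "pop bidi": [0x2069, 0x202C],
    # Ranges are inclusive in the list above, so the range() end is one past.
    "variation selector": [*range(0xFE00, 0xFE10), *range(0xE0100, 0xE01F0)],
    "whitespace": [0x200B, 0x202F, 0x3000, 0xFEFF, 0x00A0],
}

# Reverse lookup: code point -> label.
CATEGORY_BY_CODEPOINT = {
    cp: label for label, cps in INVISIBLES.items() for cp in cps
}


def find_invisibles(text):
    """Return (index, hex code point, label) for each listed invisible in text."""
    return [
        (i, f"{ord(ch):04X}", CATEGORY_BY_CODEPOINT[ord(ch)])
        for i, ch in enumerate(text)
        if ord(ch) in CATEGORY_BY_CODEPOINT
    ]
```

Running `find_invisibles` over tokenizer input (or over the tokens that come out) gives a quick way to see which of the listed characters are present and where.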
These can largely be fixed with character filters that either delete characters that should not be split on or convert characters that should split tokens into spaces. The SmartCN tokenizer's empty token problem should be fixed by converting these characters before tokenization, but we should still look for empty tokens.
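As a rough illustration of the delete-or-convert idea, here is a minimal Python sketch. It is not the actual analyzer configuration: `filter_invisibles` and `tokenize` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the assignment of characters to each group is an assumption covering only an illustrative subset (soft hyphen, ZWNJ, ZWJ, LTR/RTL marks, and the pop directional characters deleted; ZWSP converted to a space).

```python
# Assumption: characters that should NOT split tokens, so they are deleted.
# soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate,
# pop directional formatting
DELETE = "\u00AD\u200C\u200D\u200E\u200F\u2069\u202C"

# Assumption: characters that SHOULD split tokens, so they become spaces.
TO_SPACE = "\u200B"  # ZWSP

# One translation table: None deletes the character, " " replaces it.
INVISIBLES_TABLE = str.maketrans(
    {**{ch: None for ch in DELETE}, **{ch: " " for ch in TO_SPACE}}
)


def filter_invisibles(text):
    """Stand-in for a character filter that runs before tokenization."""
    return text.translate(INVISIBLES_TABLE)


def tokenize(text):
    """Stand-in tokenizer: filter first, then split on whitespace."""
    return filter_invisibles(text).split()


def has_empty_tokens(tokens):
    """Check for the empty-token problem described above."""
    return any(token == "" for token in tokens)
```

With this sketch, `tokenize("hy\u00ADphen")` yields a single token and `tokenize("foo\u200Bbar")` yields two, and `has_empty_tokens` can be run over the output of any tokenizer as a sanity check.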