While looking into T87548, I noticed that some of the tokenizers split on soft hyphens, which is an error. I looked into other common invisible characters and found the following minor issues across the various tokenizers.
* SmartCN tokenizer
** splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
** creates empty tokens for ZWNJ, ZWJ, ZWSP, LTR mark, RTL mark
* Sudachi
** splits on soft hyphen, ZWNJ, ZWJ, LTR mark
* Kuromoji
** does not split on ZWSP
* Hebrew
** splits on soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
* Nori
** splits on ZWJ, LTR mark, RTL mark, pop directional isolate, pop directional formatting
* Thai
** did not split on ZWSP, as designed during unpacking; harmonized with other analyzers as part of T87548
** the full list of invisibles below should be checked
The full list of invisibles to check for these tokenizers is:
* soft hyphen (00AD)
* RTL bidi (200F, 202B, 202E, 2067, 061C)
* non-joiner (200C, 2063)
* first strong isolate bidi (2068)
* joiner (200D, 2060)
* pop bidi (2069, 202C)
* variation selector (FE00-FE0F, E0100-E01EF)
* whitespace (200B, 202F, 3000, FEFF, 00A0)
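One way to check the list systematically is to generate a probe string per code point, with the invisible placed between two ordinary words. The following is only a sketch: the grouping mirrors the list above, while the `INVISIBLES` name, the `probe_strings` helper, and the `abc`/`def` filler words are made up for illustration and would need adapting per language (e.g. Thai or Hebrew text for those tokenizers).

```python
# Code points copied from the list above, grouped the same way.
INVISIBLES = {
    "soft hyphen": [0x00AD],
    "RTL bidi": [0x200F, 0x202B, 0x202E, 0x2067, 0x061C],
    "non-joiner": [0x200C, 0x2063],
    "first strong isolate bidi": [0x2068],
    "joiner": [0x200D, 0x2060],
    "pop bidi": [0x2069, 0x202C],
    # Ranges are inclusive in the list above, hence the +1 on the upper bound.
    "variation selector": list(range(0xFE00, 0xFE0F + 1))
    + list(range(0xE0100, 0xE01EF + 1)),
    "whitespace": [0x200B, 0x202F, 0x3000, 0xFEFF, 0x00A0],
}


def probe_strings(left="abc", right="def"):
    """Yield (group, code point label, probe text) for every invisible.

    The filler words are placeholders (an assumption); use text in the
    script of the tokenizer under test.
    """
    for group, code_points in INVISIBLES.items():
        for cp in code_points:
            yield group, f"U+{cp:04X}", f"{left}{chr(cp)}{right}"
```

Each probe would then be run through the analyzer under test, comparing the number of tokens that come back against what is expected for that group (one token for characters that should not split, two for characters that should).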
These can largely be fixed with character filters that either delete characters that should not be split on or convert characters that //should// split tokens into spaces. The SmartCN tokenizer's empty-token problem should be fixed by converting these characters before they reach the tokenizer, but we should still look for empty tokens.
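As a rough illustration of the delete-or-convert idea, here is a minimal Python sketch. It is not the actual analysis chain config; which characters go in which bucket is an assumption based on the findings above, and `DELETE`, `TO_SPACE`, `filter_invisibles`, and `empty_tokens` are names invented for this example.

```python
# Assumption: these should never split a token, so they are deleted
# (soft hyphen, ZWNJ, ZWJ, LTR mark, RTL mark, pop directional
# formatting, pop directional isolate).
DELETE = "\u00AD\u200C\u200D\u200E\u200F\u202C\u2069"

# Assumption: these should split tokens, so they become a plain space
# (the "whitespace" group from the list above).
TO_SPACE = "\u200B\u202F\u3000\uFEFF\u00A0"

# str.translate takes a dict keyed by code point; None means "delete".
_TABLE = {ord(c): None for c in DELETE}
_TABLE.update({ord(c): " " for c in TO_SPACE})


def filter_invisibles(text: str) -> str:
    """Apply the delete / convert-to-space mapping before tokenizing."""
    return text.translate(_TABLE)


def empty_tokens(tokens):
    """Return indexes of tokens that are empty once invisibles are removed."""
    return [
        i for i, tok in enumerate(tokens)
        if filter_invisibles(tok).strip() == ""
    ]
```

With this mapping, `filter_invisibles("foo\u200bbar")` gives `"foo bar"`, and `empty_tokens` can be run over a tokenizer's output as a regression check for the SmartCN empty-token case. In the search cluster itself the equivalent would presumably be a character filter in the analysis config rather than Python.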