Our current logic does not properly accommodate right-to-left languages.
MARTIN: there are two components to correctly handling right-to-left languages (non-whitespace languages are a separate discussion):
- our logic to merge (wrongly) split sentences: for now, we have logic that works for left-to-right. We could file an issue and adapt the current logic for right-to-left (if that is needed; I am not sure that anything needs to change).
- the identification of abbreviations from Wiktionary: I am not sure how abbreviated words appear in right-to-left languages (where is the punctuation symbol, and does our regex identify it correctly?).
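One way to probe the abbreviation question is a small check like the sketch below. The pattern `ABBREVIATION_PATTERN` and the helper `looks_like_abbreviation` are hypothetical stand-ins, not our actual regex, and the Arabic token (the letter dal followed by a period) is only an illustrative example of a period-terminated abbreviation. It assumes the abbreviation ends with a period in storage order; scripts that mark abbreviations with other symbols would need their own cases.

```python
import re

# Hypothetical pattern for illustration only (NOT the project's regex):
# one or more word characters followed by a single period.
ABBREVIATION_PATTERN = re.compile(r"^\w+\.$")


def looks_like_abbreviation(token: str) -> bool:
    """Return True if the token is word characters followed by a period."""
    return ABBREVIATION_PATTERN.match(token) is not None


english = "Dr."
arabic = "\u062f."  # Arabic letter DAL followed by a period

# In storage (logical) order the period is the final code point in both
# tokens, even though an RTL-aware client draws it on the left.
assert english[-1] == "."
assert arabic[-1] == "."

# The same pattern therefore matches both tokens.
assert looks_like_abbreviation(english)
assert looks_like_abbreviation(arabic)
```

If our real regex is anchored on "letters, then punctuation" in the same way, it should behave identically for both directions; that is the assumption a real test would need to confirm.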
Isaac:
Per discussion, this is unlikely to be an issue and can probably be closed, but it'd be great to have a test to document that it's not an issue :) RTL vs. LTR is a client-side interpretation of Unicode but does not affect how the text is actually represented within Python (it's all 0-indexed strings). That makes function names like lstrip and rstrip misnomers because they're really stripping leading/trailing characters, not left/right characters.
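A minimal sketch of what such a documenting test could look like, using only the standard library. The test name and the sample Hebrew string are placeholders chosen for illustration; a real test would presumably exercise our own segmentation functions instead.

```python
import unicodedata


def test_rtl_strings_use_logical_order():
    # Two leading spaces, the Hebrew word "shalom", a period, two trailing spaces.
    text = "  \u05e9\u05dc\u05d5\u05dd.  "

    # The letters really are right-to-left characters (bidi class "R").
    assert unicodedata.bidirectional(text[2]) == "R"

    # lstrip removes LEADING whitespace (index 0 onward), even though an
    # RTL-aware client would draw those spaces on the right-hand side.
    leading_stripped = text.lstrip()
    assert leading_stripped[0] == "\u05e9"
    assert leading_stripped.endswith(".  ")

    # rstrip removes TRAILING whitespace; the period stays the last character.
    trailing_stripped = text.rstrip()
    assert trailing_stripped.startswith("  ")
    assert trailing_stripped[-1] == "."


test_rtl_strings_use_logical_order()
```

The sentence-final period sits at the end of the string in memory regardless of display direction, which is why logic written against string indices should not need a separate RTL branch.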