
[wiki-nlp-tools] Sentence Tokenization: adapt sentence-split-logic to take into account right-to-left languages
Open, Needs Triage, Public

Description

Our current logic does not properly accommodate right-to-left languages.

MARTIN: there are two components to correctly handling right-to-left languages (non-whitespace languages are a separate discussion)

  • our logic to merge (wrongly) split sentences; for now, we have logic that works for left-to-right languages. We could file an issue and adapt the current logic for right-to-left (if that is needed; I am not sure that anything needs to change).
  • the identification of abbreviations from Wiktionary. I am not sure how abbreviated words appear in right-to-left languages (where does the punctuation symbol sit, and does our regex identify it correctly?)
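To make the abbreviation question concrete, here is a minimal sketch of how one could check where the punctuation sits in the stored string. The pattern `ABBREVIATION_PATTERN` and the helper `find_abbreviation_candidates` are hypothetical stand-ins, not the actual wiki-nlp-tools regex, and the sample text ("Dr. Ahmed" in Arabic, with a one-letter abbreviation followed by a full stop) is only an illustrative assumption about how such abbreviations are commonly written.

```python
import re

# Hypothetical pattern, NOT the wiki-nlp-tools regex: one or more word
# characters immediately followed by a full stop.
ABBREVIATION_PATTERN = re.compile(r"\w+\.")

# Arabic "Dr. Ahmed", written with escapes so the logical (storage) order
# is unambiguous no matter how an editor or terminal renders it:
# dal, full stop, space, then the four letters of the name.
arabic_text = "\u062f. \u0623\u062d\u0645\u062f"


def find_abbreviation_candidates(text):
    """Return (start_index, matched_text) for each candidate abbreviation."""
    return [(m.start(), m.group()) for m in ABBREVIATION_PATTERN.finditer(text)]


candidates = find_abbreviation_candidates(arabic_text)
print(candidates)
```

In this sample the full stop is stored directly after the abbreviated letter (index 1), so a pattern that expects "letters, then full stop" matches at index 0 even though the text is displayed right-to-left. Whether the real Wiktionary-derived abbreviations follow this shape would still need to be checked against the data.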

Isaac:
Per discussion, this is unlikely to be an issue and can probably be closed, but it'd be great to have a test documenting that it's not an issue :) RTL vs. LTR is a client-side rendering decision; it does not affect how the Unicode text is actually represented within Python (it's all 0-indexed strings stored in logical order). That makes function names like lstrip and rstrip misnomers: they really strip leading/trailing characters, not left/right characters.
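A test along these lines might look like the following sketch. The sample string (Hebrew "shalom" followed by a full stop, wrapped in a leading tab and trailing newline) and the test name are assumptions chosen for illustration; it only exercises built-in `str` methods and `unicodedata`, not the wiki-nlp-tools tokenizer itself.

```python
import unicodedata

# Hebrew "shalom" + full stop, with a leading tab and a trailing newline.
# Escapes are used so the logical (storage) order is explicit.
hebrew_sentence = "\t\u05e9\u05dc\u05d5\u05dd.\n"


def test_strip_follows_logical_order():
    # The first letter is a strongly right-to-left character.
    assert unicodedata.bidirectional(hebrew_sentence[1]) == "R"

    # lstrip removes LEADING whitespace (the tab), even though a renderer
    # would draw the start of this text on the right.
    assert hebrew_sentence.lstrip() == "\u05e9\u05dc\u05d5\u05dd.\n"

    # rstrip removes TRAILING whitespace (the newline).
    assert hebrew_sentence.rstrip() == "\t\u05e9\u05dc\u05d5\u05dd."

    # The sentence-final full stop is the last character in the string,
    # and the first letter of the word is at index 0.
    stripped = hebrew_sentence.strip()
    assert stripped[0] == "\u05e9"
    assert stripped[-1] == "."


test_strip_follows_logical_order()
```

If these assertions hold, index-based logic written for left-to-right text (checking the last character for sentence-final punctuation, for instance) should apply unchanged to right-to-left text.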