https://he.wikipedia.org/wiki/%D7%91%D7%99%D7%A9%D7%A2%D7%94 has the Unicode right-to-left mark U+200F (https://en.wikipedia.org/wiki/Right-to-left_mark) in several places in its wikitext. When Parsoid parses this text, these characters show up in the HTML as well:
<p data-parsoid='{"dsr":[18,66,0,0]}'><200f> <200f><link rel="mw:PageProp/Category" href="./קטגוריה:טקסים" data-parsoid='{"stx":"simple","a":{"href":"./קטגוריה:טקסים"},"sa":{"href":"קטגוריה:טקסים"},"dsr":[21,38,null,null]}'/><200f> <200f> <link rel="mw:PageProp/Category" href="./קטגוריה:_שיטות_משפט" data-parsoid='{"stx":"simple","a":{"href":"./קטגוריה:_שיטות_משפט"},"sa":{"href":"קטגוריה: שיטות משפט"},"dsr":[42,65,null,null]}'/><200f></p>
Since this is a control character, ideally it should not trip separator tests. However, it does trip up the separator insertion code: that code looks for a run of whitespace characters, and this control character is treated as a printable character, which breaks the run of separator characters. This leads to the wt2wt roundtrip diffs mentioned in https://gerrit.wikimedia.org/r/#/c/244231/
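To make the failure mode concrete, here is a minimal sketch in Python (not Parsoid's own code or language, and Python's \s is not guaranteed to match the regexp engine Parsoid runs on). The names WS_ONLY and WS_OR_MARKS are hypothetical stand-ins for a whitespace-only separator test and a relaxed variant that also accepts the directional marks. The sample string mirrors the `<200f> <200f>` run before the first category link in the HTML above.

```python
import re
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Hypothetical stand-in for a separator test that only accepts whitespace.
WS_ONLY = re.compile(r"^\s*$")

# Hypothetical relaxed test that also accepts the LTR and RTL marks.
WS_OR_MARKS = re.compile(r"^[\s\u200e\u200f]*$")

# The run "<200f> <200f>" that precedes the first category link.
separator = RLM + " " + RLM

# U+200F is in general category Cf (format), so \s does not match it.
print(unicodedata.category(RLM))           # Cf
print(bool(WS_ONLY.match(separator)))      # False: the mark breaks the whitespace run
print(bool(WS_OR_MARKS.match(separator)))  # True once the marks are allowed
```

Whether the right fix is to widen the separator character class like this, or to handle these characters somewhere else, is an open question; the sketch only shows why a whitespace-only test rejects the run.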
Looking at the table of Unicode characters, I see a number of other control characters. For example, there is U+200E, which is the left-to-right mark. There are other characters in the Cf category that are non-displaying control characters.
Besides this, our regexps currently don't handle all the Unicode characters in the Zs category. https://en.wikipedia.org/wiki/Whitespace_character lists 25 whitespace characters.
It is worth auditing the code to see how the various \s-centric regexps in Parsoid measure up against these.
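One possible shape for such an audit, sketched in Python purely as an illustration: enumerate every code point in a given Unicode general category and report the ones a candidate pattern fails to match. The helper `unmatched` and the pattern NARROW are hypothetical; NARROW stands for the kind of hand-written whitespace class an audit might turn up, not an actual Parsoid regexp. The results depend on the regexp engine and the Unicode version in use, so the same check would need to be repeated against the engine Parsoid actually runs on.

```python
import re
import sys
import unicodedata


def unmatched(pattern, categories):
    """Return the code points in the given general categories that
    `pattern` does not match as a single character."""
    regex = re.compile(pattern)
    missing = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in categories and not regex.fullmatch(ch):
            missing.append(cp)
    return missing


# Hypothetical hand-written whitespace class, for illustration only.
NARROW = r"[ \t\r\n]"

# Zs characters that the narrow class misses.
zs_missing = unmatched(NARROW, {"Zs"})

# Cf characters that \s misses (in Python's engine).
cf_missing = unmatched(r"\s", {"Cf"})

for cp in zs_missing:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp), "<unnamed>")))
print("Cf characters not matched by \\s:", len(cf_missing))
```

Running the same enumeration over each \s-centric pattern would give a per-regexp list of the Zs and Cf characters it lets through.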
Also, worth auditing the PHP parser code to see how these characters are handled.