
Audit code for proper unicode character handling
Open, Medium, Public

Description

https://he.wikipedia.org/wiki/%D7%91%D7%99%D7%A9%D7%A2%D7%94 has the RTL Unicode character U+200F (right-to-left mark, https://en.wikipedia.org/wiki/Right-to-left_mark) in several places in the wikitext. When Parsoid parses this text, these characters show up in the HTML as well:

<p data-parsoid='{"dsr":[18,66,0,0]}'><200f>
<200f><link rel="mw:PageProp/Category" href="./קטגוריה:טקסים" data-parsoid='{"stx":"simple","a":{"href":"./קטגוריה:טקסים"},"sa":{"href":"קטגוריה:טקסים"},"dsr":[21,38,null,null]}'/><200f>
<200f> <link rel="mw:PageProp/Category" href="./קטגוריה:_שיטות_משפט" data-parsoid='{"stx":"simple","a":{"href":"./קטגוריה:_שיטות_משפט"},"sa":{"href":"קטגוריה: שיטות משפט"},"dsr":[42,65,null,null]}'/><200f></p>

Since this is a non-printing control character, ideally it should not trip separator tests. However, its presence trips up the separator insertion code. That code looks for a run of whitespace characters, and this character is treated as a printable character that breaks the run. This leads to the wt2wt roundtrip diffs mentioned in https://gerrit.wikimedia.org/r/#/c/244231/
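A minimal sketch of the failure mode, written in Python for illustration and not taken from Parsoid. `is_separator` is a hypothetical stand-in for a whitespace-only separator test; the only assumption is that the real test, like this one, relies on `\s`-style matching.

```python
import re

RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def is_separator(text):
    """Hypothetical separator test: true only if text is pure whitespace."""
    return re.fullmatch(r"\s*", text) is not None


plain = "\n \n"
with_rlm = "\n" + RLM + " \n"

print(is_separator(plain))     # the whole string is whitespace
print(is_separator(with_rlm))  # U+200F is not matched by \s, so the run breaks
```

The second string looks identical to the first when rendered, which is why the resulting roundtrip diffs are hard to spot by eye.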

Looking at the table of Unicode characters, I see a number of other control characters. For example, there is U+200E, the left-to-right mark (LRM). There are other characters in the Cf (format) category that are also non-displaying control characters.
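One way to check which category a given character falls into is Python's standard `unicodedata` module. This is only an inspection aid for the audit, and `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(cp):
    """Return (name, general category, str.isspace()) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.category(ch), ch.isspace()


# The two direction marks, plus two ordinary spaces for comparison.
for cp in (0x200E, 0x200F, 0x0020, 0x00A0):
    name, category, is_space = describe(cp)
    print(f"U+{cp:04X} {name}: category={category} isspace={is_space}")
```

Both direction marks report category `Cf` and are not considered whitespace, while U+0020 and U+00A0 report `Zs`.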

Besides this, our regexps currently don't handle all the Unicode characters in the Zs category. https://en.wikipedia.org/wiki/Whitespace_character lists 25 whitespace characters.
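A rough sketch of how such an audit could be automated, assuming the 25 characters in question are the code points carrying the Unicode White_Space property. `missed_by` is a hypothetical helper, and the patterns below are examples rather than regexps copied from Parsoid. Python's regex engine also differs from the engines Parsoid and the PHP parser use, so this illustrates the method only.

```python
import re

# The 25 code points with the Unicode White_Space property (assumed list).
WHITE_SPACE = (
    list(range(0x0009, 0x000E))        # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))      # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def missed_by(pattern, flags=0):
    """Return the White_Space code points that `pattern` fails to match."""
    rx = re.compile(pattern, flags)
    return [f"U+{cp:04X}" for cp in WHITE_SPACE if not rx.fullmatch(chr(cp))]


print(len(WHITE_SPACE))
print(missed_by(r"[ \t\r\n]"))     # a hand-written ASCII class
print(missed_by(r"\s", re.ASCII))  # \s restricted to ASCII whitespace
print(missed_by(r"\s"))            # Python's Unicode-aware \s
```

Running each separator-related pattern through a check like this gives a concrete list of characters it does not cover. Characters such as U+200E and U+200F are not in this list at all and need a separate decision.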

It is worth auditing the code to see how the various \s-centric regexps in Parsoid measure up against these.

It is also worth auditing the PHP parser code to see how these characters are handled.

Event Timeline

ssastry raised the priority of this task to Medium.
ssastry updated the task description.
ssastry subscribed.

Change 244472 had a related patch set uploaded (by Subramanya Sastry):
T115018: Some tweaks to separator handling to handle unicode chars.

https://gerrit.wikimedia.org/r/244472

Change 244472 merged by jenkins-bot:
T115018: Some tweaks to separator handling to handle unicode chars.

https://gerrit.wikimedia.org/r/244472

Change 248942 had a related patch set uploaded (by Subramanya Sastry):
Revert "T115018: Some tweaks to separator handling to handle unicode chars."

https://gerrit.wikimedia.org/r/248942

Change 248942 merged by jenkins-bot:
Revert "T115018: Some tweaks to separator handling to handle unicode chars."

https://gerrit.wikimedia.org/r/248942