[subbu@earth:~/work/wmf/parsoid] node -v
> s = "a b";
[ 'a', 'b' ]
> s2 = "아 고"
[ '아', ' ', '고' ]
We use \w in pegTokenizer.pegjs (and in the html2wt constrained text code for autolink urls), and possibly in many other places.
This matters for parsing text like `'아들 고건 사진https://m.blog.naver.com/stageph/220175427529'`. This should parse as plain text, since the URL is immediately preceded by a word character. However, Parsoid/JS (and currently Parsoid/PHP) parses the https://.. string as an autolink. On Parsoid/PHP, this is fixed by changing the autolink url precheck to `return preg_match( '/\w$/uD', substr( $this->input, 0, $this->endOffset() ) );` to match the checks used by `src/Html2Wt/ConstrainedText/AutoURLLinkText.php`. However, a similar change to Parsoid/JS doesn't fix the parsing of the link text, because JavaScript's \w matches only the ASCII word characters [A-Za-z0-9_] and so does not match Hangul characters such as 진.
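A minimal sketch of the difference, assuming a Node version that supports Unicode property escapes (`\p{...}` with the `u` flag). The helper names are hypothetical and do not come from the Parsoid source, and the character class is only an approximation of what PHP's \w matches under the `u` modifier:

```javascript
// Hypothetical helpers for illustration only; not Parsoid code.

// JavaScript's \w is ASCII-only: [A-Za-z0-9_].
function endsWithAsciiWordChar(str) {
  return /\w$/.test(str);
}

// Unicode-aware approximation: letters, marks, numbers, underscore.
function endsWithUnicodeWordChar(str) {
  return /[\p{L}\p{M}\p{N}_]$/u.test(str);
}

// The text that immediately precedes the URL in the example above.
const prefix = '아들 고건 사진';

console.log(endsWithAsciiWordChar(prefix));   // false: 진 is not matched by \w
console.log(endsWithUnicodeWordChar(prefix)); // true: 진 is a Unicode letter
console.log(endsWithAsciiWordChar('photo'));  // true: ASCII letters match \w
```

With the ASCII-only check, the precheck never sees a word character before the URL, which is consistent with the autolink being produced.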
The broken parsing and the broken autolink url nowiki regexps lead to differences in selser output between Parsoid/JS & Parsoid/PHP.
We will likely decline this ticket post Parsoid/PHP deployment, but I am filing it here as documentation.