Page MenuHomePhabricator

Parsoid/JS use of \w \s \b etc is inconsistent with PHP's behavior when the 'u' regexp modifier is used, which leads to selective serializer output differences between Parsoid/PHP & Parsoid/JS in some scenarios
Open, LowPublic

Description

See below:

[subbu@earth:~/work/wmf/parsoid] node -v
v6.17.1
[subbu@earth:~/work/wmf/parsoid] node
> s = "a b";
'a b'
> s.match(/\w/g);
[ 'a', 'b' ]
> s2 = "아 고"
'아 고'
> s2.match(/\w/g);
null
> s2.match(/\W/g);
[ '아', ' ', '고' ]

We use \w in pegTokenizer.pegjs (and in html2wt constrained text code for autolink urls) and possibly many other places.

This matters for parsing text like '아들 고건 사진https://m.blog.naver.com/stageph/220175427529'. This should parse as plain text. However, Parsoid/JS (and currently Parsoid/PHP) parses the http://.. string as an autolink. On Parsoid/PHP, this is fixed by changing the autolink url precheck to return preg_match( '/\w$/uD', substr( $this->input, 0, $this->endOffset() ) ); to match the checks used by src/Html2Wt/ConstrainedText/AutoURLLinkText.php. However, a similar change to Parsoid/JS doesn't fix the parsing of the link text because of the use of \w which doesn't do the right thing.

This broken parsing and broken autolink url nowiki regexps leads to differences in selser output between Parsoid/JS & Parsoid/PHP.

We will likely decline this ticket post Parsoid/PHP deployment, but am filing it here as documentation.

(Note that having the u modifier change the meaning of \w and \s was added to PHP in 5.4.1 and is likely not obvious to most core devs; see for example 176f91596c80cf5e41484c7377d809d09f0ecbf7 which could have used \s instead of adding \p{Zs} explicitly.)

Details

Related Gerrit Patches:
mediawiki/services/parsoid : masterFix autolink url parsing code

Event Timeline

ssastry created this task.Nov 11 2019, 8:39 PM
Restricted Application added subscribers: jeblad, Aklapper. · View Herald TranscriptNov 11 2019, 8:39 PM
ssastry triaged this task as Low priority.Nov 11 2019, 8:39 PM
ssastry moved this task from Backlog to Parsoid/JS bugs on the Parsoid-PHP board.
cscott added subscribers: hashar, cscott.EditedNov 14 2019, 10:30 PM

I think it's arguably a bug that autolinks use the [\p{L}\p{N}_] definition of \w.
@hashar changed the regexp modifier to include u in 176f91596c80cf5e41484c7377d809d09f0ecbf7 as part of a patch primarily to recognize a wider range of space characters. I'm not sure it was ever intentional to change the behavior of \w here -- using Unicode word characters in a tokenizer is problematic because the set of unicode word characters is not fixed but changes over time. The set of space characters is much less problematic, since it is relatively fixed.

I think a reasonable solution would be to require positive look behind for a space character (or start-of-string) before free external links and RFC/PMID/ISBN. That would avoid dragging in \p{L} (unicode "letters") and \p{N} (unicode "numbers") and let the tokenizer grammar be defined solely in terms of p{Zs} (unicode "space") which is a compact set already defined in the PEG grammar:

// Unicode "separator, space" category.  It covers the \u0020 space as well
// as \u3000 IDEOGRAPHIC SPACE (see bug 19052).  In PHP this is \p{Zs}.
// Keep this up-to-date with the characters tagged ;Zs; in
// http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
unispace = [ \u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]

Change 550366 had a related patch set uploaded (by C. Scott Ananian; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Fix autolink url parsing code

https://gerrit.wikimedia.org/r/550366

cscott renamed this task from Parsoid/JS use of \w is broken and doesn't work with non-ascii characters and leads to selective serializer output differences between Parsoid/PHP & Parsoid/JS in some scenarios to Parsoid/JS use of \w \s \b etc is inconsistent with PHP's behavior when the 'u' regexp modifier is used, which leads to selective serializer output differences between Parsoid/PHP & Parsoid/JS in some scenarios.Nov 15 2019, 6:05 PM
cscott updated the task description. (Show Details)

The original review does not have much details (r93291). I am pretty sure I wrote that patch when trying to help the Chinese community and enhancing the parser for a few of their edge cases.

For that specific patch, the idea was simply to have a URL boundary to end when encountering the U+3000 IDEOGRAPHIC SPACE and I went with just supporting any space separator as stated by Unicode which is \p{Zs} enabled using the regex unicode modifier u. That probably caused the regex to end up being slower as a result :-\

Maybe that was overkill and I should have added support for U+3000. At least there is a parser test for that and it is still in core.

As for \s recognizing U+3000 when in unicode mode, it most probably did not work in the PHP / PCRE version we used at the time and I went with \p{Zs} instead. I would expect to have tested \s and would have find it to not be working?

In my patch, I don't think \w was used, regardless I am pretty sure I would have missed the fact that u would have caused more characters to be matched.

References: T21052 T27409


My guess is that if we still terminate an URL when encountering U+3000, we are fine :-]

Change 550366 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix autolink url parsing code

https://gerrit.wikimedia.org/r/550366