
Bug tokenizing commented <ref>
Closed, Resolved (Public)

Description

The commented-out <ref> seems to throw off the tokenizer. The likely cause is either the regexp that extracts the ref content or the ordering of productions (comments and extension tags).

A<ref>B <!--<ref name="x" />--></ref>
C<ref>D</ref>

Reduced test case from output seen in http://parsoid-lb.eqiad.wikimedia.org/enwiki/Axial_Seamount?oldid=657127754
See report here: https://en.wikipedia.org/w/index.php?title=Wikipedia:VisualEditor/Feedback&oldid=657286828#Article_swallowed_as_a_note.

Event Timeline

ssastry raised the priority of this task to Medium.
ssastry updated the task description.
ssastry subscribed.
ssastry set Security to None.

I did a quick test:

-                    while (s && s.match(new RegExp("<" + tagName + "[^<>]*>"))) {
+                    while (s && s.match(new RegExp("<" + tagName + "[^/<>]*>"))) {

That change fixes this specific test case. However, there is a larger issue here: comment parsing has lower precedence than extension content parsing. As a result, there will be several other test cases where commented-out opening or closing <ref> tags (or tags of any extension, really) parse differently in Parsoid than in the PHP parser, where comments are stripped out of the text before additional processing.
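
To make the difference concrete, here is a minimal standalone sketch, not Parsoid code. It builds the pattern the same way as the quick test above and runs both variants against the content of the first <ref> in the reduced test case. It also shows a rough comments-first pass in the spirit of what is described for the PHP parser. The helpers `openTagRegExp`, `stripComments` and `refBodies` are hypothetical names made up for this illustration, and the comment and ref patterns are deliberately naive (no nesting, no unclosed comments).

```javascript
// Build the opening-tag pattern the same way as the quick test above.
// `excludeSelfClosed` switches between the original and the patched character class.
function openTagRegExp(tagName, excludeSelfClosed) {
    const attrs = excludeSelfClosed ? "[^/<>]*" : "[^<>]*";
    return new RegExp("<" + tagName + attrs + ">");
}

// Hypothetical preprocessing step (an assumption, not the PHP parser's code):
// drop closed HTML comments before looking for extension tags.
function stripComments(s) {
    return s.replace(/<!--[\s\S]*?-->/g, "");
}

// Hypothetical, naive extraction of <ref> bodies from comment-free text.
function refBodies(s) {
    const bodies = [];
    const re = /<ref[^<>]*>([\s\S]*?)<\/ref>/g;
    let m;
    while ((m = re.exec(s)) !== null) {
        bodies.push(m[1]);
    }
    return bodies;
}

const wikitext = 'A<ref>B <!--<ref name="x" />--></ref>\nC<ref>D</ref>';
const firstRefBody = 'B <!--<ref name="x" />-->';

// Original pattern: the commented-out self-closed tag looks like a nested opening tag.
console.log(openTagRegExp("ref", false).test(firstRefBody)); // true

// Patched pattern: a "/" before ">" prevents the match.
console.log(openTagRegExp("ref", true).test(firstRefBody)); // false

// Side effect of the patched pattern: any "/" inside the tag also prevents a match.
console.log(openTagRegExp("ref", true).test('<ref name="a/b">')); // false

// Comments-first ordering: both refs come out with the expected bodies.
console.log(refBodies(stripComments(wikitext))); // [ 'B ', 'D' ]
```

The third check suggests why the character-class tweak is best read as a quick test rather than a full fix: it treats any slash inside the tag as a sign of self-closing, including one inside an attribute value.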

Change 282394 had a related patch set uploaded (by Arlolra):
T96555: Remove <ref> hack from the tokenizer

https://gerrit.wikimedia.org/r/282394

Change 282394 had a related patch set uploaded (by Arlolra):
WIP: Remove <ref> hack from the tokenizer

https://gerrit.wikimedia.org/r/282394

Change 326890 had a related patch set uploaded (by Arlolra):
T96555: Ignore self-closed tags when extending source

https://gerrit.wikimedia.org/r/326890

Change 326890 merged by jenkins-bot:
T96555: Ignore self-closed tags when extending source

https://gerrit.wikimedia.org/r/326890