Tokenizer thrown off by < char on a line
Closed, Resolved, Public

Description

[subbu@earth lib] echo "equation C<r in <ref>foo</ref>" | node parse --normalize
<p>equation C&lt;r in &lt;ref>foo&lt;/ref></p>

This causes the diff at https://fr.wikipedia.org/w/index.php?title=%C3%89volution_de_l%27altruisme&diff=prev&oldid=112044063, as reported on the WP:VE/F page.

Event Timeline

ssastry set the priority of this task to Medium.
ssastry updated the task description.
ssastry added a project: Parsoid.
ssastry subscribed.

This is being parsed as <r in="" <ref="">foo</ref>, i.e. "<ref" is consumed as an attribute name on the r tag, and then the r tag is sanitized.

We changed to this behaviour in https://gerrit.wikimedia.org/r/#/c/173212/. See the output for "Handle broken pre-like tags (bug 64025)".

The parsing spec (http://www.w3.org/TR/html5/syntax.html#attributes-0) doesn't exclude < from attribute names.

However, the PHP parser clearly differs here:

<div id="1" <div id="2">ine</div>

becomes

&lt;div id="1"
<div id="2">
ine
</div>

So, we could return to breaking on < or try and "fix" this on the php side. Any thoughts?
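
To make the two options concrete, here is a rough sketch of the difference, using the div example above. It is not the Parsoid tokenizer or the PHP parser; split_first_tag and its break_on_lt flag are made-up names, and the logic only looks at where the first tag is taken to start.

```python
def split_first_tag(text, break_on_lt):
    """Return (literal_prefix, tag_source) for the first tag in text.

    break_on_lt=False: everything from the first '<' up to the next '>'
    is treated as a single tag, so a stray '<' ends up inside it.
    break_on_lt=True: if another '<' shows up before the '>', the earlier
    '<' is given up as literal text and the tag restarts at the later one.

    Hypothetical helper for illustration only; raises ValueError if the
    text contains no '<' or no following '>'.
    """
    start = text.index("<")
    while True:
        end = text.index(">", start)
        inner_lt = text.find("<", start + 1, end)
        if not break_on_lt or inner_lt == -1:
            return text[:start], text[start:end + 1]
        # Restart the tag at the inner '<'; what came before stays literal.
        start = inner_lt


if __name__ == "__main__":
    source = '<div id="1" <div id="2">ine</div>'
    print(split_first_tag(source, break_on_lt=False))
    print(split_first_tag(source, break_on_lt=True))
```

With break_on_lt=False the whole of <div id="1" <div id="2"> is one tag, which matches what we do now; with break_on_lt=True the first <div id="1" is left as text that would be escaped, which matches the PHP output shown above.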

I verified in Firefox that it parses the div just like we do, so yes, the PHP parser is the one that parses this differently. I wonder if it is because of HTML4. But if we change the PHP parser behavior (which seems reasonable), that could break a lot of existing pages. So, not sure yet.

Perhaps a grep of dumps can help inform whether we change PHP parser behavior or fix Parsoid to deviate from HTML5 syntax.
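
A possible starting point for such a scan, as a sketch only: the regex below is a guess at what "a < inside an open tag" looks like on a single line (a tag-like opener, a space or tab, then another < before any >). It will have false positives and negatives, and the names SUSPECT_TAG and find_suspect_lines are made up here. How the wikitext is pulled out of a dump is left out.

```python
import re

# Heuristic, not the grammar of either parser: "<name", a space or tab,
# then a second "<" before any ">" on the same line.
SUSPECT_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[ \t][^<>\n]*<")


def find_suspect_lines(wikitext):
    """Yield (line_number, tag_name, line) for lines matching the heuristic."""
    for lineno, line in enumerate(wikitext.splitlines(), start=1):
        match = SUSPECT_TAG.search(line)
        if match:
            yield lineno, match.group(1), line


if __name__ == "__main__":
    sample = "\n".join([
        "equation C<r in <ref>foo</ref>",
        '<div id="1" <div id="2">ine</div>',
        "plain <b>bold</b> text",
        "a < b",
    ])
    for lineno, tag, line in find_suspect_lines(sample):
        print(lineno, tag, line)
```

Counting hits per tag name across a dump would give a rough idea of how many pages depend on the current PHP behavior.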