Tokenizer thrown off by < char on a line
Closed, ResolvedPublic

Authored By

	• ssastry
	Feb 23 2015, 3:53 PM

Description

[subbu@earth lib] echo "equation C<r in <ref>foo</ref>" | node parse --normalize
<p>equation C&lt;r in &lt;ref>foo&lt;/ref></p>

This causes the diff at https://fr.wikipedia.org/w/index.php?title=%C3%89volution_de_l%27altruisme&diff=prev&oldid=112044063, as reported on the WP:VE/F page.

Event Timeline

• ssastry created this task. Feb 23 2015, 3:53 PM

• ssastry raised the priority of this task to Medium.

• ssastry updated the task description.

• ssastry added a project: Parsoid.

• ssastry subscribed.

Restricted Application added a subscriber: Aklapper. Feb 23 2015, 3:53 PM

This is being parsed as <r in="" <ref="">foo</ref>, and the r tag is then sanitized.

We changed to this behaviour in https://gerrit.wikimedia.org/r/#/c/173212/. See the output for "Handle broken pre-like tags (bug 64025)".

The parsing spec (http://www.w3.org/TR/html5/syntax.html#attributes-0) doesn't exclude < from attribute names.
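To make the spec behaviour above concrete, here is a minimal Python sketch of HTML5-style start-tag tokenization. It is a simplification written for illustration, not Parsoid's tokenizer: the function name parse_start_tag is hypothetical, and parse errors, character references and self-closing flags are ignored. The point it shows is that < does not terminate an attribute name, so the second tag opener is swallowed as an attribute.

```python
WHITESPACE = " \t\n\f\r"


def parse_start_tag(text, pos=0):
    """Parse a start tag at text[pos] == '<' using simplified HTML5 rules.

    Returns (tag_name, attrs, end_index), or None if no complete tag is found.
    Hypothetical helper for illustration only.
    """
    n = len(text)
    if pos >= n or text[pos] != "<":
        return None
    i = pos + 1
    if i >= n or not (text[i].isascii() and text[i].isalpha()):
        return None
    start = i
    while i < n and text[i] not in WHITESPACE + "/>":
        i += 1
    name = text[start:i].lower()
    attrs = {}
    while i < n:
        # Skip whitespace and stray slashes before the next attribute.
        while i < n and text[i] in WHITESPACE + "/":
            i += 1
        if i >= n:
            return None
        if text[i] == ">":
            return name, attrs, i + 1
        # Attribute name: only whitespace, '/', '>' and '=' end it.
        # A '<' is just another name character.
        start = i
        i += 1
        while i < n and text[i] not in WHITESPACE + "/>=":
            i += 1
        attr_name = text[start:i].lower()
        while i < n and text[i] in WHITESPACE:
            i += 1
        value = ""
        if i < n and text[i] == "=":
            i += 1
            while i < n and text[i] in WHITESPACE:
                i += 1
            if i < n and text[i] in "\"'":
                end = text.find(text[i], i + 1)
                if end == -1:
                    return None
                value = text[i + 1:end]
                i = end + 1
            else:
                start = i
                while i < n and text[i] not in WHITESPACE + ">":
                    i += 1
                value = text[start:i]
        # The first occurrence of a duplicate attribute wins.
        attrs.setdefault(attr_name, value)
    return None


line = "equation C<r in <ref>foo</ref>"
print(parse_start_tag(line, line.index("<")))
# ('r', {'in': '', '<ref': ''}, 22)
```

Under these rules the tag is r with the attributes in and <ref, which matches the <r in="" <ref=""> parse described above.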

However, the php parser clearly differs here:

<div id="1" <div id="2">ine</div>

becomes

&lt;div id="1"
<div id="2">
ine
</div>

So, we could return to breaking on < or try and "fix" this on the php side. Any thoughts?
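For comparison, the "break on <" option can be sketched as follows. This is a rough Python approximation of the behaviour seen in the PHP output above, not the PHP parser's actual code: the function name split_breaking_on_lt and the regular expression are assumptions made for illustration. Every < starts a fresh attempt at a tag, and an attempt that hits another < before finding its > is emitted as literal text.

```python
import re

# A chunk (the text following one '<') is a tag only if it reaches a '>'.
TAG_RE = re.compile(r"(/?[A-Za-z][^\s/>]*[^>]*)>(.*)", re.DOTALL)


def split_breaking_on_lt(text):
    """Tokenize text so that every '<' restarts tag recognition."""
    first, *rest = text.split("<")
    tokens = [("text", first)] if first else []
    for chunk in rest:
        m = TAG_RE.fullmatch(chunk)
        if m:
            tokens.append(("tag", "<" + m.group(1) + ">"))
            if m.group(2):
                tokens.append(("text", m.group(2)))
        else:
            # No closing '>' before the next '<': keep the '<' as literal text.
            tokens.append(("text", "<" + chunk))
    return tokens


for token in split_breaking_on_lt('<div id="1" <div id="2">ine</div>'):
    print(token)
# ('text', '<div id="1" ')
# ('tag', '<div id="2">')
# ('text', 'ine')
# ('tag', '</div>')
```

With this rule the task's original input, equation C<r in <ref>foo</ref>, keeps <ref> and </ref> as tags and leaves "<r in " as text.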

Arlolra claimed this task. Feb 23 2015, 10:50 PM

I verified in Firefox that it parses the div just like we do, so yes, the PHP parser is the one that parses this differently. I wonder if that is because of HTML4. But if we change the PHP parser behavior (which seems reasonable), that could break a lot of existing pages. So, not sure yet.

Arlolra added a subscriber: cscott. Feb 23 2015, 10:53 PM

Perhaps a grep of dumps can help inform whether we change PHP parser behavior or fix Parsoid to deviate from HTML5 syntax.
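One possible shape for such a grep, as a hedged Python sketch: it flags lines where something that looks like a start tag runs into another < before reaching its >. The pattern and the name find_candidates are assumptions for illustration. It over-matches ordinary text such as "a <b c < d", so the hits are candidates to review rather than confirmed cases, and how lines are read from a dump is left to the caller.

```python
import re

# A tag-like opener, whitespace, then another '<' before any '>'.
CANDIDATE_RE = re.compile(r"<[A-Za-z][^\s<>/]*\s[^<>]*<")


def find_candidates(lines):
    """Yield (line_number, matched_text) for each suspicious tag opener."""
    for line_no, line in enumerate(lines, start=1):
        for m in CANDIDATE_RE.finditer(line):
            yield line_no, m.group(0)


sample = [
    "equation C<r in <ref>foo</ref>",
    '<div id="1">ok</div>',
    '<div id="1" <div id="2">ine</div>',
]
for hit in find_candidates(sample):
    print(hit)
# (1, '<r in <')
# (3, '<div id="1" <')
```

Passing an open file object in place of the sample list would scan a dump line by line, though a < that sits on a later line than its tag opener would be missed by this line-based approach.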

This was fixed by https://gerrit.wikimedia.org/r/#/c/217888/
