Parsoid output on [[en:Microsoft]] is not valid XML
Closed, Resolved (Public)

Description

Parsoid's output is supposed to be both valid HTML and valid XML. But https://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Microsoft/653066299 is not valid XML, because it contains a comment like <!-- foo--bar --> which is valid HTML but not valid XML.

This breaks https://en.wikipedia.org/wiki/Microsoft?oldid=653066299&veaction=edit in Internet Explorer, because we use an XML parser to work around IE's many attribute normalization bugs.

> $.ajax({ url: 'https://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Microsoft/653066299', dataType: 'text' } ).done( function(text) { window.doc = new DOMParser().parseFromString(text, 'text/xml'); console.log(window.doc); });
Object {readyState: 1, getResponseHeader: function, getAllResponseHeaders: function, setRequestHeader: function, overrideMimeType: function…}

> window.doc.getElementsByTagName('parsererror')[0].innerHTML
"<h3 xmlns="http://www.w3.org/1999/xhtml">This page contains the following errors:</h3><div xmlns="http://www.w3.org/1999/xhtml" style="font-family:monospace;font-size:12px">error on line 102 at column 1469: Comment not terminated 
&lt;!-- Generally we stick to products that are in the cu
</div><h3 xmlns="http://www.w3.org/1999/xhtml">Below is a rendering of the page up to the first error.</h3>"

> window.xml.indexOf('Generally we stick')
83002
> window.xml.slice(82990, 83200)
"/span> <!-- Generally we stick to products that are in the current annual report here--if you wish to add one that is not you need to provide a reference for it -->For the 2010 <a rel="mw:WikiLink" href="./Fisc"
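The comment above fails because the XML 1.0 Comment production forbids "--" inside a comment body (and a body ending in "-"), while HTML5 accepts it. A minimal sketch of that check follows; isXmlSafeCommentBody is a hypothetical helper name, not part of Parsoid or VisualEditor, and it ignores XML's other character restrictions.

```javascript
// Hypothetical helper (not a Parsoid or VisualEditor API).
// Takes the text between "<!--" and "-->" and reports whether XML 1.0
// would accept it as a comment body: no "--" anywhere, no trailing "-".
function isXmlSafeCommentBody(body) {
  const hasDoubleHyphen = body.indexOf('--') !== -1;
  const endsWithHyphen = body.charAt(body.length - 1) === '-';
  return !hasDoubleHyphen && !endsWithHyphen;
}

// The comment body from revision 653066299, as quoted in the task.
const offending = ' Generally we stick to products that are in the current ' +
  'annual report here--if you wish to add one that is not you need to ' +
  'provide a reference for it ';

console.log(isXmlSafeCommentBody(offending));   // false
console.log(isXmlSafeCommentBody(' foo bar ')); // true
```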

Event Timeline

Catrope raised the priority of this task to Needs Triage.
Catrope updated the task description.
Catrope added projects: Parsoid, Parsoid-DOM.
Catrope subscribed.

We could fix this on the way out of Parsoid, but we'll probably emit a normalized comment on the way in for non-selsered portions. I think that is a reasonable solution for now.

Sounds good to me.
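
One way such a normalization could work is sketched below. This is only an illustration under stated assumptions, not the scheme the merged patch uses: encodeCommentBody and decodeCommentBody are hypothetical names, and the choice of "&#x2D;" as the escape is an assumption made for the example. The idea is to rewrite the body so that no "--" and no trailing "-" remain, in a way that can be undone when serializing back to wikitext.

```javascript
// Hypothetical sketch, not Parsoid's actual implementation.
// Escape "&" first so that the "&#x2D;" escapes added afterwards can be
// told apart from any literal "&#x2D;" already present in the comment.
function encodeCommentBody(body) {
  return body
    .replace(/&/g, '&amp;')
    .replace(/-(?=-)/g, '&#x2D;') // any hyphen followed by another hyphen
    .replace(/-$/, '&#x2D;');     // a hyphen at the very end of the body
}

// Inverse of encodeCommentBody: undo the hyphen escapes, then the "&" escapes.
function decodeCommentBody(encoded) {
  return encoded
    .replace(/&#x2D;/g, '-')
    .replace(/&amp;/g, '&');
}

console.log(encodeCommentBody(' foo--bar '));
// " foo&#x2D;-bar "
console.log(decodeCommentBody(encodeCommentBody(' foo--bar ')) === ' foo--bar ');
// true
```

Because the encoded body never contains "--" and never ends in "-", it is accepted by both an HTML and an XML parser.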

ssastry renamed this task from [[en:Microsoft]] is not valid XML to Parsoid output on [[en:Microsoft]] is not valid XML. (Mar 26 2015, 11:59 PM)
ssastry triaged this task as Medium priority.
ssastry set Security to None.
ssastry moved this task from Needs Triage to In Progress on the Parsoid board.

http://www.w3.org/TR/html5/syntax.html#comments
http://www.w3.org/TR/html5/syntax.html#comment-start-state

Another issue caused by not respecting the HTML5 comment tokenizing algorithm came up during this week's deploy, on kowiki/문지애:

echo "<!-->{{hi}}<-->" | node parse --wt2wt
<!----><nowiki>{{hi}}</nowiki><-->

In the round trip, domino ends the comment at <!-->, so the rest is processed as text and gets nowiki-escaped.
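
To illustrate why the comment ends so early, here is a rough sketch of where the HTML5 tokenizer closes a comment. htmlCommentEnd is a hypothetical function written for this example, not domino's or Parsoid's code, and it covers only the closing rules relevant here: the abrupt closings "<!-->" and "<!--->", and the terminators "-->" and "--!>".

```javascript
// Hypothetical sketch of the HTML5 comment-closing rules.
// Given html and the index of a "<!--", returns the index just past the
// end of that comment, or -1 if no comment starts at that index.
function htmlCommentEnd(html, start) {
  if (html.slice(start, start + 4) !== '<!--') {
    return -1;
  }
  const bodyStart = start + 4;
  // Abrupt closing: "<!-->" and "<!--->" are complete, empty comments.
  if (html.charAt(bodyStart) === '>') {
    return bodyStart + 1;
  }
  if (html.slice(bodyStart, bodyStart + 2) === '->') {
    return bodyStart + 2;
  }
  // Otherwise the comment runs to the first "-->" or "--!>".
  const ends = [];
  const normal = html.indexOf('-->', bodyStart);
  const bang = html.indexOf('--!>', bodyStart);
  if (normal !== -1) { ends.push(normal + 3); }
  if (bang !== -1) { ends.push(bang + 4); }
  // An unterminated comment extends to the end of the input.
  return ends.length ? Math.min.apply(null, ends) : html.length;
}

const input = '<!-->{{hi}}<-->';
const end = htmlCommentEnd(input, 0);
console.log(end);              // 5
console.log(input.slice(end)); // "{{hi}}<-->", which is then treated as text
```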

Change 201655 had a related patch set uploaded (by Arlolra):
WIP: All kinds of comment fun

https://gerrit.wikimedia.org/r/201655

Change 201655 merged by jenkins-bot:
Normalize comments

https://gerrit.wikimedia.org/r/201655