
Parsoid will faithfully output characters from mwentity entries that are illegal in XML, such as backspace (0x08)
Closed, Duplicate · Public

Description

$ echo "Hello wo&#x08;ld" | bin/parse.js | tee testfile
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"><head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/"><meta charset="utf-8"/><meta property="mw:pageNamespace" content="0"/><meta property="isMainPage" content="true"/><meta property="mw:html:version" content="1.3.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/Main%20Page"/><title></title><base href="//en.wikipedia.org/wiki/"/><link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.content|mediawiki.skinning.content.parsoid|mediawiki.skinning.elements|mediawiki.skinning.interface|site|skins.vector.styles|ext.cite.style|mediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/></head><body data-parsoid='{"dsr":[0,17,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki" dir="ltr"><p data-parsoid='{"dsr":[0,16,0,0]}'>Hello wo<span typeof="mw:Entity" data-parsoid='{"src":"&amp;#x08;","srcContent":"\b","dsr":[8,14,null,null]}'></span>ld</p>

hexedit shows a literal \b (0x08) inside the span, right after the opening tag:

00000420 5D 7D 27 3E 08 3C 2F 73 70 61 6E 3E 6C 64 3C 2F ]}'>.</span>ld</

This is bad, because XML parsers barf on this character (0x08 is outside the set of characters XML 1.0 allows), and Parsoid's output is supposed to be parseable as XML.
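
To reproduce the failure without Parsoid, here is a minimal sketch in Python. It is not Parsoid code: the helper names `find_illegal_xml_chars` and `parses_as_xml` and the shortened `fragment` are made up for illustration. It checks a string against the XML 1.0 `Char` production and feeds the same string to the standard library's expat-based parser.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (offset, hex code point) for each character XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group())))
            for m in _ILLEGAL_XML_CHARS.finditer(text)]


def parses_as_xml(text):
    """Return True if the standard library XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical, shortened stand-in for the Parsoid output above.
# "\b" here is a literal backspace (0x08), as seen in the hex dump.
fragment = '<p>Hello wo<span typeof="mw:Entity">\b</span>ld</p>'

print(find_illegal_xml_chars(fragment))
print(parses_as_xml(fragment))
print(parses_as_xml(fragment.replace("\b", "")))
```

The fragment fails to parse while the backspace is present and parses once it is removed. Dropping the character is only used here to show the difference; it is not a proposal for how Parsoid should handle such entities.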

Event Timeline

Catrope created this task. Jun 13 2017, 6:01 PM
Restricted Application added a subscriber: Aklapper. Jun 13 2017, 6:01 PM
Jdforrester-WMF renamed this task from "Parsoid can output characters that are illegal in XML, such as \b (0x08)" to "Parsoid will faithfully output characters from mwentity entries that are illegal in XML, such as backspace (0x08)". Jun 13 2017, 10:11 PM
Jdforrester-WMF updated the task description.