
Parsoid will faithfully output characters from mwentity entries that are illegal in XML, such as backspace (0x08)
Closed, Duplicate (Public)

Description

$ echo "Hello wo&#x08;ld" | bin/parse.js | tee testfile
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/"><head prefix="mwr: http://en.wikipedia.org/wiki/Special:Redirect/"><meta charset="utf-8"/><meta property="mw:pageNamespace" content="0"/><meta property="isMainPage" content="true"/><meta property="mw:html:version" content="1.3.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/Main%20Page"/><title></title><base href="//en.wikipedia.org/wiki/"/><link rel="stylesheet" href="//en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint,shared|mediawiki.skinning.content|mediawiki.skinning.content.parsoid|mediawiki.skinning.elements|mediawiki.skinning.interface|site|skins.vector.styles|ext.cite.style|mediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/></head><body data-parsoid='{"dsr":[0,17,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki" dir="ltr"><p data-parsoid='{"dsr":[0,16,0,0]}'>Hello wo<span typeof="mw:Entity" data-parsoid='{"src":"&amp;#x08;","srcContent":"\b","dsr":[8,14,null,null]}'></span>ld</p>

hexedit shows there's a literal \b (0x08) inside the span, right before the closing </span> tag:

00000420 5D 7D 27 3E 08 3C 2F 73 70 61 6E 3E 6C 64 3C 2F ]}'>.</span>ld</

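This is bad, because XML parsers barf on this character (U+0008 is outside the XML 1.0 Char range, and it is not allowed even as a numeric character reference), and Parsoid's output is supposed to be parseable as XML.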
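
To make the failure concrete, here is a rough Python sketch (not Parsoid code) that scans a string for characters outside the XML 1.0 Char production and confirms that an expat-based parser rejects a fragment shaped like the one above. The names XML10_ILLEGAL, find_illegal_xml_chars and is_well_formed are made up for this illustration, and the fragment is a trimmed stand-in for the real output, not the full document.

```python
import re
import xml.etree.ElementTree as ET

# Anything NOT in the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (offset, codepoint) pairs for characters XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in XML10_ILLEGAL.finditer(text)]


def is_well_formed(markup):
    """True if the stdlib (expat-based) parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Trimmed stand-in for the output above: a literal backspace inside the span.
fragment = '<p>Hello wo<span typeof="mw:Entity">\x08</span>ld</p>'

for offset, codepoint in find_illegal_xml_chars(fragment):
    print("illegal character U+%04X at offset %d" % (codepoint, offset))

print("literal 0x08 well-formed?", is_well_formed(fragment))
print("&#x08; reference well-formed?", is_well_formed("<p>&#x08;</p>"))
```

Both checks should report the markup as not well-formed, so simply re-escaping the character as &#x08; on output would not help; the character has to be dropped or represented some other way.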