
XML contains invalid character from page title
Closed, Invalid
Public

Description

Some characters permitted in page titles are invalid XML characters, but titles are embedded directly in XML attributes. For example, the article "Wolfson Children’s Hospital" on English Wikipedia is included in this contributions listing:

http://en.wikipedia.org/w/api.php?action=query&format=xml&list=usercontribs&ucuser=Mgreason&uclimit=500&ucdir=newer&ucprop=ids|title&ucstart=2008-07-16T20:03:47Z

A strict XML parser will reject this XML because the character "’" is invalid. Titles containing special characters need to be escaped when they are written into attributes.


Version: unspecified
Severity: normal
OS: Windows XP
Platform: PC
URL: http://en.wikipedia.org/w/api.php?action=query&format=xml&list=usercontribs&ucuser=Mgreason&uclimit=500&ucdir=newer&ucprop=ids|title&ucstart=2008-07-16T20:03:47Z

Details

Reference
bz17836

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:36 PM
bzimport set Reference to bz17836.

Never mind, I checked and 0x2019 is a valid XML character. I believe I'm doing something wrong in my API client.
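For anyone who wants to repeat that check, here is a minimal sketch that tests a code point against the `Char` production of XML 1.0. The helper name `is_xml10_char` is made up for illustration and is not part of the MediaWiki API or of any library.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch matches the XML 1.0 Char production.

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# U+2019 (right single quotation mark) falls inside #x20-#xD7FF.
print(is_xml10_char("\u2019"))  # True
# A control character such as U+000B (vertical tab) is not allowed.
print(is_xml10_char("\x0b"))    # False
```

Since U+2019 is a legal character and is not one of the attribute delimiters (`"` or `'`), it needs no escaping inside an attribute value.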

Okay, sorry to second-guess myself, but I'm pretty sure this is because my XML parser thinks the curly apostrophe is closing the attribute value. I don't know whether the problem is on the XML generator side or the XML parser side, but I'm reopening in any case.

Reclosing as INVALID. Both Firefox's XML parser and the W3C validator say that the URL you mentioned is a valid XML document.

Sorry, my mistake, there was a problem with the encoding I was using to read it. After resolving this there was no problem.
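The exact client code is not shown in this report, so the following is only a sketch of how a wrong read encoding can corrupt the title. It assumes the response is UTF-8 and that the client decoded it as Windows-1252. The `<item>` fragment is a made-up stand-in, not the actual API output.

```python
import xml.etree.ElementTree as ET

title = "Wolfson Children\u2019s Hospital"

# Hypothetical stand-in for the API response, encoded as UTF-8.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    f'<item title="{title}" />'
).encode("utf-8")

# Handing the raw bytes to the parser lets it honor the declared encoding.
parsed_title = ET.fromstring(xml_bytes).get("title")
print(parsed_title == title)  # True

# Decoding the same bytes with the wrong codec (assumed here: cp1252)
# turns the three UTF-8 bytes of U+2019 into three unrelated characters.
misread = xml_bytes.decode("cp1252")
print("Children\u00e2\u20ac\u2122s" in misread)  # True
```

Passing bytes rather than a pre-decoded string to the parser avoids guessing the encoding in client code at all.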