
Trimmed multibyte characters result in invalid XML
Closed, Resolved (Public)


Author: bdanee88

I've just started writing a statistics program for the Hungarian Wikipedia. While downloading the deletion log from January 2008, my program threw an exception: the XML returned by the API was badly encoded. I wondered why, so I checked, and there really is an error:

In the 'item' element with logid 142820, the comment ends with an unknown character. It was probably a two-byte UTF-8 character that has been cut off. The problem is not that serious for me, since I don't need the comment attribute and can drop it by using &leprop= in the URL, but anyone who does need it won't be able to load the file.

The bad line (see also the link):
<item logid="142820" pageid="0" ns="0" title="Borisz Szpasszkij" type="delete" action="delete" user="Bináris" timestamp="2008-01-25T21:19:30Z" comment="[[Wikipédia:Homokozó|teszt]]: a lap tartalma: „Boris Vasilievich Spassky [szerkesztés] A Wikipédiából, a szabad lexikonból. Ugrás: <small>NAVIGÁCIÓ</small>, <small>KERESÉS</small> Boris V Spassky () szovjet később francia...” (és csak �"/>
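
As a rough illustration of how a comment can end up like this, here is a minimal Python sketch of byte-based truncation. It is not MediaWiki's code: the sample text, the byte limit, and both function names are assumptions chosen only to show a cut landing in the middle of a multibyte character, and one way to back off to the last complete character.

```python
def truncate_bytes_naive(text: str, limit: int) -> bytes:
    """Cut the UTF-8 encoding at a byte limit, ignoring character boundaries."""
    return text.encode("utf-8")[:limit]


def truncate_bytes_safe(text: str, limit: int) -> bytes:
    """Cut at a byte limit, then drop any incomplete trailing character."""
    cut = text.encode("utf-8")[:limit]
    # The input was valid before the cut, so only the tail can be broken;
    # errors="ignore" discards the dangling lead byte(s) left there.
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


# Hypothetical sample: the opening quote „ takes three bytes (E2 80 9E).
comment = "(és csak „teszt”"
limit = 11  # hypothetical limit that falls inside the „ character

naive = truncate_bytes_naive(comment, limit)   # ends with a lone b"\xe2"
safe = truncate_bytes_safe(comment, limit)     # ends at the last whole character

try:
    naive.decode("utf-8")  # a strict decoder rejects the truncated bytes
except UnicodeDecodeError as error:
    print("naive truncation is not valid UTF-8:", error)

print(safe.decode("utf-8"))
```

A strict XML parser does the equivalent of the failing decode above, which is why the whole response is rejected rather than just the one attribute.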

Version: unspecified
Severity: normal



Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:19 PM)
bzimport set Reference to bz15261.

I don't see the problem. I opened the link in Firefox (which automatically parses XML and screams if there's something wrong with it), and I got no errors. I also confirmed that logid 142820 is in there. That means it's probably your XML parser's fault; closing as WORKSFORME.

"Sorry, I am unable to validate this document because on line 44 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xE2" does not map to Unicode"

Many XML parsers choke on broken UTF-8 sequences. Of course, this is mostly a database problem, but the fact remains that the API returns ill-formed data.

*** Bug 16101 has been marked as a duplicate of this bug. ***

Should be fixed in r45749: invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD).
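
For readers unfamiliar with the approach, the idea of substituting U+FFFD for undecodable bytes can be sketched in Python as follows. This only illustrates the general technique, not the contents of r45749; the function name and the sample bytes are assumptions.

```python
def replace_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, turning undecodable sequences into U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical input: valid text followed by a lone lead byte, like the
# "\xE2" reported by the validator.
broken = "csak ".encode("utf-8") + b"\xe2"

fixed = replace_invalid_utf8(broken)

# The result can be re-encoded as well-formed UTF-8 for the XML output.
assert fixed.endswith("\ufffd")
assert fixed.encode("utf-8").decode("utf-8") == fixed
```

The replacement keeps the output well-formed while still signalling to the reader that something was lost at that position.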

(In reply to comment #7)

Not fixed:
still outputs invalid UTF-8.

Argh, array_walk_recursive() doesn't work the way I expected it to. Fixed in r47090.
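
The thread does not say what went wrong with array_walk_recursive(), so the following is only a Python sketch of the general task: applying the same cleanup to every string in a nested result before it is serialized. The function name and the shape of the sample data are assumptions for illustration.

```python
def clean_value(value):
    """Return a copy of a nested structure with all byte strings made valid."""
    if isinstance(value, bytes):
        # Undecodable sequences become U+FFFD instead of breaking the output.
        return value.decode("utf-8", errors="replace")
    if isinstance(value, dict):
        return {key: clean_value(item) for key, item in value.items()}
    if isinstance(value, list):
        return [clean_value(item) for item in value]
    return value  # numbers, None, and already-decoded strings pass through


# Hypothetical result structure containing one truncated comment.
result = {
    "logevents": [
        {"logid": 142820, "comment": b"csak \xe2"},
    ],
}

cleaned = clean_value(result)
assert cleaned["logevents"][0]["comment"] == "csak \ufffd"
```

Building and returning a new structure, rather than modifying values inside a callback, avoids depending on whether the walker passes elements by value or by reference.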