Author: triddle
Description:
Hello,
The most recent dump file (20050909_pages_current.xml.gz) causes the Perl Expat module to abort early with the following
message: reference to invalid character number at line 273541, column 5, byte 24690464.
The data starting at that byte is:
�昏]]</text>
</revision>
</page>
<page>
<title>Barbara Olson</title>
<id>4195</id>
<revision>
<id>20104101</id>
<timestamp>2005-08-02T08:47:35Z</timestamp>
<contributor>
<username>TMC1982</username>
Unfortunately, this error causes Expat to throw an exception, after which no further processing is possible. It should be
possible to analyze the dump file, remove the erroneous entries, and restart processing, but I can't help feeling that the
dump process should not let this happen.
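As a rough illustration of the "analyze and remove" workaround, here is a minimal Python sketch. It assumes the offending data is a numeric character reference (such as `&#xFFFE;`) that names a code point outside the XML 1.0 `Char` range, which is what Expat's "reference to invalid character number" error normally indicates. The function names (`find_invalid_refs`, `scrub_invalid_refs`, `scrub_dump`) and the choice of U+FFFD as the replacement are hypothetical and not part of any existing tool.

```python
import gzip
import re

# Numeric character references: decimal (&#65;) or hexadecimal (&#x41;).
CHAR_REF = re.compile(rb"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml_char(cp):
    """Return True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def _code_point(match):
    """Decode the code point named by a CHAR_REF match."""
    hex_digits, dec_digits = match.group(1), match.group(2)
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


def find_invalid_refs(data):
    """Return (byte_offset, reference) pairs for invalid references in data."""
    return [
        (m.start(), m.group(0))
        for m in CHAR_REF.finditer(data)
        if not is_valid_xml_char(_code_point(m))
    ]


def scrub_invalid_refs(data, replacement=b"&#xFFFD;"):
    """Replace each invalid reference with `replacement` (assumed: U+FFFD)."""
    def fix(m):
        return m.group(0) if is_valid_xml_char(_code_point(m)) else replacement
    return CHAR_REF.sub(fix, data)


def scrub_dump(src_path, dst_path):
    """Copy a gzipped dump line by line, scrubbing invalid references.

    A reference never contains a newline, so line-wise processing is safe
    and avoids loading the whole dump into memory.
    """
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for line in src:
            dst.write(scrub_invalid_refs(line))
```

Calling `scrub_dump("20050909_pages_current.xml.gz", "scrubbed.xml.gz")` (the output name is arbitrary) would produce a copy that a strict parser should get past, provided the assumption about the cause holds. This only hides the symptom on the consumer side; it does not address how the invalid reference got into the dump.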
This can be verified with Parse::MediaWikiDump, available via CPAN.
Tyler Riddle
Version: 1.6.x
Severity: blocker
URL: http://mail.wikipedia.org/pipermail/wikitech-l/2005-September/031514.html