
20050909_pages_current.xml.gz causes XML::Parser::Expat to abort processing
Closed, Resolved (Public)

Description

Author: triddle

Description:
Hello,

The most recent dump file (20050909_pages_current.xml.gz) causes the Perl Expat module to abort early with the following
message: reference to invalid character number at line 273541, column 5, byte 24690464.

The data starting at that byte is:

&#xD801;昏]]</text>

  </revision>
</page>
<page>
  <title>Barbara Olson</title>
  <id>4195</id>
  <revision>
    <id>20104101</id>
    <timestamp>2005-08-02T08:47:35Z</timestamp>
    <contributor>
      <username>TMC1982</username>

Unfortunately this error causes Expat to throw an exception and no more processing is possible. It should be possible to
analyze the dump file, remove erroneous entries, and restart processing, but I can't help but feel the dump process should
not let this happen.
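
The "remove erroneous entries, and restart processing" step can be sketched roughly as follows. This is a minimal Python illustration, not part of Parse::MediaWikiDump or of the dump tooling: the names `is_xml_char`, `scrub_line`, and `scrub_dump` are hypothetical, and replacing each bad reference with `&#xFFFD;` is an assumed policy (deleting it would work as well). The code point ranges follow the XML 1.0 `Char` production, which excludes surrogates such as U+D801. The sketch rewrites every matching reference it sees, including any inside a CDATA section, where such text would not actually be a reference.

```python
import gzip
import re

# Numeric character references: decimal (&#1234;) or hexadecimal (&#x4D2;).
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_xml_char(cp: int) -> bool:
    """Return True if cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def scrub_line(line: str, replacement: str = "&#xFFFD;") -> str:
    """Replace references to invalid code points; leave valid ones untouched."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if is_xml_char(cp) else replacement

    return CHAR_REF.sub(fix, line)


def scrub_dump(src_path: str, dst_path: str) -> None:
    """Stream a gzipped dump line by line, writing a scrubbed gzipped copy."""
    # newline="" keeps the original line endings unchanged.
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(scrub_line(line))
```

Calling `scrub_dump("20050909_pages_current.xml.gz", "scrubbed.xml.gz")` (the output name is arbitrary) would produce a copy that a conforming parser should no longer reject for this particular reason. Streaming line by line keeps memory use flat on a large dump.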

This can be verified with Parse::MediaWikiDump available via CPAN.

Tyler Riddle


Version: 1.6.x
Severity: blocker
URL: http://mail.wikipedia.org/pipermail/wikitech-l/2005-September/031514.html

Details

Reference
bz3473

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 8:48 PM)
bzimport set Reference to bz3473.

See the URL: a thread opened by Jakob Voss that describes the problem
and also gives a way to fix it manually.

Assigning the bug to Brion, as he is currently writing mwdumper.

from wikitech-l:

Brion Vibber wrote:

Now filed as http://bugzilla.ximian.com/show_bug.cgi?id=76095
Will see about fixing...

Have submitted a patch. The next dump should be correct.

Patch accepted into Mono subversion repository. These guys are fast. :)

-- brion vibber (brion @ pobox.com)

As above, already fixed. Next dump will be correct. (You can filter this dump if you
need to.)

*** Bug 3478 has been marked as a duplicate of this bug. ***

plugwash wrote:

What are entities doing in the dump in the first place?

Surely literal Unicode in the wikitext should become literal Unicode in the dump,
and entities in the wikitext should be escaped when put into the dump so they
won't be changed to literal Unicode by the XML parser.

See the link to the Mono bug above.
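
The escaping question raised above can be illustrated with a small Python sketch. It uses only the standard library (`xml.sax.saxutils.escape` and `xml.etree.ElementTree`) and is not the code that produces the dumps. The `wikitext` sample and the `parses` helper are made up for the illustration. It shows that an entity-like string written unescaped makes the document unparseable, while the same string with its ampersand escaped survives a round trip as literal text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical wikitext containing an entity-like string and a literal character.
wikitext = "&#xD801; and a literal 昏"


def parses(xml_text: str) -> bool:
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Written as-is, the reference to U+D801 is rejected by the parser.
raw = "<text>" + wikitext + "</text>"

# With & escaped to &amp;, the parser hands back the original string.
escaped = "<text>" + escape(wikitext) + "</text>"
round_tripped = ET.fromstring(escaped).text

print(parses(raw))
print(round_tripped == wikitext)
```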