Exception when trying to convert wikidata dump
Closed, Duplicate (Public)

Description

When I use mwdumper to convert a Wikidata dump, I get this:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
	at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
	at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
	at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

Command line: java -jar ~/mwdumper-1.16.jar --format=sql:1.5 /public/dumps/public/wikidatawiki/20141106/wikidatawiki-20141106-pages-articles.xml.bz2

Event Timeline

Smalyshev raised the priority of this task to Needs Triage.
Smalyshev updated the task description.
Smalyshev added a project: Utilities-mwdumper.
Smalyshev subscribed.
Aklapper triaged this task as Medium priority. (Mar 2 2015, 7:46 AM)
Aklapper changed the task status from Open to Stalled. (Apr 23 2016, 9:05 AM)

@Smalyshev: Is there any "You have an error in your SQL syntax" output too?
All stack traces look like this, so that specific line of output would be very helpful for fixing the schema support.

An array-out-of-bounds error while reading UTF-8 sounds familiar; I will do a little testing and research.

brion changed the task status from Stalled to Open. (Apr 23 2016, 10:09 AM)

This appears to be an old bug in the Xerces XML parser library, which is supposedly fixed in more recent versions. I'll try updating the local copy.

(Essence of the bug: a UTF-8 character whose bytes spanned a buffer boundary would sometimes cause an error, because the bytes-to-string conversion wanted more bytes than were available in the buffer.)
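To illustrate the boundary problem described above, here is a minimal sketch of chunked UTF-8 decoding that carries an incomplete trailing byte sequence over to the next chunk instead of reading past the end of the buffer. It is not the Xerces code; the class name `ChunkedUtf8` and the method `decodeInChunks` are made up for this example, and it uses the standard `java.nio.charset.CharsetDecoder` rather than a hand-written reader.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from Xerces or mwdumper.
final class ChunkedUtf8 {

    /** Decodes UTF-8 bytes chunkSize bytes at a time, as a buffered reader would. */
    static String decodeInChunks(byte[] data, int chunkSize) {
        if (data.length == 0) {
            return "";
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Up to 3 bytes of an unfinished sequence can be carried over, hence the slack.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);
        CharBuffer out = CharBuffer.allocate(chunkSize + 4);
        StringBuilder result = new StringBuilder();

        int pos = 0;
        while (pos < data.length) {
            int n = Math.min(chunkSize, data.length - pos);
            in.put(data, pos, n);
            pos += n;
            in.flip();
            boolean endOfInput = (pos == data.length);
            // With endOfInput == false, an incomplete trailing sequence is left
            // unread in the buffer instead of being decoded past its end.
            decoder.decode(in, out, endOfInput);
            out.flip();
            result.append(out);
            out.clear();
            in.compact(); // keep the undecoded bytes for the next round
        }
        decoder.flush(out);
        out.flip();
        result.append(out);
        return result.toString();
    }

    public static void main(String[] args) {
        String original = "na\u00efve \u20ac";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 3 splits the two-byte and three-byte characters across chunks.
        System.out.println(original.equals(decodeInChunks(bytes, 3)));
    }
}
```

The key step is `in.compact()`: whatever the decoder could not consume stays at the front of the buffer, so the next chunk completes the character.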

Change 285004 had a related patch set uploaded (by Brion VIBBER):
Update Xerxes to 2.11.0

https://gerrit.wikimedia.org/r/285004

That changeset updates Xerces and *should* fix the bug, but I should add a test case before merging.
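A rough sketch of what such a test case might look like follows. It is an assumption, not the test attached to the changeset: it guesses from the `2048` in the exception message that the relevant buffer is 2048 bytes long, and it runs against whichever SAX parser `SAXParserFactory` picks up, which may not be the bundled Xerces. The names `BoundaryStraddleCheck`, `buildDocument`, and `parseText` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical regression check; the 2048-byte boundary is an assumption.
final class BoundaryStraddleCheck {

    static final String PROLOG = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>";

    /** Builds a document whose three-byte euro sign starts at the given byte offset. */
    static byte[] buildDocument(int byteOffset) {
        StringBuilder sb = new StringBuilder(PROLOG);
        // Everything so far is ASCII, so character count equals byte count.
        while (sb.length() < byteOffset) {
            sb.append('x');
        }
        sb.append('\u20AC');
        sb.append("</t>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the document with SAX and returns all character data. */
    static String parseText(byte[] xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Start the euro sign 1 or 2 bytes before the assumed boundary so it straddles it.
        for (int offset = 2046; offset <= 2047; offset++) {
            String text = parseText(buildDocument(offset));
            if (!text.endsWith("\u20AC")) {
                throw new AssertionError("character lost at offset " + offset);
            }
        }
        System.out.println("ok");
    }
}
```

Because the real input is a bz2-compressed stream, the actual buffer boundaries in mwdumper may fall elsewhere, so a fuller test would probably sweep a wider range of offsets.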