XML Error when trying to import wikipedia data using mwdumper
Closed, Declined | Public | Bug Report

Description

Hi, my setup is as follows:

Ubuntu
wikipedia xml dump file: enwiki-20170501-pages-articles-multistream.xml.bz2
java version "1.6.0_41" (also tried 1.7 with same results)
mwdumper downloaded here: https://dumps.wikimedia.org/tools/mwdumper.jar

command: java -jar mwdumper.jar --format=sql:1.5 enwiki-20170501-pages-articles-multistream.xml.bz2 | mysql -u root -p wiki

output:
Exception in thread "main" java.io.IOException: XML document structures must start and end within the same entity.

at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:92)
at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

Caused by: org.xml.sax.SAXParseException; lineNumber: 45; columnNumber: 1; XML document structures must start and end within the same entity.

at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:392)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
... 1 more
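
A note on one possible cause, which is a hypothesis and is not confirmed anywhere in this task: the input is a multistream dump, i.e. many bzip2 streams concatenated into one file. A decoder that stops after the first stream would hand the XML parser only the opening part of the document, which would produce exactly this kind of "must start and end within the same entity" error at a low line number. The Python sketch below illustrates the effect on toy data only; `read_first_stream` and `read_all_streams` are hypothetical helper names, and nothing here inspects what mwdumper actually does.

```python
import bz2
from typing import Tuple


def read_first_stream(raw: bytes) -> Tuple[bytes, bytes]:
    """Decompress only the first bzip2 stream in `raw`.

    Returns (decompressed bytes of stream 1, leftover compressed bytes).
    This mimics a decoder that is unaware of concatenated streams.
    """
    decoder = bz2.BZ2Decompressor()
    first = decoder.decompress(raw)
    # After end-of-stream, any following streams are left in unused_data.
    return first, decoder.unused_data


def read_all_streams(raw: bytes) -> bytes:
    """Decompress every concatenated stream; bz2.decompress handles this."""
    return bz2.decompress(raw)


if __name__ == "__main__":
    # Toy stand-in for a multistream dump: header and body compressed separately.
    header = b"<mediawiki>\n<siteinfo/>\n"
    body = b"<page/>\n</mediawiki>\n"
    raw = bz2.compress(header) + bz2.compress(body)

    first, leftover = read_first_stream(raw)
    print("first stream only:", first)           # truncated, not well-formed XML
    print("leftover compressed bytes:", len(leftover))
    print("all streams:", read_all_streams(raw))  # complete document
```

If this is what is happening, decompressing the file with a multistream-aware tool first (Python's `bz2.open` reads concatenated streams) and feeding the plain XML to the importer would be one way to test the hypothesis.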

Event Timeline

Aklapper changed the subtype of this task from "Task" to "Bug Report". (Feb 6 2022, 5:56 PM)
hashar subscribed.

mwdumper can no longer process dumps generated since MediaWiki 1.31 (released in June 2018). The tool was started in 2005 and is no longer maintained, so it is being archived; see T351228 for reference.